
How to use Variables and XCom in Apache Airflow?
It is said that Apache Airflow is CRON on steroids. It is gaining popularity among ETL orchestration tools (for scheduling, managing, and monitoring tasks). Tasks are defined as a Directed Acyclic Graph (DAG), within which they exchange information. In this post you will learn how to use Variables and XCom in Apache Airflow.
The Environment
For Apache Airflow, the puckel/docker-airflow image works well. I most often use the docker-compose-LocalExecutor.yml variant:
sudo docker-compose -f docker-compose-LocalExecutor.yml up -d
Why Variables and XCom?
Variables and XCom are two mechanisms for storing and passing values within the Apache Airflow environment.
Variables act as global variables. If a value is used by many DAGs (and you don’t want to edit N files whenever it changes), consider storing it in Variables.
XComs (short for cross-communication) are messages that let tasks exchange data. Each one consists of a key, a value, a timestamp, and the task/DAG IDs. Any object that can be pickled can be used as an XCom.
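The push/pull pattern can be sketched with two PythonOperator callables (a sketch, assuming Airflow 2.x; the task IDs, key, and path are illustrative):

```python
# Sketch: exchanging a value between two tasks via XCom.
# In Airflow 2.x, a callable with a "ti" parameter receives the
# TaskInstance from the template context automatically.

def produce_path(ti):
    # Store the value under an explicit key for this task instance.
    ti.xcom_push(key="output_path", value="/data/output/2020-01-01")


def consume_path(ti):
    # Read the value pushed by the upstream task.
    path = ti.xcom_pull(task_ids="produce_path", key="output_path")
    print(f"Working on {path}")
```

A PythonOperator’s return value is also pushed to XCom automatically under the key `return_value`, so explicit `xcom_push` is only needed for custom keys.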
Let’s say you wrote an application in Apache Spark that saves its results to a directory on HDFS/S3. The path must be passed as an argument to this application, and it is generated by a Python script your teammate wrote. Later on, subsequent tasks download this data and do something with it. A parameter with a path circulates everywhere. This is exactly what XCom is for 😁.
WARNING! Do not use XCom to send large amounts of data! XCom values are stored in the database used by Apache Airflow. Passing giant NumPy objects around is a common misuse of XCom; pass a path or reference to the data instead.
Example
The DAG’s tasks are simple:
- Fetch a value from Variables (generating it if it does not exist)
- Derive another value from it and push it to XCom
- Increment the Variables value and save it back
- Get the current date with BashOperator and push it to XCom
- Display both values in the console on a remote machine using SSHOperator