Airflow is a data orchestration and scheduling platform; in layman’s terms, it is a tool to manage your data flows and data operations. It enables better management of what would otherwise have been handled through cron jobs. Airflow revolves around the concept of directed acyclic graphs (DAGs): collections of tasks organized in a directed manner that captures their dependencies.
Airflow offers a management interface showing the status of every DAG run: whether it succeeded, failed, is still running or is stuck in a retry loop.
It is possible to dive into the status of the individual tasks within a DAG. The example above, for instance, shows tasks pulling sponsored-products data from Amazon’s Ads API for a few European marketplaces; each marketplace has its own set of tasks run periodically. Airflow also provides the possibility to be alerted on failure or on a missed SLA.
The easiest way to set up Airflow is through one of the Docker images from puckel (puckel/docker-airflow); the docker-compose YAML files shipped with it can be used to spin up an environment.
With docker-compose, setting up a development environment is as easy as running “docker-compose up”.
Airflow allows for a choice of executor, Local or Celery, and there are some limitations in terms of what a Local executor can do.
The Celery executor allows tasks to be dispatched across multiple “workers”, instances meant to process the different DAGs and tasks.
The Local executor, on the other hand, provides an integrated solution that lets you run the different Airflow components (web server, scheduler, worker, …) on a single instance.
It is possible to set up Airflow on AWS based on the docker-compose files mentioned before and Amazon Elastic Container Service (ECS). In essence, what is needed is to run container instances of puckel/docker-airflow with different environment variables and a different command for each component.
Rather than using the default Redis and Postgres images provided within puckel’s docker-compose file, it is better to use Amazon’s managed services: ElastiCache for Redis and RDS for Postgres.
On Azure, it is possible to host your own version of the Airflow container and launch it as part of a Container Instance, an App Service or within Azure Kubernetes Service.
The setup of DAGs in Airflow is built around three main concepts: operators, sensors and dependencies. Together they allow sets of tasks, along with their relationships and interdependencies, to be built programmatically.
Operators are wrappers around the specific code of the tasks you wish to execute. They can wrap plain code in different languages such as Python or PHP, or execution steps such as fetching data from an FTP server or moving files to a data store such as HDFS.
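As a minimal sketch (assuming Airflow 1.10-style import paths, as shipped in the puckel images, and a hypothetical DAG name), a BashOperator and a PythonOperator could be declared like this:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator


def pull_sponsored_products(**context):
    # hypothetical placeholder for the actual extraction logic
    print("pulling data from the Ads API...")


dag = DAG(
    dag_id="ads_api_example",          # hypothetical DAG name
    start_date=datetime(2019, 1, 1),
    schedule_interval="@daily",
)

# wraps a shell command
fetch_from_ftp = BashOperator(
    task_id="fetch_from_ftp",
    bash_command="echo 'fetching files from the FTP server'",
    dag=dag,
)

# wraps plain Python code
process = PythonOperator(
    task_id="process",
    python_callable=pull_sponsored_products,
    provide_context=True,   # Airflow 1.x flag; implicit in Airflow 2
    dag=dag,
)
```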
Sensors are a specific type of operator whose role is to check whether certain conditions have been met: that a file has been placed in an FTP folder, that a partition in a database has been created, and so on.
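Continuing the sketch above, a FileSensor (here assuming the Airflow 1.10 contrib import path and a hypothetical landing folder) would hold back downstream tasks until the expected file shows up:

```python
from airflow.contrib.sensors.file_sensor import FileSensor  # airflow.sensors.filesystem in Airflow 2

wait_for_export = FileSensor(
    task_id="wait_for_export",
    fs_conn_id="fs_default",         # connection pointing at the landing directory
    filepath="incoming/export.csv",  # hypothetical file dropped by the upstream system
    poke_interval=5 * 60,            # re-check every 5 minutes
    timeout=6 * 60 * 60,             # fail the task after 6 hours of waiting
    dag=dag,
)
```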
Airflow allows dependencies to be defined between tasks, so that a task only executes once its upstream dependencies have been met. This is done through the set_upstream and set_downstream functions, or through the bitshift operators (<< and >>).
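For the two tasks sketched earlier, all of the following lines declare the same relationship (wait_for_export runs before process):

```python
# explicit functions
wait_for_export.set_downstream(process)
process.set_upstream(wait_for_export)

# bitshift operators, equivalent and more common
wait_for_export >> process
process << wait_for_export
```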
Airflow also allows for the management of what should happen when a dependency is not fully met. This is done by setting a trigger rule (all_success, all_failed, all_done, one_success, …) on the operator.
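For example, a hypothetical cleanup task could be set to run once all of its upstream tasks have finished, regardless of whether they succeeded:

```python
from airflow.operators.dummy_operator import DummyOperator
from airflow.utils.trigger_rule import TriggerRule

cleanup = DummyOperator(
    task_id="cleanup",
    trigger_rule=TriggerRule.ALL_DONE,  # run when upstream tasks are done, failed or not
    dag=dag,
)

[fetch_from_ftp, process] >> cleanup
```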
Airflow provides tools that make it easier to manage data flows and data-processing steps in an integrated manner. It is built with a heavy engineering mindset: pipeline definitions are written as code, which makes it possible to generate pipelines programmatically.
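As a sketch of what that looks like in practice (the marketplace list and task naming here are hypothetical), a simple loop is enough to stamp out one task per marketplace, much like the Ads API example mentioned earlier:

```python
MARKETPLACES = ["de", "fr", "it", "es", "uk"]  # hypothetical list of marketplaces

for marketplace in MARKETPLACES:
    PythonOperator(
        task_id="pull_sponsored_products_{}".format(marketplace),
        python_callable=pull_sponsored_products,
        op_kwargs={"marketplace": marketplace},
        dag=dag,
    )
```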
A set of Docker containers exists to make it easy to set up, both as a development environment and on the cloud. Creating DAGs for data-flow and processing purposes does require some experience and knowledge, but it is fairly easy to pick up and start developing on.