
Apache Airflow

Apache Airflow Notes

Apache Airflow: a batch-oriented framework for building data pipelines.

Its key feature is that it enables you to easily build scheduled data pipelines using Python.

Airflow is not a data processing tool in itself, but orchestrates the different components responsible for processing your data in data pipelines.

Say we want to implement the following:

  1. Fetch weather forecast data from a weather API.
  2. Clean or otherwise transform the fetched data.
  3. Push the transformed data to the weather dashboard.
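
The three steps above can be sketched as plain Python functions. This is only an illustration: the function names, the record fields, and the hard-coded sample data are hypothetical stand-ins, and no real weather API or dashboard is involved.

```python
# Hypothetical stand-ins for the three pipeline steps; no real API is called.

def fetch_forecast():
    """Step 1: pretend to fetch raw forecast records from a weather API."""
    return [
        {"city": "Amsterdam", "temp_c": "12.5"},
        {"city": "Berlin", "temp_c": None},  # missing reading
        {"city": "Paris", "temp_c": "15.0"},
    ]


def clean_forecast(records):
    """Step 2: drop records without a temperature and convert strings to floats."""
    return [
        {"city": r["city"], "temp_c": float(r["temp_c"])}
        for r in records
        if r["temp_c"] is not None
    ]


def push_to_dashboard(records, dashboard):
    """Step 3: 'push' the cleaned records; here the dashboard is just a list."""
    dashboard.extend(records)
    return len(records)


dashboard = []
pushed = push_to_dashboard(clean_forecast(fetch_forecast()), dashboard)
print(pushed, dashboard)
```

Each step only needs the output of the previous one, which is what makes the ordering between them a dependency.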

Data pipelines as graphs.

One way to make dependencies between tasks explicit is to draw the data pipeline as a graph. In this graph-based representation, tasks are nodes and directed edges between them represent dependencies: an edge from task A to task B means that A must finish before B can start.

This type of graph is typically called a directed acyclic graph (DAG), as the graph contains directed edges and does not contain any loops or cycles.
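
To make the "directed" and "acyclic" properties concrete, here is a small sketch using `graphlib` from the Python standard library (Python 3.9+), not Airflow itself. The task names are hypothetical labels for the three weather steps. A valid execution order exists only while the graph has no cycle.

```python
from graphlib import CycleError, TopologicalSorter

# Each key is a task; its value is the set of tasks it depends on.
pipeline = {
    "fetch_forecast": set(),
    "clean_forecast": {"fetch_forecast"},
    "push_to_dashboard": {"clean_forecast"},
}

# A topological order respects every dependency edge.
order = list(TopologicalSorter(pipeline).static_order())
print(order)

# Making the first task depend on the last one creates a cycle,
# so no valid execution order exists any more.
cyclic = dict(pipeline, fetch_forecast={"push_to_dashboard"})
try:
    list(TopologicalSorter(cyclic).static_order())
    has_cycle = False
except CycleError:
    has_cycle = True
print(has_cycle)
```

With a cycle, every task would be waiting on another task in the loop, so none of them could ever start. That is why pipelines are restricted to acyclic graphs.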

Pipeline graphs vs sequential scripts

  1. Prepare the sales data by doing the following:

    Fetching the sales data from the source system.
    Cleaning / transforming the sales data to fit requirements.

  2. Prepare the weather data by doing the following:

    Fetching the weather forecast data from the API.
    Cleaning / transforming the weather data to fit requirements.
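
One thing a graph makes visible, and a top-to-bottom script hides, is that the sales branch and the weather branch do not depend on each other. The sketch below uses the standard-library `graphlib` module (not Airflow) with hypothetical task names to group tasks into "waves" that could run at the same time.

```python
from graphlib import TopologicalSorter

# Two independent branches: each clean step depends only on its own fetch step.
graph = {
    "fetch_sales": set(),
    "clean_sales": {"fetch_sales"},
    "fetch_weather": set(),
    "clean_weather": {"fetch_weather"},
}

sorter = TopologicalSorter(graph)
sorter.prepare()

waves = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
    waves.append(ready)
    sorter.done(*ready)                 # mark the whole wave as finished
print(waves)
```

A sequential script would run these four tasks one after another. The graph shows that only two waves are needed, because both fetch tasks (and later both clean tasks) are ready at the same moment.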

Introducing Airflow

In Airflow, you define your DAGs in Python code. DAG files are essentially Python scripts that describe the structure of the corresponding DAG.

Scheduling and executing pipelines.

To see how Airflow executes your DAGs, let's briefly look at the overall process involved in developing and running Airflow DAGs.

Airflow DAG processor: parses DAG files and serializes them into the Airflow metastore.

Airflow scheduler: checks each DAG's schedule and queues the tasks that are due for execution.

Airflow workers: pick up tasks that are scheduled for execution and execute them.

Airflow triggerer: checks the completion status of tasks that support asynchronous processing.

Airflow API server: serves as the main interface for users to visualize DAGs.
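
The hand-off between the scheduler and the workers can be illustrated with a toy model. This is not how Airflow is implemented: the functions, the in-memory queue, and the task names are all hypothetical and only mimic the division of roles described above.

```python
from collections import deque

# Toy model only: these names and data structures are hypothetical and do not
# mirror Airflow's internals.
dag = {"fetch": set(), "clean": {"fetch"}, "push": {"clean"}}
queue = deque()   # tasks handed from the "scheduler" to the "worker"
finished = []     # tasks in the order they were executed


def schedule(dag, finished, queue):
    """Scheduler role: queue every task whose upstream tasks have finished."""
    for task, upstream in dag.items():
        if task not in finished and task not in queue and upstream <= set(finished):
            queue.append(task)


def work(queue, finished):
    """Worker role: pick up queued tasks and 'execute' them."""
    while queue:
        finished.append(queue.popleft())


# Alternate between scheduling and working until every task has run.
while len(finished) < len(dag):
    schedule(dag, finished, queue)
    work(queue, finished)
print(finished)
```

In this toy model the scheduler decides what may run and the worker does the running. In the description above the same split exists, with the metastore sitting between the components.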

Airflow is primarily designed for batch-oriented workloads, so it may not be a good fit for streaming pipelines.

Anatomy of an Airflow DAG