Building (Better) Data Pipelines using Apache Airflow
Sid Anand (@r39132) QCon.AI 2018
�1
About Me
�2
Work [ed | s] @
Maintainer of
Spare time
Co-Chair for
Apache Airflow
�3
What is it?
�4
Apache Airflow : What is it?
In a :
Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs or Directed Acyclic Graphs)
Apache Airflow
�5
UI Walk-Through
�6
Apache Airflow : UI Walk-through
Airflow - Authoring DAGs
�7
Airflow: Visualizing a DAG
�8
Airflow: Author DAGs in Python! No need to bundle many XML files!
Airflow - Authoring DAGs
�9
Airflow: The Tree View offers a view of DAG Runs over time!
Airflow - Authoring DAGs
Airflow - Performance Insights
�10
Airflow: Gantt charts reveal the slowest tasks for a run!
�11
Airflow: …And we can easily see performance trends over time
Airflow - Performance Insights
Apache Airflow
�12
Why use it?
�13
Apache Airflow : Why use it?When would you use a Workflow Scheduler like Airflow?
• ETL Pipelines
• Machine Learning Pipelines
• Predictive Data Pipelines • Fraud Detection, Scoring/Ranking, Classification,
Recommender System, etc…
• General Job Scheduling (e.g. Cron) • DB Back-ups, Scheduled code/config deployment
�14
What should a Workflow Scheduler do well? • Schedule a graph of dependencies
• where Workflow = A DAG of Tasks
• Handle task failures
• Report / Alert on failures
• Monitor performance of tasks over time
• Enforce SLAs • E.g. Alerting if time or correctness SLAs are not met
• Easily scale for growing load
Apache Airflow : Why use it?
�15
What Does Apache Airflow Add?
• Configuration-as-code
• Usability - Stunning UI / UX
• Centralized configuration
• Resource Pooling
• Extensibility
Apache Airflow : Why use it?
Use-Case : Message ScoringBatch Pipeline Architecture
�16
Use-Case : Message Scoring
�17
enterprise Aenterprise Benterprise C
S3
S3 uploads every 15 minutes
Use-Case : Message Scoring
�18
enterprise Aenterprise Benterprise C
S3
Airflow kicks of a Spark message scoring job
every hour
Use-Case : Message Scoring
�19
enterprise Aenterprise Benterprise C
S3
Spark job writes scored messages and stats to
another S3 bucket
S3
Use-Case : Message Scoring
�20
enterprise Aenterprise Benterprise C
S3
This triggers SNS/SQS messages events
S3
SNS
SQS
Use-Case : Message Scoring
�21
enterprise Aenterprise Benterprise C
S3
An Autoscale Group (ASG) of Importers spins up when it detects SQS
messages
S3
SNS
SQS
Importers
ASG
�22
enterprise Aenterprise Benterprise C
S3
The importers rapidly ingest scored messages and aggregate statistics into
the DB
S3
SNS
SQS
Importers
ASGDB
Use-Case : Message Scoring
�23
enterprise Aenterprise Benterprise C
S3
Users receive alerts of untrusted emails & can review them in
the web app
S3
SNS
SQS
Importers
ASGDB
Use-Case : Message Scoring
�24
enterprise Aenterprise Benterprise C
S3 S3
SNS
SQS
Importers
ASGDB
Airflow manages the entire process
Use-Case : Message Scoring
�25
Airflow DAG
Apache Airflow
�26
Incubating
�27
Apache Airflow : Incubating
Timeline • Airflow was created @ Airbnb in 2015 by Maxime
Beauchemin • Max launched it @ Hadoop Summit in Summer 2015 • On 3/31/2016, Airflow —> Apache Incubator
Today • 2400+ Forks • 7600+ GitHub Stars • 430+ Contributors • 150+ companies officially using it! • 14 Committers/Maintainers <— We’re growing here
Thank You!
�28
Apache Airflow
�29
Behind the Scenes
�30
Airflow is a platform to programmatically author, schedule and monitor workflows (a.k.a. DAGs)
It ships with a • DAG Scheduler • Web application (UI) • Powerful CLI • Celery Workers!
Apache Airflow : Behind the Scenes
�31
Apache Airflow : Behind the ScenesWebserver
Scheduler
WorkerWorkerWorker
Meta DB
1. A user schedules / manages DAGs using the Airflow UI!
2. Airflow’s webserver stores scheduling metadata in the metadata DB
3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ
4. Airflow workers pick up Airflow tasks over Celery
Celery / RabbitMQ
�32
Webserver
Scheduler
WorkerWorkerWorker
Meta DB
1. A user schedules / manages DAGs using the Airflow UI!
2. Airflow’s webserver stores scheduling metadata in the metadata DB
3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ
4. Airflow workers pick up Airflow tasks over Celery
Celery / RabbitMQ
Apache Airflow : Behind the Scenes
1. A user schedules / manages DAGs using the Airflow UI!
2. Airflow’s webserver stores scheduling metadata in the metadata DB
3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ
4. Airflow workers pick up Airflow tasks over Celery
�33
Webserver
Scheduler
WorkerWorkerWorker
Meta DB
Celery / RabbitMQ
Apache Airflow : Behind the Scenes
�34
Webserver
Scheduler
WorkerWorkerWorker
Meta DB
1. A user schedules / manages DAGs using the Airflow UI!
2. Airflow’s webserver stores scheduling metadata in the metadata DB
3. The scheduler picks up new schedules and distributes work over Celery / RabbitMQ
4. Airflow workers pick up Airflow tasks from RabbitMQ
Celery / RabbitMQ
Apache Airflow : Behind the Scenes
Thank You!
�35