Scalable Pipelines

Post on 21-Feb-2017



Vivek Nagarajan

Insight Data Engineering Consulting Project

My Role

• Reduce the latency of running a pipeline

• Set up infrastructure for scaling pipelines

Pre-Pipeline Stage

input: <file to upload>
output: <file to output>
transforms:
  - split on newline
  - filter record by key

<filename, yaml>
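The configuration above names an input, an output, and an ordered list of transforms. As a minimal sketch of how such a list could be applied in order, the Python below uses a plain dict in place of the parsed YAML. Everything beyond the two transform names is an assumption made for illustration: the file names, the `user_id` key, the choice of one JSON object per line as the record format, the reading of "filter record by key" as "keep records that contain the key", and the helper names `TRANSFORMS` and `run_transforms`. It is not the project's actual implementation.

```python
import json


def split_on_newline(blob):
    """Split raw text into one record per non-empty line."""
    return [line for line in blob.splitlines() if line.strip()]


def filter_record_by_key(records, key):
    """Parse each record as JSON (assumed format) and keep those containing `key`."""
    kept = []
    for raw in records:
        record = json.loads(raw)
        if key in record:
            kept.append(record)
    return kept


# Hypothetical registry mapping the transform names from the config to functions.
TRANSFORMS = {
    "split on newline": split_on_newline,
    "filter record by key": filter_record_by_key,
}

# Hypothetical stand-in for the parsed <filename, yaml> entry.
config = {
    "input": "events.txt",
    "output": "filtered.txt",
    "transforms": [
        ("split on newline", {}),
        ("filter record by key", {"key": "user_id"}),
    ],
}


def run_transforms(blob, transforms):
    """Apply each named transform in order, feeding each output into the next."""
    data = blob
    for name, kwargs in transforms:
        data = TRANSFORMS[name](data, **kwargs)
    return data
```

Calling `run_transforms(text, config["transforms"])` on the contents of the input file would return the filtered records, ready to be written to the output path.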


My ETL Pipeline

Scaling Pipeline

Schedule


Demo

Airflow web server link: http://vivek-airlflow-pipeline.us/

Challenges

• Understand existing framework and infrastructure

• Evolving set of requirements

• Quirks of scaling pipelines in distributed Airflow

Performance Stats

• Reduced time taken to process pipeline by over 50 percent

• Running 30 pipelines concurrently takes an average of 2 minutes per pipeline
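An "average time per pipeline under concurrency" figure like the one above can be collected by timing each run individually while all runs execute at once, then taking the mean. The sketch below shows one way to do that with Python's standard library. The `run_pipeline` function is a hypothetical placeholder that only sleeps, and the thread pool stands in for whatever executor the real deployment uses; this is not how the numbers on this slide were measured.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_pipeline(pipeline_id):
    """Hypothetical stand-in for one pipeline run; returns its duration in seconds."""
    start = time.perf_counter()
    time.sleep(0.01)  # placeholder for the real pipeline work
    return time.perf_counter() - start


def average_duration(num_pipelines):
    """Run `num_pipelines` pipelines concurrently and return the mean duration."""
    with ThreadPoolExecutor(max_workers=num_pipelines) as pool:
        durations = list(pool.map(run_pipeline, range(num_pipelines)))
    return sum(durations) / len(durations)
```

Calling `average_duration(30)` mirrors the 30-pipeline scenario; with real pipelines in place of the sleep, the result would be the per-pipeline average in seconds.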

Possible extensions

• Setting up HA on Flink cluster

• Benchmarking with Spark transformations

• Setting up multi-node Redis cluster

• More support for dynamic transformations

About Me

Thank You