Building Fast, Scalable Machine Learning Pipelines
Vlad Giverts, Sr Director of Software Engineering, Workday
Identified Recruit
Web Crawlers
Hadoop: HDFS, MR1
Data Pipeline
Solr
Identified Data Pipeline 1.0
Facebook Data -> Parse -> Normalize -> Index -> Publish
Identified Data Pipeline 2.0
Facebook Data -> Parse -> Normalize
Twitter Data -> Parse -> Normalize
Doximity Data -> Parse -> Normalize
Merge (ML) -> Index -> Publish
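The Pipeline 2.0 flow above (per-source parse and normalize, then a single merge) can be sketched roughly as follows. This is only an illustration under stated assumptions: the field names, the `normalize`, `merge`, and `run_pipeline` helpers, and the sample records are all hypothetical, and the merge here is a plain exact match on a normalized name standing in for the ML-based merge named on the slide.

```python
from collections import defaultdict


def normalize(source, record):
    """Map a parsed, source-specific record onto a shared schema.

    The schema (source, name, title) is an assumption for this sketch.
    """
    name = " ".join(record["name"].lower().split())
    return {
        "source": source,
        "name": name,
        "title": record.get("title", "").strip(),
    }


def merge(records):
    """Group normalized records that appear to describe the same person.

    Placeholder for the ML merge step: an exact match on the normalized name.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec["name"]].append(rec)
    return [
        {
            "name": name,
            "sources": sorted(r["source"] for r in recs),
            # Take the first non-empty title found across sources.
            "title": next((r["title"] for r in recs if r["title"]), ""),
        }
        for name, recs in groups.items()
    ]


def run_pipeline(parsed_by_source):
    """Normalize every source independently, then merge across sources."""
    normalized = [
        normalize(source, rec)
        for source, recs in parsed_by_source.items()
        for rec in recs
    ]
    return merge(normalized)


# Made-up sample input, one list of parsed records per source.
profiles = run_pipeline({
    "facebook": [{"name": "Jane  Doe"}],
    "twitter": [{"name": "jane doe", "title": "Engineer"}],
    "doximity": [{"name": "John Roe", "title": "Physician"}],
})
```

Keeping normalization per source and merging only on the shared schema means a new source needs its own parser and normalizer but no change to the merge step.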
Retention Risk
Spark
HDFS
ElasticSearch
YARN
ML Pipeline
Kafka
Indexing
ML Pipeline
Snapshot Data -> Feature Extraction
Data and “Features”
Tenure
Time in Current Function
Pay Range Penetration
Manager Attrition Rate
Num Promotions
Avg Time Between Promotions
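Several of the features listed above can be derived from an employee's history as of a snapshot date. The following is a minimal sketch, assuming a hypothetical `extract_features` helper and made-up inputs; pay range penetration uses the common definition (salary minus range minimum, divided by the range width), and Manager Attrition Rate is left out because it needs data about other employees.

```python
from datetime import date


def extract_features(snapshot, hire_date, function_start, promotion_dates,
                     salary, range_min, range_max):
    """Compute per-employee features as of the snapshot date.

    All parameter names are assumptions made for this sketch.
    """
    # Only events on or before the snapshot may be used, so that the
    # features never include information from the future.
    promos = sorted(d for d in promotion_dates if d <= snapshot)
    gaps = [(later - earlier).days for earlier, later in zip(promos, promos[1:])]
    return {
        "tenure_days": (snapshot - hire_date).days,
        "days_in_current_function": (snapshot - function_start).days,
        "pay_range_penetration": (salary - range_min) / (range_max - range_min),
        "num_promotions": len(promos),
        "avg_days_between_promotions": sum(gaps) / len(gaps) if gaps else None,
    }


# Made-up example employee.
features = extract_features(
    snapshot=date(2015, 6, 30),
    hire_date=date(2012, 6, 30),
    function_start=date(2014, 6, 30),
    promotion_dates=[date(2013, 6, 30), date(2014, 6, 30), date(2016, 1, 1)],
    salary=75_000,
    range_min=50_000,
    range_max=100_000,
)
```

The promotion dated after the snapshot is ignored, which is the point of computing features against snapshot data rather than the latest record.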
ML Pipeline
Snapshot Data -> Feature Extraction -> Model Training -> Model Validation
Training and Validation
Timeline, 2014 to 2016:
Barry: Raise, $1,000
Ravi: Left :(
John: Left :(
Albert: Promoted!
Yury: Hired
Tejas: Changed Teams
Training and Validation
Q1 ’14 Q2 ’14 Q3 ’14 Q4 ’14 Q1 ’15 Q2 ’15
TRAINING VALIDATION
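The quarter timeline above suggests a split by time rather than a random split: earlier quarters train the model, later quarters validate it. A minimal sketch follows, assuming hypothetical `quarter_key` and `temporal_split` helpers; the boundary at Q1 '15 is an assumption, since the slide does not state where training ends and validation begins.

```python
def quarter_key(label):
    """Turn a label like "Q3 '14" into a sortable (year, quarter) tuple."""
    quarter, year = label.split()
    return (2000 + int(year.lstrip("'")), int(quarter[1:]))


def temporal_split(rows, first_validation_quarter):
    """Rows before the cutoff quarter go to training, the rest to validation."""
    cutoff = quarter_key(first_validation_quarter)
    train = [r for r in rows if quarter_key(r["quarter"]) < cutoff]
    valid = [r for r in rows if quarter_key(r["quarter"]) >= cutoff]
    return train, valid


quarters = ["Q1 '14", "Q2 '14", "Q3 '14", "Q4 '14", "Q1 '15", "Q2 '15"]
rows = [{"quarter": q} for q in quarters]  # one placeholder row per quarter

# Assumed cutoff: everything from Q1 '15 onward is held out for validation.
train, valid = temporal_split(rows, "Q1 '15")
```

Splitting on time keeps the validation data strictly later than the training data, which mirrors how the model is used: trained on the past, asked about the future.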
ML Pipeline
Snapshot Data -> Feature Extraction -> Model Training -> Model Validation -> Evaluation
Evaluation
Q1 ’14 Q2 ’14 Q3 ’14 Q4 ’14 Q1 ’15 Q2 ’15 Q3 ’15 Q4 ’15
PREDICTION
ML Pipeline
Snapshot Data -> Feature Extraction -> Model Training -> Model Validation -> Evaluation -> Publish Results