Date posted: 23-Jan-2018
Category: Data & Analytics
Uploaded by: traveloka
Traveloka Data Meetup v1.0.0
How to Feed a Data Hungry Organization
Part One
Traveloka Data Culture
Part 1: Traveloka Data Culture
Five Characteristics of a Data Hungry Organization
Data-Driven Decisions
Learn from Mistakes
Better Understanding
Uncertainty and Variation
High Quality Data
Data Hungry Organization
Our responsibility is to turn data into consumable insights
DATA → TEAM → BETTER BUSINESS DECISION
We need the brightest people to fill our needs and create the future
Mathematics
Business
Programming
Skills
Some of the skills in mathematics
Mathematics
Optimization
Decision Theory
Statistics
Differential Equations
Time Series
Some of the skills in business
Business
Strategy
Finance
Economics
Some of the skills in programming
Programming
Data Wrangling
Modelling
Big Data
This is how we structure our team
Data Team
Data Governance
Machine Learning Engineering
Data Analysis
Data Science
Data Engineering
Houston, we have a problem.
DW: tens of terabytes, hundreds of ETLs
Kafka: hundreds of topics, millions of messages per hour, hundreds of megabytes per second
S3: hundreds of terabytes
Redshift: tens of thousands of queries daily
DOMO: thousands of cards, hundreds of users
PeriscopeData: thousands of dashboards, hundreds of users
We need state of the art technology to feed data hungry people
Ingestion: Gobblin
Data Lake: AWS S3
Batch Processing: Spark, Airflow, Hadoop2, Python, Java App
Data Warehouse: Redshift, MongoDB, PostgreSQL
Datahub: Pubsub, Kafka
Stream Processing: DataFlow, MemSQL Pipeline
Near Real Time DW: GCP BigQuery, MemSQL
Real Time DB: AWS DynamoDB
Pipeline stages: Ingestion, Processing, Storage, Presentation
Source DB: Mongo, PostgreSQL
App / Services: Java App
Analytics Tools: PeriscopeData, Spark, R, Domo, Dataiku, Holistics, Keboola
ML Tools, Library, and Services: Jupyter, Zeppelin, Caffe, DataDog, TensorFlow, Cloud Vision API
Query Engine: Qubole, Presto, Hive
Part Two
Data Engineering
Part 2: Data Engineering
Fast Food,
Or…?
MINDSETS
Managed service for focus
So we could focus more on the use cases
Real Time Pipeline
5-minute data delivery SLA; actual latency ~10 s
100 ms query SLA; actual latency ~10 ms (p95)
Key-value data, queried by service/app
Autoscale
Self service for each engineering team: we provide governance, guidance, building blocks, and consultation
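The query SLA above is stated against a p95 latency. As a minimal sketch of what that check means (not the team's actual monitoring code), the snippet below computes a nearest-rank percentile over a list of latency samples and compares it with the 100 ms target. The function names and the sample values are hypothetical and exist only for illustration.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    if not 0 < pct <= 100:
        raise ValueError("pct must be an integer in 1..100")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, avoiding floating point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_sla(samples_ms, sla_ms=100, pct=95):
    """True when the chosen percentile of the samples is within the SLA."""
    return percentile(samples_ms, pct) <= sla_ms


# Hypothetical query latencies in milliseconds (illustrative only).
latencies_ms = [4, 5, 5, 6, 6, 7, 7, 8, 8, 8,
                9, 9, 9, 10, 10, 10, 11, 12, 14, 250]

p95 = percentile(latencies_ms, 95)
print(p95, meets_sla(latencies_ms))
```

A percentile target tolerates a small share of slow outliers, which is why a single 250 ms query in this sample does not breach a 100 ms p95 SLA.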
Real Time Pipeline (diagram)
Near Real Time Pipeline
Raw data, queried by BI tools
5-minute data delivery SLA; actual latency ~5 s
Using YAML for schema definition (built and defined by ourselves)
Self service for data analysts, with guidance and governance!
Near Real Time Pipeline (diagram)
Near Real Time Pipeline
But MemSQL is not a managed service; it runs on EC2.
It is easy to scale, but it does not autoscale yet.
So we are moving to… v2!!
Currently in usability testing by analysts.
Self service, of course!
Near Real Time Pipeline (diagram)
Analytical Pipeline
Heavy data processing, queried by BI tools
6-hour data delivery SLA
Analytical Pipeline
Interesting features:
• Custom dev/prod environment, for self service!
• Custom framework, on top of Spark
• Custom Airflow, with a separate queue for backfill
• EMR autoscale for backfill
• Redshift microbatch bulk load
• etc...
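The "microbatch bulk load" item in the list above can be pictured as a buffer that collects records and hands them off in small batches, flushing when either a size or an age threshold is reached. The sketch below is an assumption about the general pattern, not Traveloka's implementation: `MicroBatcher`, its parameters, and the `flush` callback are hypothetical names. In a real pipeline the callback would stage the batch and trigger a bulk load into the warehouse.

```python
class MicroBatcher:
    """Buffer records and flush them in small batches (hypothetical sketch)."""

    def __init__(self, flush, max_records=500, max_age_s=60.0):
        self._flush = flush              # callback that receives one batch (a list)
        self._max_records = max_records  # flush when the buffer holds this many records
        self._max_age_s = max_age_s      # or when the oldest record is this old
        self._buffer = []
        self._opened_at = None           # time the current batch was started

    def add(self, record, now):
        """Append a record; `now` is a timestamp in seconds supplied by the caller."""
        if not self._buffer:
            self._opened_at = now
        self._buffer.append(record)
        self.maybe_flush(now)

    def maybe_flush(self, now):
        """Flush if the batch is full or stale. Returns True when a flush happened."""
        if not self._buffer:
            return False
        full = len(self._buffer) >= self._max_records
        stale = now - self._opened_at >= self._max_age_s
        if not (full or stale):
            return False
        batch, self._buffer = self._buffer, []
        self._opened_at = None
        self._flush(batch)
        return True


# Usage: collect batches in a list instead of loading them anywhere.
loads = []
batcher = MicroBatcher(loads.append, max_records=3, max_age_s=10.0)
for i in range(7):
    batcher.add({"id": i}, now=float(i))
batcher.maybe_flush(now=20.0)  # the leftover record is flushed once it is stale
print([len(batch) for batch in loads])
```

Passing the clock value in as `now` keeps the sketch deterministic; a production version would more likely read a monotonic clock and run the staleness check on a timer.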
Summary
Part Three
Data Science in Traveloka
Part 3: Data Science in Traveloka
Three Things to Discuss Today
Data Science Purpose
Tools of the Trade
Model Evaluations and Applications
Novia is 25 years old. She is single, outspoken, and mathematically gifted. As a student, she was deeply interested in calculus and statistics, and also participated in the International Mathematical Olympiad.
a. Novia is a data scientist
b. Novia is a data scientist and is active as a Mathematical Olympiad tutor
Consider a regular six-sided die with four green faces and two red faces. The die will be rolled 20 times and the sequence of greens (G) and reds (R) will be recorded.
Choose one sequence from a set of three. Which one is the most likely outcome?
RGRRR
GRGRRR
GRRRRR
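One way to check the intuition on this puzzle is to compute the probability of each sequence directly. The sketch below assumes independent rolls with P(G) = 4/6 and P(R) = 2/6, and computes the probability that a given run of consecutive rolls matches each candidate exactly. The variable and function names are illustrative.

```python
from fractions import Fraction

# Four green faces and two red faces on a six-sided die.
FACE_PROB = {"G": Fraction(4, 6), "R": Fraction(2, 6)}


def sequence_probability(seq):
    """Probability that consecutive independent rolls produce exactly `seq`."""
    prob = Fraction(1)
    for face in seq:
        prob *= FACE_PROB[face]
    return prob


candidates = ["RGRRR", "GRGRRR", "GRRRRR"]
probs = {seq: sequence_probability(seq) for seq in candidates}
most_likely = max(probs, key=probs.get)

for seq in candidates:
    print(seq, probs[seq])
print("most likely:", most_likely)
```

The shortest sequence wins even though it looks "less representative" of a mostly green die. GRGRRR is just RGRRR with an extra G in front, so whenever GRGRRR shows up in the 20 rolls, RGRRR has shown up too, and the longer sequence can never be the more likely one.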
Remember This:
The goal of a data science exercise is to help us make a good business decision
Logic
Alternatives
Information
Preferences
“if they learn nothing else about decision analysis from their studies, distinction between outcome and decisions will have been worth the price of admission”
Ron Howard, Professor at Stanford University
Father of Decision Analysis
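Decisions vs. Outcome
Good decision, good outcome: Took a taxi and arrived safely
Bad decision, good outcome: Drove home and arrived safely
Good decision, bad outcome: Took a taxi and was involved in an accident
Bad decision, bad outcome: Drove home and was involved in an accident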
Data Science Framework: CRISP-DM
Business, Data, Data Prep, Model, Evaluation, Deployment
Common Sense
“Hiding within those mounds of data is knowledge that could change the life of a patient, or change the world”
Atul Butte, Stanford
We use open source libraries for data science
Wrangling
• data.table
• dplyr
• sparkR
• sparklyr
• pandas
• pyspark
Visualization
• ggplot
• matplotlib
• seaborn
• shiny
Statistics
• R
• JAGS
• STAN
• Python
• Julia
Machine Learning
• scikit-learn
• caret
• e1071
• fbprophet
Are we using the algorithm? Or being used by it?
Classification
Linear Models
Naïve Bayes Classifier
Support Vector Classifier
Vowpal Wabbit Classifier
Random Forest
Decision Trees
Neural Network
Extreme Gradient Boosted Trees
Many more algos!
Prediction
Linear Models
Nystroem Regressor
Support Vector Regressor
Vowpal Wabbit Regressor
Random Forest
Decision Trees
Neural Network
Extreme Gradient Boosted Trees
More Algos!
• Scikit-learn
• Caret
• TensorFlow
• …
We need more than just off the shelf libraries to feed data hungry people
Bayesian Network
Markov Chain Monte Carlo
Model Evaluation: judging the usefulness of your model
Rule #1
Never ever peek at the test set during training/validation
Rule #2
You can never satisfy all the metrics; pick one or two metrics as your decision criteria beforehand
Rule #3
Always do comparative statics on the final model
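Rules #1 and #2 can be sketched as a workflow: split the data three ways, choose among candidate models using a single metric fixed in advance on the validation set, and score the held-out test set exactly once at the end. The example below is a toy illustration under stated assumptions: the data is synthetic, the "models" are fixed thresholds standing in for models that would normally be fitted on `train`, and every name is hypothetical.

```python
import random


def three_way_split(rows, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle once with a fixed seed, then cut into train / validation / test."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = round(len(rows) * test_frac)
    n_val = round(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


def accuracy(threshold, rows):
    """The one metric chosen beforehand: share of rows classified correctly."""
    return sum((x >= threshold) == label for x, label in rows) / len(rows)


# Synthetic toy data: the label is True exactly when x >= 0.5.
rng = random.Random(42)
data = [(x, x >= 0.5) for x in (rng.random() for _ in range(200))]
train, val, test = three_way_split(data)

# Candidate "models" (stand-ins for models fitted on `train`).
candidates = [0.3, 0.4, 0.5, 0.6, 0.7]

# Model selection looks only at the validation set.
best = max(candidates, key=lambda t: accuracy(t, val))

# The test set is touched once, after the choice is final.
test_accuracy = accuracy(best, test)
print(best, test_accuracy)
```

Keeping the test score out of the selection loop is what makes it an honest estimate; if it were used to pick `best`, it would become one more validation score.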
Comparative Statics
commonly used as feature importance analysis
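Comparative statics can be illustrated as "change one input, hold the rest fixed, and see how the prediction moves". The sketch below applies that idea to a toy linear function standing in for a fitted model. The feature names, coefficients, and baseline values are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def comparative_statics(model, baseline, deltas):
    """For each feature, shift it by its delta while holding all other
    features at their baseline values, and record the change in prediction."""
    base_prediction = model(baseline)
    effects = {}
    for name, delta in deltas.items():
        shifted = dict(baseline)  # copy so the baseline stays untouched
        shifted[name] += delta
        effects[name] = model(shifted) - base_prediction
    return effects


def toy_model(x):
    """Hypothetical stand-in for a fitted model (illustrative coefficients)."""
    return 100 + 3 * x["discount_pct"] - 2 * x["price_index"] + 0 * x["day_of_week"]


baseline = {"discount_pct": 10, "price_index": 20, "day_of_week": 3}
effects = comparative_statics(toy_model, baseline, {name: 1 for name in baseline})

# Rank features by the size of their effect, largest first.
ranking = sorted(effects, key=lambda name: abs(effects[name]), reverse=True)
print(effects)
print(ranking)
```

For a nonlinear model the effects depend on the chosen baseline, so it is usually worth repeating the exercise at a few representative baselines rather than reading one ranking as global feature importance.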
Remember the end goal: decisions
What should we do?
What might happen?
“But in my view, obsessive customer focus is by far the most protective of Day 1 vitality”
Our data is telling us:
• What do they want?
• Do we serve their needs?
• Are they trying to leave us?
My name is Jeff
Thank you!