© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Danny Bickson, Co-founder DATO
CMP305
Deep Learning on AWS Made Easy
October 2015
Who is Dato?
Seattle-based machine learning company
45+ employees and growing fast!
Deep learning example
©Dato
Image classification
Input: x = image pixels
Output: y = predicted object
Neural networks
Learning *very* non-linear features
Linear classifiers (binary)
Score(x) > 0 → class 1; Score(x) < 0 → class 0
Score(x) = w0 + w1 x1 + w2 x2 + … + wd xd
Graph representation of classifier: useful for defining neural networks
(Diagram: inputs x1, …, xd plus a constant 1, each weighted by w0, w1, …, wd, feed a single output node y: output 1 if Score(x) > 0, output 0 if Score(x) < 0.)
Score(x) = w0 + w1 x1 + w2 x2 + … + wd xd
What can a linear classifier represent?
x1 OR x2: y = 1 if −0.5 + 1·x1 + 1·x2 > 0, else 0
x1 AND x2: y = 1 if −1.5 + 1·x1 + 1·x2 > 0, else 0
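These weights can be checked directly. A minimal Python sketch (not from the talk; the helper names are mine) of the linear threshold classifier with the OR and AND weights above:

```python
def score(w, x):
    # w[0] is the bias w0; x is the feature vector (x1, ..., xd)
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def classify(w, x):
    # threshold the score at zero
    return 1 if score(w, x) > 0 else 0

w_or = (-0.5, 1, 1)    # x1 OR x2
w_and = (-1.5, 1, 1)   # x1 AND x2

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, classify(w_or, x), classify(w_and, x))
```

Running the loop reproduces the OR and AND truth tables.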
What can't a simple linear classifier represent?
XOR: the counterexample to everything.
Need non-linear features.
Solving the XOR problem: adding a layer
XOR = (x1 AND NOT x2) OR (NOT x1 AND x2)
z1 = 1 if −0.5 + 1·x1 − 1·x2 > 0, else 0 (x1 AND NOT x2)
z2 = 1 if −0.5 − 1·x1 + 1·x2 > 0, else 0 (NOT x1 AND x2)
y = 1 if −0.5 + 1·z1 + 1·z2 > 0, else 0 (z1 OR z2)
Each unit is thresholded to 0 or 1.
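The two-layer solution can be verified in a few lines. A sketch (the helper names are mine, not a Dato API) of the XOR network with exactly the weights above:

```python
def unit(bias, weights, inputs):
    # threshold unit: output 1 if bias + w·x > 0, else 0
    s = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > 0 else 0

def xor_net(x1, x2):
    z1 = unit(-0.5, (1, -1), (x1, x2))   # x1 AND NOT x2
    z2 = unit(-0.5, (-1, 1), (x1, x2))   # NOT x1 AND x2
    return unit(-0.5, (1, 1), (z1, z2))  # z1 OR z2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))
```

The hidden layer is what makes this possible: no single linear threshold unit can compute XOR.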
A neural network
• Layers and layers and layers of linear models and non-linear transformations
• Around for about 50 years
• In the last few years, a big resurgence
- Impressive accuracy on several benchmark problems
- Advances in hardware enable the computation (e.g., AWS g2 instances)
(Diagram: inputs x1, x2 and a constant 1 feed hidden units z1, z2, which with a constant 1 feed output y.)
Application of deep learning
to computer vision
Feature detection: traditional approach
• Features = local detectors
- Combined to make the prediction
- (In reality, features are more low-level)
(Diagram: eye, eye, nose, and mouth detectors combine into the prediction "Face!")
Many hand-created features exist for finding interest points…
• SIFT [Lowe '99]
• Spin Images [Johnson & Herbert '99]
• Textons [Malik et al. '99]
• RIFT [Lazebnik '04]
• GLOH [Mikolajczyk & Schmid '05]
• HoG [Dalal & Triggs '05]
• …
Standard image classification approach
Input → extract hand-created features → use simple classifier (e.g., logistic regression, SVMs) → Face?
… but hand-created features are very painful to design.
Deep learning:
implicitly learns features
(Figure: Layer 1 → Layer 2 → Layer 3 → Prediction, with example detectors learned and example interest points detected at each layer. [Zeiler & Fergus '13])
Deep learning performance
Deep learning accuracy:
• German traffic sign recognition benchmark: 99.5% accuracy (IDSIA team)
• House number recognition: 97.8% accuracy per character [Goodfellow et al. '13]
ImageNet 2012 competition: 1.2M training images, 1000 categories
(Chart: error rate, best of 5 guesses, for the top 3 teams: SuperVision, ISI, and OXFORD_VGG. SuperVision shows a huge gain; the other teams exploited hand-coded features like SIFT.)
ImageNet 2012 competition: 1.2M training images, 1000 categories
Winning entry: SuperVision, 8 layers, 60M parameters [Krizhevsky et al. '12]
Achieving these amazing results required:
• New learning algorithms
• GPU implementation
Deep learning performance
• ImageNet: 1.2M images
(Chart: running time in hours on g2.xlarge vs. g2.8xlarge instances.)
Deep learning in computer vision
Scene parsing with deep learning
[Farabet et al. ‘13]
Retrieving similar images: input image → nearest neighbors
Deep learning usability
Designed a simple user interface
import graphlab

# Training the model
model = graphlab.neuralnet.create(train_images)

# Predicting classes for new images
outcome = model.predict(test_images)
Deep learning demo
Challenges of deep learning
Deep learning score card: Pros
• Enables learning of features rather than hand tuning
• Impressive performance gains
- Computer vision
- Speech recognition
- Some text analysis
• Potential for more impact
Deep learning workflow
Lots of labeled data → split into a training set and a validation set → learn a deep neural net → validate → adjust parameters, network architecture, …
Many tricks needed to work well…
Different types of layers, connections, … are needed for high accuracy [Krizhevsky et al. '12]
Deep learning score card: Cons
• Requires a lot of data for high accuracy
• Computationally really expensive
• Extremely hard to tune
- Choice of architecture
- Parameter types
- Hyperparameters
- Learning algorithm
- …
Computational cost + so many choices = incredibly hard to tune
Deep features:
Deep learning
+
Transfer learning
Standard image classification approach
Input → extract hand-created features → use simple classifier (e.g., logistic regression, SVMs) → Face?
Can we learn features from data, even when we don't have much data or time?
What's learned in a neural net
Neural net trained for Task 1: cat vs. dog
• Earlier layers are more generic: they can be used as a feature extractor
• The end layers are very specific to Task 1: they should be ignored for other tasks
Transfer learning in more detail…
Neural net trained for Task 1: cat vs. dog
• Keep the weights of the generic earlier layers fixed and use them as a feature extractor
• For Task 2 (predicting 101 categories), learn only the end part of the neural net: a simple classifier (e.g., logistic regression, SVMs, nearest neighbor, …) that outputs the class
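As an illustration only (GraphLab Create is not used here, and the "pretrained" net is faked with fixed random weights), a NumPy sketch of the recipe: freeze the feature-extractor layers, then train a simple logistic classifier for the new task on the extracted features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained net: two fixed (frozen) ReLU layers.
# In practice these weights would come from a net trained on Task 1.
W1, W2 = rng.normal(size=(4, 16)), rng.normal(size=(16, 8))

def deep_features(X):
    # Keep weights fixed: forward pass only, no gradient updates here.
    return np.maximum(np.maximum(X @ W1, 0) @ W2, 0)

# Task 2: a small labeled set; learn only a simple classifier on top.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

F = deep_features(X)
w, b = np.zeros(F.shape[1]), 0.0
for _ in range(500):                     # plain gradient descent
    p = 1 / (1 + np.exp(-(F @ w + b)))   # sigmoid probabilities
    g = p - y                            # logistic-loss gradient signal
    w -= 0.01 * F.T @ g / len(y)
    b -= 0.01 * g.mean()

acc = ((F @ w + b > 0) == (y == 1)).mean()
print("train accuracy:", acc)
```

The point is the split: all the expensive representation learning is frozen, and only the cheap final classifier is trained on the new task's small labeled set.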
Careful where you cut: latter layers may be too task-specific
(Figure: Layer 1 → Layer 2 → Layer 3 → Prediction, with example detectors learned and example interest points detected. [Zeiler & Fergus '13] Use the earlier layers; the latter layers are too specific for the new task.)
Transfer learning with deep features workflow
Some labeled data → extract features with a neural net trained on a different task → split into a training set and a validation set → learn a simple classifier → validate
How general are deep features?
Barcelona Buildings
Architectural transition
Deep learning in production on
AWS
How to use deep learning in production?
• Predictive: understands input & takes actions or makes decisions
• Interactive: responds in real time
• Learning: improves its performance with experience
(Diagram: your intelligent application exchanges real-time data with an intelligent backend service and receives predictions & decisions; historical data feeds a machine learning model, which supplies those predictions & decisions. Most ML research happens on the model side, but that research is useless without a great solution on the serving side.)
Essential ingredients of an intelligent service
• Responsive: intelligent applications are interactive; they need low latency, high throughput & high availability
• Adaptive: ML models are out-of-date the moment learning is done; we need to constantly understand & improve end-to-end performance
• Manageable: many thousands of models, created by hundreds of people; we need versioning, attribution, provenance & reproducibility
Responsive: Now and Always
Intelligent applications are interactive; they need low latency, high throughput & high availability.
Addressing latency
Challenge: scoring latency
Compute predictions in < 20 ms for complex models, all while under heavy query load.
(Diagram: models and queries, with top-K retrieval and feature lookups such as SELECT * FROM users JOIN items, click_logs, pages WHERE …)
The common solutions to latency
• Faster online model scoring: "execute Predict(query) in real time as queries arrive"
• Pre-materialization and lookup: "pre-compute Predict(query) for all queries and look up the answer at query time"
Dato Predictive Services does both.
Faster online model scoring: highly optimized machine learning
• SFrame: native code, optimized data frame
- Available open source (BSD)
• Model querying acceleration with native code, e.g.,
- top-K and nearest neighbor evaluation: LSH, ball trees, …
Smart materialization caching
(Chart: query frequency vs. unique queries, a long-tailed distribution.)
Example: the top 10% of all unique queries cover 90% of all queries performed. Caching a small number of unique queries has a very large impact.
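A small self-contained simulation (the workload and numbers are synthetic, not Dato's) showing why caching a small fraction of unique queries pays off on a skewed workload:

```python
import random
from collections import Counter

random.seed(0)

# Simulate a heavy-tailed (Zipf-like) query workload:
# a few query values occur very often, most occur rarely.
queries = [int(random.paretovariate(1.0)) for _ in range(100_000)]

counts = Counter(queries)
unique = [q for q, _ in counts.most_common()]  # most frequent first

# Cache only the most frequent 10% of unique queries.
cached = set(unique[: max(1, len(unique) // 10)])
hit_rate = sum(counts[q] for q in cached) / len(queries)
print(f"{len(cached)} of {len(unique)} unique queries cached, "
      f"hit rate = {hit_rate:.0%}")
```

On a skewed workload like this one, the small cache absorbs the large majority of traffic, which is exactly the effect the slide describes.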
Distributed shared caching
The distributed shared cache (Redis) stores:
• Model query results
• Common features (e.g., product info)
Scale-out improves throughput and latency.
Dato latency by the numbers
Easy case, cache hit: ~2 ms
Hard case, cache miss:
• Simple linear models: 5-6 ms
• Complex random forests: 7-8 ms (P99: ~15 ms)
[using an AWS m3.xlarge instance]
Challenge: availability
• Heavy load → substantial delays
• Frequent model updates → cache misses
• Machine failures
Scale-out availability under load
(Diagram: heavy load spread by an Elastic Load Balancing load balancer across serving nodes.)
Adaptive: Accounting for Constant Change
ML models are out-of-date the moment learning is done; we need to constantly understand & improve end-to-end performance.
Change at different scales and rates
• Shopping for Mom vs. shopping for me
• Rate of change: from months down to minutes
• Granularity of change: from the whole population down to a single session
Individual- and session-level change calls for:
• Small data
• Online learning
• Bandits to assess models
The dangerous feedback loop
I once looked at cameras on Amazon… and now it shows me bags, similar cameras, and accessories.
If this is all they showed, how would they learn that I also like bikes and shoes?
Exploration / exploitation tradeoff
Systems that can take actions can adversely affect future data.
• Exploration (random action): learn more about what is good and bad
• Exploitation (best action): make the best use of what we believe is good
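One standard way to balance the two sides of this tradeoff is an epsilon-greedy bandit; a sketch with made-up reward rates (think of three candidate models with different success rates, not anything from Dato's system):

```python
import random

random.seed(0)

def epsilon_greedy(true_rates, steps=10_000, eps=0.1):
    # Per-action pull counts and running mean reward estimates.
    n_actions = len(true_rates)
    counts = [0] * n_actions
    values = [0.0] * n_actions
    for _ in range(steps):
        if random.random() < eps:          # explore: random action
            a = random.randrange(n_actions)
        else:                              # exploit: best action so far
            a = max(range(n_actions), key=lambda i: values[i])
        # Simulated Bernoulli reward for the chosen action.
        reward = 1.0 if random.random() < true_rates[a] else 0.0
        counts[a] += 1
        values[a] += (reward - values[a]) / counts[a]  # running mean
    return counts, values

counts, values = epsilon_greedy([0.2, 0.5, 0.8])
print(counts)  # the highest-rate action should dominate the pulls
```

The epsilon fraction of random actions keeps collecting data on all actions, so the system can still discover changes instead of locking itself into the feedback loop above.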
Dato solution to adaptivity
• Rapid offline learning with GraphLab Create
• Online bandit adaptation in Predictive Services
• Demo
Manageable: Unification and Simplification
Many thousands of models, created by hundreds of people; we need versioning, attribution, provenance & reproducibility.
Ecosystem of intelligent services
(Diagram: data infrastructure (MySQL), serving, and data science layers, spanning Model A, Model B, Table A, Table B, Service A, and Service B.)
Complicated! Many systems with overlapping roles, and no single source of truth for the intelligent service.
Dato Predictive Services
Responsive • Adaptive • Manageable
Model management: like code management, but for the life cycle of intelligent applications
Provenance & reproducibility
• Track changes & roll back
• Cover code, model type, parameters, data, …
Collaboration
• Review, blame
• Share
• Common feature engineering pipelines
Continuous integration
• Deploy & update
• Measure & improve
• Avoid downtime and impact on end users
Dato Predictive Services: serving models and managing the machine learning lifecycle
GraphLab Create: accurate, robust, and scalable model training
GraphLab Create: sophisticated machine learning made easy
• High-level ML toolkits
• AutoML: tunes parameters, model selection, … so you can focus on the creative parts
• Reusable features: transferrable feature engineering, for accuracy with less data & less effort
High-level ML toolkits: get started with 4 lines of code, then modify, blend, add yours…
Recommender, image search, sentiment analysis, data matching, auto tagging, churn predictor, object detector, product sentiment, click prediction, fraud detection, user segmentation, data completion, anomaly detection, document clustering, forecasting, search ranking, summarization, …
import graphlab as gl

data = gl.SFrame.read_csv('my_data.csv')
model = gl.recommender.create(data,
                              user_id='user',
                              item_id='movie',
                              target='rating')
recommendations = model.recommend(k=5)
(Diagram: SFrame and SGraph ❤️ all ML tools.)
SFrame: sophisticated machine learning made scalable
Opportunity for out-of-core ML
(Chart: storage capacity vs. throughput.)
• 0.1 TB at 1 GB/s: fast, but significantly limits data size
• 1 TB at 0.5 GB/s: the opportunity for big data on 1 machine
• 10 TB at 0.1 GB/s: for sequential reads only; random access is very slow
The out-of-core ML opportunity is huge
• The usual design → lots of random access → slow
• Instead, design to maximize sequential access for ML algorithm access patterns
• GraphChi was an early example; SFrame is the data frame for ML
Performance of SFrame/SGraph
Connected components on the Twitter graph (41 million nodes, 1.4 billion edges):
• GraphLab Create (SGraph), 1 machine: 70 sec
• GraphX, 16 machines: 251 sec
• Giraph, 16 machines: 200 sec
• Spark, 16 machines: 2,128 sec
Source(s): Gonzalez et al. (OSDI 2014)
SFrame & SGraph
• High performance: optimized out-of-core computation for ML
• 1 machine can handle TBs of data and 100s of billions of edges
• Optimized for ML: columnar transformations, feature creation, iterators, filter, join, group-by, aggregate, user-defined functions; easily extended through the SDK
• Handles tables, graphs, text, and images
• Open source, BSD license
The Dato Machine Learning Platform
• Predictive Services: serve models and manage the machine learning lifecycle
• GraphLab Create: train accurate, robust, and scalable models
Our customers