© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Danny Bickson, Co-founder DATO
CMP305
Deep Learning on AWS Made Easy
October 2015
Who is Dato?
Seattle-based machine learning company
45+ employees and growing fast!
Deep learning example
©Dato
Image classification
Input: x = image pixels
Output: y = predicted object
Neural networks
Learning *very* non-linear features
Linear classifiers (binary)
Score(x) > 0 → class 1; Score(x) < 0 → class 0
Score(x) = w0 + w1 x1 + w2 x2 + … + wd xd
Graph representation of classifier: useful for defining neural networks
(Diagram: inputs x1, …, xd plus a constant 1, each weighted by w0, w1, …, wd, feed a single output node y: output 1 if Score(x) > 0, output 0 if Score(x) < 0.)
Score(x) = w0 + w1 x1 + w2 x2 + … + wd xd
What can a linear classifier represent?
x1 OR x2: y = 1 if −0.5 + 1·x1 + 1·x2 > 0, else 0
x1 AND x2: y = 1 if −1.5 + 1·x1 + 1·x2 > 0, else 0
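These weights can be checked directly. A minimal Python sketch (not from the talk; the helper names are mine) of the linear threshold classifier with the OR and AND weights above:

```python
def score(w, x):
    # w[0] is the bias w0; x is the feature vector (x1, ..., xd)
    return w[0] + sum(wi * xi for wi, xi in zip(w[1:], x))

def classify(w, x):
    # threshold the score at zero
    return 1 if score(w, x) > 0 else 0

w_or = (-0.5, 1, 1)    # x1 OR x2
w_and = (-1.5, 1, 1)   # x1 AND x2

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, classify(w_or, x), classify(w_and, x))
```

Running the loop reproduces the OR and AND truth tables.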
What can't a simple linear classifier represent?
XOR: the counterexample to everything.
Need non-linear features.
Solving the XOR problem: adding a layer
XOR = (x1 AND NOT x2) OR (NOT x1 AND x2)
z1 = 1 if −0.5 + 1·x1 − 1·x2 > 0, else 0 (x1 AND NOT x2)
z2 = 1 if −0.5 − 1·x1 + 1·x2 > 0, else 0 (NOT x1 AND x2)
y = 1 if −0.5 + 1·z1 + 1·z2 > 0, else 0 (z1 OR z2)
Each unit is thresholded to 0 or 1.
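The two-layer solution can be verified in a few lines. A sketch (the helper names are mine, not a Dato API) of the XOR network with exactly the weights above:

```python
def unit(bias, weights, inputs):
    # threshold unit: output 1 if bias + w·x > 0, else 0
    s = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if s > 0 else 0

def xor_net(x1, x2):
    z1 = unit(-0.5, (1, -1), (x1, x2))   # x1 AND NOT x2
    z2 = unit(-0.5, (-1, 1), (x1, x2))   # NOT x1 AND x2
    return unit(-0.5, (1, 1), (z1, z2))  # z1 OR z2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_net(x1, x2))
```

The hidden layer is what makes this possible: no single linear threshold unit can compute XOR.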
A neural network
• Layers and layers and layers of linear models and non-linear transformations
• Around for about 50 years
• In the last few years, a big resurgence
- Impressive accuracy on several benchmark problems
- Advances in hardware enable the computation (e.g., AWS g2 instances)
(Diagram: inputs x1, x2 and a constant 1 feed hidden units z1, z2, which with a constant 1 feed output y.)
Application of deep learning
to computer vision
Feature detection: traditional approach
• Features = local detectors
- Combined to make the prediction
- (In reality, features are more low-level)
(Diagram: eye, eye, nose, and mouth detectors combine into the prediction "Face!")
Many hand-created features exist for finding interest points…
• SIFT [Lowe '99]
• Spin Images [Johnson & Herbert '99]
• Textons [Malik et al. '99]
• RIFT [Lazebnik '04]
• GLOH [Mikolajczyk & Schmid '05]
• HoG [Dalal & Triggs '05]
• …
Standard image classification approach
Input → extract hand-created features → use simple classifier (e.g., logistic regression, SVMs) → Face?
… but hand-created features are very painful to design.
Deep learning:
implicitly learns features
(Figure: Layer 1 → Layer 2 → Layer 3 → Prediction, with example detectors learned and example interest points detected at each layer. [Zeiler & Fergus '13])
Deep learning performance
Deep learning accuracy:
• German traffic sign recognition benchmark: 99.5% accuracy (IDSIA team)
• House number recognition: 97.8% accuracy per character [Goodfellow et al. '13]
ImageNet 2012 competition: 1.2M training images, 1000 categories
(Chart: error rate, best of 5 guesses, for the top 3 teams: SuperVision, ISI, and OXFORD_VGG. SuperVision shows a huge gain; the other teams exploited hand-coded features like SIFT.)
ImageNet 2012 competition: 1.2M training images, 1000 categories
Winning entry: SuperVision, 8 layers, 60M parameters [Krizhevsky et al. '12]
Achieving these amazing results required:
• New learning algorithms
• GPU implementation
Deep learning performance
• ImageNet: 1.2M images
(Chart: running time in hours on g2.xlarge vs. g2.8xlarge instances.)
Deep learning in computer vision
Scene parsing with deep learning
[Farabet et al. ‘13]
Retrieving similar images: input image → nearest neighbors
Deep learning usability
Designed a simple user interface
import graphlab

# Training the model
model = graphlab.neuralnet.create(train_images)

# Predicting classes for new images
outcome = model.predict(test_images)
Deep learning demo
Challenges of deep learning
Deep learning score card: Pros
• Enables learning of features rather than hand tuning
• Impressive performance gains
- Computer vision
- Speech recognition
- Some text analysis
• Potential for more impact
Deep learning workflow
Lots of labeled data → split into a training set and a validation set → learn a deep neural net → validate → adjust parameters, network architecture, …
Many tricks needed to work well…
Different types of layers, connections, … are needed for high accuracy [Krizhevsky et al. '12]
Deep learning score card: Cons
• Requires a lot of data for high accuracy
• Computationally really expensive
• Extremely hard to tune
- Choice of architecture
- Parameter types
- Hyperparameters
- Learning algorithm
- …
Computational cost + so many choices = incredibly hard to tune
Deep features:
Deep learning
+
Transfer learning
Standard image classification approach
Input → extract hand-created features → use simple classifier (e.g., logistic regression, SVMs) → Face?
Can we learn features from data, even when we don't have much data or time?
What's learned in a neural net
Neural net trained for Task 1: cat vs. dog
• Earlier layers are more generic: they can be used as a feature extractor
• The end layers are very specific to Task 1: they should be ignored for other tasks
Transfer learning in more detail…
Neural net trained for Task 1: cat vs. dog
• Keep the weights of the generic earlier layers fixed and use them as a feature extractor
• For Task 2 (predicting 101 categories), learn only the end part of the neural net: a simple classifier (e.g., logistic regression, SVMs, nearest neighbor, …) that outputs the class
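As an illustration only (GraphLab Create is not used here, and the "pretrained" net is faked with fixed random weights), a NumPy sketch of the recipe: freeze the feature-extractor layers, then train a simple logistic classifier for the new task on the extracted features:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a pretrained net: two fixed (frozen) ReLU layers.
# In practice these weights would come from a net trained on Task 1.
W1, W2 = rng.normal(size=(4, 16)), rng.normal(size=(16, 8))

def deep_features(X):
    # Keep weights fixed: forward pass only, no gradient updates here.
    return np.maximum(np.maximum(X @ W1, 0) @ W2, 0)

# Task 2: a small labeled set; learn only a simple classifier on top.
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

F = deep_features(X)
w, b = np.zeros(F.shape[1]), 0.0
for _ in range(500):                     # plain gradient descent
    p = 1 / (1 + np.exp(-(F @ w + b)))   # sigmoid probabilities
    g = p - y                            # logistic-loss gradient signal
    w -= 0.01 * F.T @ g / len(y)
    b -= 0.01 * g.mean()

acc = ((F @ w + b > 0) == (y == 1)).mean()
print("train accuracy:", acc)
```

The point is the split: all the expensive representation learning is frozen, and only the cheap final classifier is trained on the new task's small labeled set.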
Careful where you cut: latter layers may be too task-specific
(Figure: Layer 1 → Layer 2 → Layer 3 → Prediction, with example detectors learned and example interest points detected. [Zeiler & Fergus '13] Use the earlier layers; the latter layers are too specific for the new task.)
Transfer learning with deep features workflow
Some labeled data → extract features with a neural net trained on a different task → split into a training set and a validation set → learn a simple classifier → validate
How general are deep features?
Barcelona Buildings
Architectural transition
Deep learning in production on
AWS
How to use deep learning in production?
• Predictive: understands input & takes actions or makes decisions
• Interactive: responds in real time
• Learning: improves its performance with experience
(Diagram: your intelligent application exchanges real-time data with an intelligent backend service and receives predictions & decisions; historical data feeds a machine learning model, which supplies those predictions & decisions. Most ML research happens on the model side, but that research is useless without a great solution on the serving side.)
Essential ingredients of an intelligent service
• Responsive: intelligent applications are interactive; they need low latency, high throughput & high availability
• Adaptive: ML models are out-of-date the moment learning is done; we need to constantly understand & improve end-to-end performance
• Manageable: many thousands of models, created by hundreds of people; we need versioning, attribution, provenance & reproducibility
Responsive: Now and Always
Intelligent applications are interactive; they need low latency, high throughput & high availability.
Addressing latency
Challenge: scoring latency
Compute predictions in < 20 ms for complex models, all while under heavy query load.
(Diagram: models and queries, with top-K retrieval and feature lookups such as SELECT * FROM users JOIN items, click_logs, pages WHERE …)
The common solutions to latency
• Faster online model scoring: "execute Predict(query) in real time as queries arrive"
• Pre-materialization and lookup: "pre-compute Predict(query) for all queries and look up the answer at query time"
Dato Predictive Services does both.
Faster online model scoring: highly optimized machine learning
• SFrame: native code, optimized data frame
- Available open source (BSD)
• Model querying acceleration with native code, e.g.,
- top-K and nearest neighbor evaluation: LSH, ball trees, …
Smart materialization caching
(Chart: query frequency vs. unique queries, a long-tailed distribution.)
Example: the top 10% of all unique queries cover 90% of all queries performed. Caching a small number of unique queries has a very large impact.
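A small self-contained simulation (the workload and numbers are synthetic, not Dato's) showing why caching a small fraction of unique queries pays off on a skewed workload:

```python
import random
from collections import Counter

random.seed(0)

# Simulate a heavy-tailed (Zipf-like) query workload:
# a few query values occur very often, most occur rarely.
queries = [int(random.paretovariate(1.0)) for _ in range(100_000)]

counts = Counter(queries)
unique = [q for q, _ in counts.most_common()]  # most frequent first

# Cache only the most frequent 10% of unique queries.
cached = set(unique[: max(1, len(unique) // 10)])
hit_rate = sum(counts[q] for q in cached) / len(queries)
print(f"{len(cached)} of {len(unique)} unique queries cached, "
      f"hit rate = {hit_rate:.0%}")
```

On a skewed workload like this one, the small cache absorbs the large majority of traffic, which is exactly the effect the slide describes.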
Distributed shared caching
The distributed shared cache (Redis) stores:
• Model query results
• Common features (e.g., product info)
Scale-out improves throughput and latency.
Dato latency by the numbers
Easy case, cache hit: ~2 ms
Hard case, cache miss:
• Simple linear models: 5-6 ms
• Complex random forests: 7-8 ms (P99: ~15 ms)
[using an AWS m3.xlarge instance]
Challenge: availability
• Heavy load → substantial delays
• Frequent model updates → cache misses
• Machine failures
Scale-out availability under load
(Diagram: heavy load spread by an Elastic Load Balancing load balancer across serving nodes.)
Adaptive: Accounting for Constant Change
ML models are out-of-date the moment learning is done; we need to constantly understand & improve end-to-end performance.
Change at different scales and rates
• Shopping for Mom vs. shopping for me
• Rate of change: from months down to minutes
• Granularity of change: from the whole population down to a single session
Individual- and session-level change calls for:
• Small data
• Online learning
• Bandits to assess models
The dangerous feedback loop
I once looked at cameras on Amazon… and now it shows me bags, similar cameras, and accessories.
If this is all they showed, how would they learn that I also like bikes and shoes?
Exploration / exploitation tradeoff
Systems that can take actions can adversely affect future data.
• Exploration (random action): learn more about what is good and bad
• Exploitation (best action): make the best use of what we believe is good
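One standard way to balance the two sides of this tradeoff is an epsilon-greedy bandit; a sketch with made-up reward rates (think of three candidate models with different success rates, not anything from Dato's system):

```python
import random

random.seed(0)

def epsilon_greedy(true_rates, steps=10_000, eps=0.1):
    # Per-action pull counts and running mean reward estimates.
    n_actions = len(true_rates)
    counts = [0] * n_actions
    values = [0.0] * n_actions
    for _ in range(steps):
        if random.random() < eps:          # explore: random action
            a = random.randrange(n_actions)
        else:                              # exploit: best action so far
            a = max(range(n_actions), key=lambda i: values[i])
        # Simulated Bernoulli reward for the chosen action.
        reward = 1.0 if random.random() < true_rates[a] else 0.0
        counts[a] += 1
        values[a] += (reward - values[a]) / counts[a]  # running mean
    return counts, values

counts, values = epsilon_greedy([0.2, 0.5, 0.8])
print(counts)  # the highest-rate action should dominate the pulls
```

The epsilon fraction of random actions keeps collecting data on all actions, so the system can still discover changes instead of locking itself into the feedback loop above.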
Dato solution to adaptivity
• Rapid offline learning with GraphLab Create
• Online bandit adaptation in Predictive Services
• Demo
Manageable: Unification and Simplification
Many thousands of models, created by hundreds of people; we need versioning, attribution, provenance & reproducibility.
Ecosystem of intelligent services
(Diagram: data infrastructure (MySQL), serving, and data science layers, spanning Model A, Model B, Table A, Table B, Service A, and Service B.)
Complicated! Many systems with overlapping roles, and no single source of truth for the intelligent service.
Dato Predictive Services
Responsive • Adaptive • Manageable
Model management: like code management, but for the life cycle of intelligent applications
Provenance & reproducibility
• Track changes & roll back
• Cover code, model type, parameters, data, …
Collaboration
• Review, blame
• Share
• Common feature engineering pipelines
Continuous integration
• Deploy & update
• Measure & improve
• Avoid downtime and impact on end users
Dato Predictive Services: serving models and managing the machine learning lifecycle
GraphLab Create: accurate, robust, and scalable model training
GraphLab Create: sophisticated machine learning made easy
• High-level ML toolkits
• AutoML: tunes parameters, model selection, … so you can focus on the creative parts
• Reusable features: transferrable feature engineering, for accuracy with less data & less effort
High-level ML toolkits: get started with 4 lines of code, then modify, blend, add yours…
Recommender, image search, sentiment analysis, data matching, auto tagging, churn predictor, object detector, product sentiment, click prediction, fraud detection, user segmentation, data completion, anomaly detection, document clustering, forecasting, search ranking, summarization, …
import graphlab as gl

data = gl.SFrame.read_csv('my_data.csv')
model = gl.recommender.create(data,
                              user_id='user',
                              item_id='movie',
                              target='rating')
recommendations = model.recommend(k=5)
(Diagram: SFrame and SGraph ❤️ all ML tools.)
SFrame: sophisticated machine learning made scalable
Opportunity for out-of-core ML
(Chart: storage capacity vs. throughput.)
• 0.1 TB at 1 GB/s: fast, but significantly limits data size
• 1 TB at 0.5 GB/s: the opportunity for big data on 1 machine
• 10 TB at 0.1 GB/s: for sequential reads only; random access is very slow
The out-of-core ML opportunity is huge
• The usual design → lots of random access → slow
• Instead, design to maximize sequential access for ML algorithm access patterns
• GraphChi was an early example; SFrame is the data frame for ML
Performance of SFrame/SGraph
Connected components on the Twitter graph (41 million nodes, 1.4 billion edges):
• GraphLab Create (SGraph), 1 machine: 70 sec
• GraphX, 16 machines: 251 sec
• Giraph, 16 machines: 200 sec
• Spark, 16 machines: 2,128 sec
Source(s): Gonzalez et al. (OSDI 2014)
SFrame & SGraph
• High performance: optimized out-of-core computation for ML
• 1 machine can handle TBs of data and 100s of billions of edges
• Optimized for ML: columnar transformations, feature creation, iterators, filter, join, group-by, aggregate, user-defined functions; easily extended through the SDK
• Handles tables, graphs, text, and images
• Open source, BSD license
The Dato Machine Learning Platform
• Predictive Services: serve models and manage the machine learning lifecycle
• GraphLab Create: train accurate, robust, and scalable models
Our customers