(CMP305) Deep Learning on AWS Made Easy


© 2015, Amazon Web Services, Inc. or its Affiliates. All rights reserved.

Danny Bickson, Co-founder, Dato

CMP305

Deep Learning on AWS Made Easy

October 2015

Who is Dato?

Seattle-based machine learning company

45+ people and growing fast!

Deep learning example


Image classification

Input: x = image pixels

Output: y = predicted object

Neural networks

Learning *very* non-linear features

Linear classifiers (binary)

Score(x) = w0 + w1 x1 + w2 x2 + … + wd xd

Predict class 1 when Score(x) > 0 and class 0 when Score(x) < 0.

Graph representation of classifier: useful for defining neural networks

[Diagram: inputs 1, x1, x2, …, xd feed a single output node y through weights w0, w1, w2, …, wd; the node outputs 1 if Score(x) > 0 and 0 if Score(x) < 0]

Score(x) = w0 + w1 x1 + w2 x2 + … + wd xd

What can a linear classifier represent?

x1 OR x2:  y = 1 when -0.5 + 1·x1 + 1·x2 > 0
x1 AND x2: y = 1 when -1.5 + 1·x1 + 1·x2 > 0

[Diagrams: each is a single unit with inputs 1, x1, x2 and output y; a quick check follows]
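To sanity-check those weights, here is a minimal Python sketch (not from the talk) that evaluates both units on all four binary inputs.

# Evaluate the OR and AND units from the slide, using the weights shown above.
def threshold(v):
    return 1 if v > 0 else 0

def linear_unit(w0, w1, w2, x1, x2):
    return threshold(w0 + w1 * x1 + w2 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        or_out = linear_unit(-0.5, 1, 1, x1, x2)    # x1 OR x2
        and_out = linear_unit(-1.5, 1, 1, x1, x2)   # x1 AND x2
        print(x1, x2, or_out, and_out)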

What can't a simple linear classifier represent?

XOR, the counterexample to everything. We need non-linear features.

Solving the XOR problem: adding a layer

XOR = (x1 AND NOT x2) OR (NOT x1 AND x2)

z1 = threshold(-0.5 + 1·x1 - 1·x2)
z2 = threshold(-0.5 - 1·x1 + 1·x2)
y  = threshold(-0.5 + 1·z1 + 1·z2)

Each unit's output is thresholded to 0 or 1.
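The following minimal Python sketch (not from the talk) runs this two-layer network over all four inputs to confirm it computes XOR.

# Two-layer XOR network with the weights from the slide.
def threshold(v):
    return 1 if v > 0 else 0

def xor_net(x1, x2):
    z1 = threshold(-0.5 + 1 * x1 - 1 * x2)    # x1 AND NOT x2
    z2 = threshold(-0.5 - 1 * x1 + 1 * x2)    # NOT x1 AND x2
    return threshold(-0.5 + 1 * z1 + 1 * z2)  # z1 OR z2

for a in (0, 1):
    for b in (0, 1):
        print(a, b, xor_net(a, b))  # prints the XOR truth table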

A neural network

• Layers and layers and layers of linear models and non-linear transformations
• Around for about 50 years
• In the last few years, a big resurgence
  - Impressive accuracy on several benchmark problems
  - Advances in hardware make the computation practical (e.g., AWS g2 instances)

[Diagram: a two-layer network with inputs 1, x1, x2, hidden units z1, z2, and output y]

Application of deep learning

to computer vision

Feature detection – the traditional approach

• Features = local detectors, combined to make a prediction
  - (In reality, features are more low-level)

[Diagram: eye, eye, nose, and mouth detectors combine to predict "Face!"]

Many hand-created features exist for finding interest points…

• SIFT [Lowe '99]
• Spin Images [Johnson & Hebert '99]
• Textons [Malik et al. '99]
• RIFT [Lazebnik '04]
• GLOH [Mikolajczyk & Schmid '05]
• HoG [Dalal & Triggs '05]
• …

Standard image classification approach

Input → Extract features (hand-created) → Use a simple classifier (e.g., logistic regression, SVMs) → Face?

Many hand-created features exist for finding interest points (SIFT, Spin Images, Textons, RIFT, GLOH, HoG, …)

… but they are very painful to design

Deep learning: implicitly learns features

[Figure: Zeiler & Fergus '13 visualizations of example detectors learned and example interest points detected at Layer 1, Layer 2, Layer 3, and the prediction layer]

Deep learning performance

Deep learning accuracy

• German traffic sign recognition benchmark: 99.5% accuracy (IDSIA team)
• House number recognition: 97.8% accuracy per character [Goodfellow et al. '13]

ImageNet 2012 competition: 1.2M training images, 1,000 categories

[Chart: error (best of 5 guesses) for the top 3 teams; SuperVision shows a huge gain over ISI and OXFORD_VGG, which exploited hand-coded features like SIFT]

ImageNet 2012 competition: 1.2M training images, 1,000 categories
Winning entry: SuperVision, 8 layers, 60M parameters [Krizhevsky et al. '12]

Achieving these amazing results required:
• New learning algorithms
• GPU implementation

Deep learning performance
• ImageNet: 1.2M images

[Chart: training running time in hours on g2.xlarge vs. g2.8xlarge instances]

Deep learning in computer vision

Scene parsing with deep learning

[Farabet et al. ‘13]

Retrieving similar images: input image → nearest neighbors

Deep learning usability

Designed a simple user interface

# train the model
import graphlab
model = graphlab.neuralnet.create(train_images)

# predict classes for new images
outcome = model.predict(test_images)

Deep learning demo

Challenges of deep learning

Deep learning score card

Pros

• Enables learning of features rather than hand tuning

• Impressive performance gains

- Computer vision

- Speech recognition

- Some text analysis

• Potential for more impact

Deep learning workflow

Lots of labeled data → split into a training set and a validation set → learn a deep neural net → validate → adjust parameters, network architecture, … and repeat


Many tricks needed to work well…

Different types of layers, connections,… needed for high accuracy

[Krizhevsky et al. ’12]

Deep learning score card

Pros (as above): learns features rather than hand tuning, impressive performance gains, potential for more impact

Cons
• Requires a lot of data for high accuracy
• Computationally really expensive
• Extremely hard to tune
  - Choice of architecture
  - Parameter types
  - Hyperparameters
  - Learning algorithm
  - …

Computational cost + so many choices = incredibly hard to tune

Deep features:

Deep learning

+

Transfer learning

Standard image classification approach: input → extract hand-created features → use a simple classifier (e.g., logistic regression, SVMs) → Face?

Can we learn features from data, even when we don't have much data or time?

What's learned in a neural net trained for Task 1 (cat vs. dog)?

[Diagram: the earlier layers are more generic and can be used as a feature extractor; the later layers are very specific to Task 1 and should be ignored for other tasks]

Transfer learning in more detail…

Start from the neural net trained for Task 1 (cat vs. dog). The earlier, more generic layers can be reused as a feature extractor; the later layers, which are very specific to Task 1, should be ignored for other tasks.

For Task 2 (predicting 101 categories), keep those early weights fixed and learn only the end part of the net: a simple classifier (e.g., logistic regression, SVMs, nearest neighbor, …) that predicts the class.

Careful where you cut: the latter layers may be too task specific

[Figure: Zeiler & Fergus '13 visualization across Layer 1, Layer 2, Layer 3, and the prediction layer; use the earlier layers (example detectors learned, example interest points detected), since the later ones are too specific for the new task]

Transfer learning with deep features workflow

Some labeled data → split into a training set and a validation set → extract features with a neural net trained on a different task → learn a simple classifier → validate (a sketch follows)
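Here is a minimal GraphLab Create-style sketch of that workflow, not taken from the talk: it assumes a pretrained network saved as 'imagenet_model', an extract_features method on that model, and a dataset with 'image' and 'label' columns, all of which are illustrative assumptions.

import graphlab as gl

# Illustrative names: a pretrained ImageNet network and a small labeled dataset.
pretrained = gl.load_model('imagenet_model')
train = gl.SFrame('my_small_dataset')

# Use the early, generic layers as a feature extractor (weights stay fixed).
train['deep_features'] = pretrained.extract_features(train)

# Learn only a simple classifier on top of the deep features.
classifier = gl.logistic_classifier.create(train,
                                           features=['deep_features'],
                                           target='label')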

How general are deep features?

Barcelona Buildings

Architectural transition

Deep learning in production on

AWS

How to use deep learning in production?

• Predictive: understands input & takes actions or makes decisions
• Interactive: responds in real time
• Learning: improves its performance with experience

Intelligent service at the core…

[Diagram: your intelligent application exchanges real-time data and predictions & decisions with an intelligent backend service; behind it, historical data feeds a machine learning model that produces the predictions & decisions. Most ML research happens on the model, but ML research is useless without a great solution for the serving side.]

Essential ingredients of an intelligent service

• Responsive: intelligent applications are interactive; they need low latency, high throughput & high availability
• Adaptive: ML models are out-of-date the moment learning is done; we need to constantly understand & improve end-to-end performance
• Manageable: many thousands of models, created by hundreds of people; we need versioning, attribution, provenance & reproducibility

Responsive: Now and Always


Addressing latency

Challenge: scoring latency

Compute predictions in < 20 ms for complex models and queries, all while under heavy query load.

[Diagram: models and queries combine features (e.g., SELECT * FROM users JOIN items, click_logs, pages WHERE …) to produce top-K results]

The common solutions to latency

• Faster online model scoring: "Execute Predict(query) in real time as queries arrive"
• Pre-materialization and lookup: "Pre-compute Predict(query) for all queries and look up the answer at query time"

Dato Predictive Services does both.

Faster online model scoring: highly optimized machine learning

• SFrame: native code, optimized data frame
  - Available open source (BSD)
• Model querying acceleration with native code, e.g.,
  - Top-K and nearest neighbor evaluation: LSH, ball trees, …


Smart materialization caching

[Chart: query frequency vs. unique queries, a heavily skewed distribution]

Example: the top 10% of all unique queries cover 90% of all queries performed. Caching a small number of unique queries has a very large impact.

Distributed shared caching

A distributed shared cache (Redis) stores:
• Model query results
• Common features (e.g., product info)

Scale-out improves throughput and latency (a sketch of the caching pattern follows).
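Below is a minimal look-aside caching sketch of the pattern described above; it is illustrative only, not Dato's implementation, and assumes a local Redis server plus a model object with a predict method.

import json
import redis

# Illustrative cache client; key prefix and TTL are arbitrary choices.
cache = redis.Redis(host='localhost', port=6379)

def cached_predict(model, query, ttl_seconds=300):
    key = 'predict:' + json.dumps(query, sort_keys=True)
    hit = cache.get(key)
    if hit is not None:
        return json.loads(hit)                      # easy case: cache hit
    result = model.predict(query)                   # hard case: score the model
    cache.setex(key, ttl_seconds, json.dumps(result))
    return result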

Dato latency by the numbers

• Easy case: cache hit, ~2 ms
• Hard case: cache miss
  - Simple linear models: 5-6 ms
  - Complex random forests: 7-8 ms (P99: ~15 ms)

[Using an AWS m3.xlarge instance]

Challenge: availability

• Heavy load → substantial delays
• Frequent model updates → cache misses
• Machine failures

Scale out for availability under load

[Diagram: under heavy load, an Elastic Load Balancing load balancer spreads requests across serving nodes]

Adaptive:

Accounting for Constant Change


Change at different scales and rates

[Diagram: "shopping for Mom" vs. "shopping for me"; the rate of change ranges from months to minutes, and the granularity of change ranges from the whole population to a single session]

Individual- and session-level change (minutes, not months) means small data: handle it with online learning and with bandits to assess models.

The dangerous feedback loop

I once looked at cameras on Amazon… and now I am shown similar cameras, accessories, and bags. If this is all they show, how would they learn that I also like bikes and shoes?

Exploration / exploitation tradeoff

Systems that can take actions can adversely affect future data.

• Exploration: take a random action to learn more about what is good and bad
• Exploitation: take the best action to make the best use of what we believe is good

(A sketch follows.)
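As a hedged illustration of this tradeoff, here is a minimal epsilon-greedy sketch in Python; it is not Dato's bandit implementation, and the model/reward bookkeeping is invented for the example.

import random

def choose_model(models, avg_reward, epsilon=0.1):
    # Pick which deployed model serves the next query.
    if random.random() < epsilon:
        return random.choice(models)                 # explore: random action
    return max(models, key=lambda m: avg_reward[m])  # exploit: best action so far

def update_reward(avg_reward, counts, model, reward):
    # Incrementally update the average observed reward for a model.
    counts[model] += 1
    avg_reward[model] += (reward - avg_reward[model]) / counts[model]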

Dato Solution to Adaptivity

• Rapid offline learning with GraphLab Create
• Online bandit adaptation in Predictive Services
• Demo

Manageable:

Unification and simplification


Ecosystem of intelligent services

[Diagram: data infrastructure (MySQL tables A and B), data science (models A and B), and serving (services A and B) live in separate systems]

Complicated! Many systems, with overlapping roles, and no single source of truth for the intelligent service.


Dato Predictive Services

Responsive Adaptive Manageable

Model management: like code management, but for the life cycle of intelligent applications

Provenance & reproducibility
• Track changes & roll back
• Cover code, model type, parameters, data…

Collaboration
• Review, blame
• Share
• Common feature engineering pipelines

Continuous integration
• Deploy & update
• Measure & improve
• Avoid downtime and impact on end users

Dato Predictive Services (Responsive, Adaptive, Manageable): serving models and managing the machine learning lifecycle

GraphLab Create: accurate, robust, and scalable model training

GraphLab Create: sophisticated machine learning made easy

• High-level ML toolkits
• AutoML: parameter tuning, model selection, … so you can focus on the creative parts
• Reusable features: transferrable feature engineering, accuracy with less data & less effort

High-level ML toolkits: get started with 4 lines of code, then modify, blend, add yours…

Recommender, image search, sentiment analysis, data matching, auto tagging, churn predictor, object detector, product sentiment, click prediction, fraud detection, user segmentation, data completion, anomaly detection, document clustering, forecasting, search ranking, summarization, …

import graphlab as gl

data = gl.SFrame.read_csv('my_data.csv')

model = gl.recommender.create(data,
                              user_id='user',
                              item_id='movie',
                              target='rating')

recommendations = model.recommend(k=5)

SFrame and SGraph ❤️ all ML tools

SFrame: sophisticated machine learning made scalable

Opportunity for out-of-core ML

[Chart: storage tiers by capacity and throughput, roughly 0.1 TB at 1 GB/s (fast, but significantly limits data size), 1 TB at 0.5 GB/s, and 10 TB at 0.1 GB/s, the opportunity for big data on 1 machine, but for sequential reads only; random access is very slow]

The out-of-core ML opportunity is huge

• Usual design → lots of random access → slow
• Instead, design to maximize sequential access for ML algorithm access patterns
• GraphChi was an early example
• SFrame is the data frame for ML (a usage sketch follows)
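To make the columnar, sequential-access workflow above concrete, here is a minimal SFrame sketch; the file names and column names are made up for illustration.

import graphlab as gl

# Illustrative inputs; SFrames stream from disk rather than requiring RAM-sized data.
clicks = gl.SFrame.read_csv('clicks.csv')
items = gl.SFrame.read_csv('items.csv')

# Columnar transformation: derive a feature from an existing column.
clicks['hour'] = clicks['timestamp'].apply(lambda t: int(t // 3600) % 24)

# Filter, join, and group-by/aggregate.
evening = clicks[clicks['hour'] >= 18]
joined = evening.join(items, on='item_id')
per_item = joined.groupby('item_id',
                          {'clicks': gl.aggregate.COUNT(),
                           'avg_price': gl.aggregate.MEAN('price')})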

Performance of SFrame/SGraph

Connected components on the Twitter graph (41 million nodes, 1.4 billion edges); source: Gonzalez et al. (OSDI 2014):

• GraphLab Create (SGraph, 1 machine): 70 sec
• GraphX (16 machines): 251 sec
• Giraph (16 machines): 200 sec
• Spark (16 machines): 2,128 sec
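As a hedged illustration of the single-machine graph analytics above, here is a sketch using GraphLab Create's graph toolkit; the edge file and column names are assumptions, not the actual benchmark setup.

import graphlab as gl

# Illustrative edge list with 'src' and 'dst' columns.
edges = gl.SFrame.read_csv('twitter_edges.csv')
g = gl.SGraph().add_edges(edges, src_field='src', dst_field='dst')

# Connected components via the graph analytics toolkit.
cc = gl.connected_components.create(g)
components = cc['component_id']  # per-vertex component assignments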

SFrame & SGraph

• Optimized out-of-core computation for ML
• High performance: 1 machine can handle TBs of data and 100s of billions of edges
• Optimized for ML: columnar transformations, feature creation, iterators, filter, join, group-by, aggregate, user-defined functions; easily extended through the SDK
• Tables, graphs, text, images
• Open source ❤️ BSD license

The Dato Machine Learning Platform

• Predictive Services: serve models and manage the machine learning lifecycle
• GraphLab Create: train accurate, robust, and scalable models


Our customers