Le Song, Assistant Professor, College of Computing, Georgia Institute of Technology at MLconf ATL...

transcript

Understanding Deep Learning for

Big Data

Le Songhttp://www.cc.gatech.edu/~lsong/

College of ComputingGeorgia Institute of Technology

AlexNet: deep convolution neural networks

40964096

Rectified linear unit: 224

3.7 million parameters

cat/bike/…?

a benchmark image classification problem~ 1.3 million examples, ~ 1 thousand classes

Training is end-to-end

Minimize negative log-likelihood over data points

(Stochastic) gradient descent

AlexNet achieve~40%

top-1 error

Traditional image features not learned end-to-end

Handcrafted feature extractor

(eg. SIFT)

Divide imageto patches

Combine features

Learn classifier

Rectified linear unit:

Deep learning not fully understood

40964096

Fully connected layers crucial?

Convolution layers crucial?

Train end-to-end important?

Experiments

1. Fully connected layers crucial?

2. Convolution layers crucial?3. Learn parameters end-to-

end crucial?

Kernel methods: alternative nonlinear model

Combination of random basis functions

∑𝑖=1

𝛼𝑖exp (−‖𝑤𝑖−𝑥‖2 )

𝛼1 𝛼2 𝛼3 𝛼4 𝛼5 𝛼6 𝛼7

𝑤2 𝑤3 𝑤4 𝑤5 𝑤6 𝑤7 𝑥𝑤1

[Dai et al. NIPS 14]

Replace fully connected by kernel methodsI. Jointly trained neural nets (AlexNet)

II. Fixed neural nets

III. Scalable kernel methods [Dai et al. NIPS 14]

Learn Fix

Learn classifiers from a benchmark subset of~ 1.3 million examples, ~ 1 thousand classes

Kernel machine learns fasterImageNet 1.3M original images, and 1000 classesRandom cropping and mirroring images in streaming fashion

Number of training samples105

Test top-1 error

106 107 108

jointly-trained neural netfixed neural netdoubly SGD

Training 1 week using GPU

47.844.542.6

Random guessing 99.9% error

Similar results with MNIST8MClassification with handwritten digits8M images, 10 classes

LeNet5

Similar results with CIFAR10Classification with internet images60K images, 10 classes

Experiments

1. Fully connected layers crucial? No

2. Convolution layers crucial?3. Learn parameters end-to-

end crucial?

Kernel methods directly on inputs?

Fixed convolutionWithout convolution

MNIST0

0.20.40.60.8

2 convolution layerCIFAR10

10152025303540

2 convolution layersImageNet

020406080

5 convolution layers

Kernel methods + random convolutions?

Fixed convolutionWithout convolution Random convolution

MNIST0

0.20.40.60.8

2 convolution layerCIFAR10

10152025303540

2 convolution layers

# random conv

# fixed conv

Random

Structured composition usefulNot just fully connected layers, and plain

composition

Structured composition of nonlinear functions

the same function

Experiments

1. Fully connected layers crucial? No

2. Convolution layers crucial? Yes

3. Learn parameters end-to-end crucial?

Lots of random features used

58M parameters

131M parameters

AlexNet

ScalableKernel Method

40964096256

131M parameters needed?

58M parameters

32M parameters

AlexNetErro

40964096256

Basis function adaptation crucialIntegrated squared approximation error by basis function [Barron ‘93]

Error ofadapting basis function

Error offixed basis function

𝑓 (𝑥 )=∑𝑖=1

𝛼 𝑖𝑘 (𝑥 𝑖 ,𝑥 )

𝛼1𝛼2𝛼3 𝛼4𝛼5𝛼6 𝛼7

𝑥1 𝑥2 𝑥3 𝑥4 𝑥5 𝑥6 𝑥7

𝑘(𝑥 𝑖 ,𝑥)

𝑓 (𝑥 )=∑𝑖=1

𝛼 𝑖𝑘𝜃𝑖 (𝑥𝑖 , 𝑥 )

𝑥1 𝑥2

𝑘𝜃 𝑖(𝑥𝑖 , 𝑥)

𝛼1 𝛼2

Learning random features helps a lot

58M parameters

32M parametersLearn and basis adaptation

AlexNetErro

40964096256

Fix22/50

Learning convolution together helps more

58M parameters

32M parametersLearn and basis adaptation

AlexNetErro

40964096256

Jointly learn

Lesson learned:Exploit Structure & Train

End-to-EndDeep learning over (time-

varying) graph

Co-evolutionary features

ChristineAliceDavid Jacob

Item embedding

User embedding

User-item interactions evolve over time

User embedding

Co-evolutionary featuresItem embedding

User embedding

Item embedding

User embedding

Item embedding

User embedding

03/0206/02

User embedding

Co-evolutionary embedding

Initialize item embedding

Initialize user embedding

(𝑢𝑛 ,𝑖𝑛 ,𝑡𝑛 ,𝑞𝑛)

Item raw profile features

User raw profile features

DriftContext

EvolutionCo-evolutionUser Item𝑓 𝑖𝑛 (𝑡𝑛 )=h(

𝑉 1 ⋅ 𝑓 𝑖𝑛 (𝑡𝑛− )

+𝑉 2⋅ 𝑓 𝑢𝑛(𝑡𝑛−)

+𝑉 3 ⋅𝑞𝑛

+𝑉 4 ⋅(𝑡𝑛−𝑡𝑛− 1))Update

DriftContext

EvolutionCo-evolutionItemUser𝑓 𝑢𝑛

(𝑡𝑛 )=h(𝑊 1⋅ 𝑓 𝑢𝑛

(𝑡𝑛− )+𝑊 2 ⋅ 𝑓 𝑖𝑛 (𝑡𝑛

− )+𝑊 3 ⋅𝑞𝑛

+𝑊 4 ⋅(𝑡𝑛− 𝑡𝑛−1))Update

[Dai et al. Recsys16]

Deep learning with time-varying computation graph

𝑡 2

𝑡 3

𝑡 0

Mini-batch 1

Computation graph of RNN determined by

1. The bipartite interaction graph

2. The temporal ordering of events

Much improvement prediction on Reddit dataset

Next item prediction Return time prediction

1,000 users, 1403 groups, ~10K interactionsMAR: mean absolute rank differenceMAE: mean absolute error (hours)

Predicting efficiency of solar panel materials

Dataset Harvard clean energy project

Data point #

2.3 million

Type MoleculeAtom type 6Avg node #

Avg edge #

33Power Conversion Efficiency (PCE)

(0 -12 %)predict

Organic Solar Panel Materials

Structure2Vec

𝜇2(1)

𝜇2(0)

𝜇1(0)𝜇3(1)𝜇1(1)

……

𝜇2(𝑇 ) 𝜇3(𝑇 )𝜇1(𝑇 )

𝑋 6

𝑋 1

𝑋 2 𝑋 3

𝑋 4

𝑋 5

𝐻2 𝐻

𝜒𝜇6(0)

……

Iteration 1:

Iteration :

Label classification/regressionwith parameter

Aggregate𝜇1(𝑇 )

𝜇2(𝑇 ) +¿+¿⋮

¿𝜇𝑎(𝑊 , 𝜒 )

[Dai et al. ICML 16]

Improved prediction with small model

Structure2vec gets ~4% relative error with 10,000 times smaller model!

Test MAE Test RMSE

# parameter

predictor1.986 2.406 1

WL level-3 0.143 0.204 1.6 mWL level-6 0.096 0.137 1378 mstructure2v

ec0.085 0.117 0.1 m

10% data for testing

Take Home Message:

Deep fully connected layers not the key

Exploit structure (CNN, Coevolution, Structure2vec)

Train end-to-end