Date posted: 19-Jan-2017 | Category: Technology | Uploaded by: mlconf
Understanding Deep Learning for Big Data

Le Song, http://www.cc.gatech.edu/~lsong/
College of Computing, Georgia Institute of Technology
AlexNet: deep convolutional neural networks

[Architecture diagram: 224x224x3 input; convolution layers with 11x11, 5x5, and 3x3 filters producing 96, 256, 384, 384, and 256 feature maps (13x13 spatial resolution at the last convolution stages); rectified linear units throughout; fully connected layers of 4096, 4096, and 1000 units. Convolution layers: 3.7 million parameters; fully connected layers: 58.6 million parameters. The network models Pr(Label | Image), e.g. cat/bike/…]
ImageNet: a benchmark image classification problem with ~1.3 million examples and ~1 thousand classes.
Training is end-to-end: minimize the negative log-likelihood of Pr(Label | Image) over the data points using (stochastic) gradient descent. AlexNet achieves ~40% top-1 error.
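The recipe on this slide — minimize the negative log-likelihood with (stochastic) gradient descent — can be sketched on a toy problem. This is a linear softmax classifier on synthetic data, not AlexNet; all sizes here are illustrative:

```python
import numpy as np

# Toy stand-in for end-to-end training: a linear softmax classifier
# trained by minibatch SGD on the negative log-likelihood (NOT AlexNet).
rng = np.random.default_rng(0)
n, d, k = 512, 20, 3                        # examples, input dim, classes
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=(d, k))).argmax(axis=1)   # synthetic labels
W = np.zeros((d, k))

def nll(W):
    logits = X @ W
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    logp = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -logp[np.arange(n), y].mean()           # negative log-likelihood

for step in range(300):                     # minibatch SGD
    idx = rng.integers(0, n, size=32)
    Xb, yb = X[idx], y[idx]
    logits = Xb @ W
    logits -= logits.max(axis=1, keepdims=True)
    p = np.exp(logits)
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(yb)), yb] -= 1.0        # dNLL/dlogits for softmax
    W -= 0.1 * Xb.T @ p / len(yb)

final_loss = nll(W)                         # starts at log(k) when W = 0
```

The same gradient structure scales up to AlexNet; only the model inside `logits = ...` changes.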
Traditional image features are not learned end-to-end: divide the image into patches, run a handcrafted feature extractor (e.g. SIFT), combine the features, then learn a classifier on top.
Deep learning is not fully understood

[Same AlexNet diagram as before: convolution layers (3.7 million parameters) followed by fully connected layers (58.6 million parameters), modeling Pr(Label | Image).]

Are the fully connected layers crucial? Are the convolution layers crucial? Is training end-to-end important?
Experiments
1. Fully connected layers crucial?
2. Convolution layers crucial?
3. Learning parameters end-to-end crucial?
Kernel methods: an alternative nonlinear model — a combination of random basis functions [Dai et al. NIPS 14]:

f(x) = \sum_{i=1}^{7} \alpha_i \exp(-\|w_i - x\|^2)

with weights \alpha_1, \ldots, \alpha_7 on basis functions centered at w_1, \ldots, w_7.
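As a minimal sketch of the formula above (sizes and data invented; only the weights \alpha_i are fit, while the centers w_i stay random and fixed, matching the "random basis" idea):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, m = 200, 5, 7                    # data points, input dim, basis functions
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0])                    # illustrative regression target

W = rng.normal(size=(m, d))            # random, FIXED basis centers w_i

def features(X, W):
    # phi_i(x) = exp(-||w_i - x||^2), one column per basis function
    sq = ((X[:, None, :] - W[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq)

Phi = features(X, W)
alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # fit the weights only
pred = Phi @ alpha                                # f(x) = sum_i alpha_i phi_i(x)
```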
Replace the fully connected layers by kernel methods:
I. Jointly trained neural nets (AlexNet): learn all layers.
II. Fixed neural nets: fix the convolution layers, learn the fully connected layers.
III. Scalable kernel methods [Dai et al. NIPS 14]: fix the convolution layers, learn the kernel machine.
Learn classifiers on a benchmark subset of ~1.3 million examples and ~1 thousand classes.

Kernel machine learns faster. ImageNet: 1.3M original images, 1000 classes; random cropping and mirroring of images in a streaming fashion.

[Plot: test top-1 error (%) vs. number of training samples (10^5 to 10^8); the jointly-trained neural net reaches 42.6%, doubly SGD 44.5%, the fixed neural net 47.8%. Training takes 1 week using a GPU; random guessing gives 99.9% error.]
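Doubly SGD ("doubly stochastic" gradients, [Dai et al. NIPS 14]) samples both a random data point and a random feature at each step, so the kernel matrix is never materialized. A heavily simplified sketch for squared-loss regression with random Fourier features — the step sizes, data, and the omitted regularization are my assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 3
X = rng.normal(size=(1000, d))
y = np.cos(X.sum(axis=1))              # illustrative regression target

omegas, bs, coefs = [], [], []         # one random feature added per iteration

def predict(x):
    # f(x) = sum_t a_t * cos(omega_t . x + b_t), built up over iterations
    if not coefs:
        return 0.0
    Om, b, a = np.array(omegas), np.array(bs), np.array(coefs)
    return float(a @ np.cos(Om @ x + b))

for t in range(300):
    i = rng.integers(0, len(X))        # sample a random data point ...
    omega = rng.normal(size=d)         # ... AND a random Fourier feature
    b = rng.uniform(0, 2 * np.pi)
    err = predict(X[i]) - y[i]         # functional gradient of squared loss
    eta = 1.0 / (t + 1)                # decaying step size
    omegas.append(omega)
    bs.append(b)
    coefs.append(-eta * err * np.cos(omega @ X[i] + b))

preds = np.array([predict(x) for x in X[:50]])
```

The memory cost grows only with the number of iterations, which is what makes the kernel method scalable to ImageNet-sized data.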
Similar results with MNIST8M: classification of handwritten digits, 8M images, 10 classes (LeNet5 as the neural net).

Similar results with CIFAR10: classification of internet images, 60K images, 10 classes.
Experiments
1. Fully connected layers crucial? No
2. Convolution layers crucial?
3. Learning parameters end-to-end crucial?
Kernel methods directly on the inputs?

[Bar charts comparing fixed convolution vs. no convolution: MNIST (2 convolution layers, error axis 0–1.2%), CIFAR10 (2 convolution layers, error axis 0–40%), ImageNet (5 convolution layers, error axis 0–100%). Removing the convolution layers degrades accuracy on every dataset.]
Kernel methods + random convolutions?

[Bar charts comparing fixed convolution, no convolution, and random convolution: MNIST (2 convolution layers) and CIFAR10 (2 convolution layers), with bars varying the number of random vs. fixed convolution layers.]
Structured composition is useful: not just fully connected layers and plain composition of the same function, but structured composition of nonlinear functions.
Experiments
1. Fully connected layers crucial? No
2. Convolution layers crucial? Yes
3. Learning parameters end-to-end crucial?
Lots of random features used

[Diagram: AlexNet (58M parameters, 42.6% error) keeps the 4096–4096 fully connected layers on top of the 13x13x256 convolution output; the scalable kernel method (131M parameters, 44.5% error) replaces them with 131K random features feeding the 1000-way output. Convolution layers fixed.]
Are 131M parameters needed?

[Diagram: AlexNet (58M parameters, 42.6% error) vs. the scalable kernel method with only 32K random features (32M parameters, 50.0% error), convolution layers fixed.]
Basis function adaptation is crucial: Barron ['93] bounds the integrated squared approximation error achievable with a given number of basis functions, and adapting the basis functions yields a much smaller error than keeping them fixed.

Fixed basis functions: f(x) = \sum_{i=1}^{7} \alpha_i \, k(x_i, x), with weights \alpha_1, \ldots, \alpha_7 on basis functions k(x_i, x) at fixed centers x_1, \ldots, x_7.

Adapted basis functions: f(x) = \sum_{i=1}^{2} \alpha_i \, k_{w_i}(x_i, x), where each basis function k_{w_i}(x_i, x) has its own learned parameter w_i, so far fewer basis functions (here two, with weights \alpha_1, \alpha_2) suffice.
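A toy numerical contrast between the two formulas (my own construction, not Barron's proof): fit a 1-D function with a few Gaussian basis functions, first with fixed random centers, then while also adapting the centers by gradient descent:

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-3, 3, 200)
y = np.sin(2 * x)                          # 1-D target function

def design(centers):
    # Gaussian bump at each center: k(c, x) = exp(-(x - c)^2)
    return np.exp(-(x[:, None] - centers[None, :]) ** 2)

m = 5
centers = rng.uniform(-3, 3, size=m)       # fixed random centers
Phi = design(centers)
alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)
fixed_err = ((Phi @ alpha - y) ** 2).mean()

best_err = fixed_err                       # now ALSO adapt the centers
for _ in range(200):
    Phi = design(centers)
    alpha, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # refit weights
    resid = Phi @ alpha - y
    best_err = min(best_err, (resid ** 2).mean())
    # gradient of the mean squared error w.r.t. each center
    grad = (2 * resid[:, None] * Phi * alpha[None, :]
            * 2 * (x[:, None] - centers[None, :])).mean(axis=0)
    centers -= 0.1 * grad
```

Tracking `best_err` guarantees the adapted run is never worse than its fixed-center starting point; in typical runs it is substantially better.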
Learning the random features helps a lot

[Diagram: AlexNet (58M parameters, 42.6% error) vs. the scalable kernel method with 32K basis functions learned with basis adaptation (32M parameters, 43.7% error); convolution layers still fixed.]
Learning the convolutions together helps more

[Diagram: AlexNet (58M parameters, 42.6% error) vs. the scalable kernel method with 32K adapted basis functions and jointly learned convolution layers (32M parameters, 41.9% error).]
Lesson learned: exploit structure and train end-to-end.

Deep learning over (time-varying) graphs
Co-evolutionary features: each user and each item carries an embedding, and user-item interactions evolve over time (e.g. users Christine, Alice, David, and Jacob interacting with items on 02/02, 03/02, 06/02, 07/02, 09/02, …).
31
Co-evolutionary embedding
ChristineAliceDavid Jacob
Initialize item embedding
Initialize user embedding
(๐ข๐ ,๐๐ ,๐ก๐ ,๐๐)
Item raw profile features
User raw profile features
DriftContext
EvolutionCo-evolutionUser Item๐ ๐๐ (๐ก๐ )=h(
๐ 1 โ ๐ ๐๐ (๐ก๐โ )
+๐ 2โ ๐ ๐ข๐(๐ก๐โ)
+๐ 3 โ ๐๐
+๐ 4 โ (๐ก๐โ๐ก๐โ 1))Update
U2I:
DriftContext
EvolutionCo-evolutionItemUser๐ ๐ข๐
(๐ก๐ )=h(๐ 1โ ๐ ๐ข๐
(๐ก๐โ )+๐ 2 โ ๐ ๐๐ (๐ก๐
โ )+๐ 3 โ ๐๐
+๐ 4 โ (๐ก๐โ ๐ก๐โ1))Update
I2U:
[Dai et al. Recsys16]
32
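The two update equations can be sketched as follows (the dimension, the nonlinearity h = tanh, and the random parameter matrices V_1..V_4, W_1..W_4 are illustrative assumptions; the slide writes the time gap as a scalar, broadcast here to a vector):

```python
import numpy as np

rng = np.random.default_rng(4)
dim = 8
h = np.tanh                                  # nonlinearity (illustrative choice)

# Separate parameter matrices for the item (V) and user (W) updates
V = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(4)]
W = [rng.normal(scale=0.1, size=(dim, dim)) for _ in range(4)]

user_emb = {u: rng.normal(size=dim) for u in ["Christine", "Alice"]}
item_emb = {i: rng.normal(size=dim) for i in ["item_a", "item_b"]}

def interact(u, i, t, q, t_prev):
    """One (u_n, i_n, t_n, q_n) event: co-evolve BOTH embeddings."""
    fu, fi = user_emb[u], item_emb[i]        # embeddings just before t_n
    dt = np.full(dim, t - t_prev)            # time-gap term (broadcast)
    # drift (own past) + co-evolution (other side) + context + time gap
    item_emb[i] = h(V[0] @ fi + V[1] @ fu + V[2] @ q + V[3] @ dt)
    user_emb[u] = h(W[0] @ fu + W[1] @ fi + W[2] @ q + W[3] @ dt)

q = rng.normal(size=dim)                     # interaction context features
interact("Christine", "item_a", t=2.0, q=q, t_prev=0.0)
interact("Alice", "item_a", t=3.0, q=q, t_prev=2.0)
```

Because Alice's update reads item_a's embedding, which Christine's earlier interaction just changed, the embeddings co-evolve through the interaction history.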
Deep learning with a time-varying computation graph

[Timeline: events at t_0 < t_1 < t_2 < t_3 form mini-batch 1.]

The computation graph of the RNN is determined by:
1. The bipartite interaction graph
2. The temporal ordering of events
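A sketch of how such a time-varying computation graph could be derived from an event stream (the events and names are invented for illustration): process events in temporal order, and make each update depend on the previous update touching the same user or the same item:

```python
from collections import defaultdict

# (user, item, time) interaction events, in arbitrary order
events = [("Alice", "i1", 3.0), ("Christine", "i1", 2.0),
          ("Alice", "i2", 6.0), ("David", "i2", 7.0)]

last_update = {}                 # node (user or item) -> index of its last event
deps = defaultdict(list)         # event index -> event indices it depends on

for idx, (u, i, t) in enumerate(sorted(events, key=lambda e: e[2])):
    for node in (u, i):          # an event depends on the previous event
        if node in last_update:  # touching the same user or the same item
            deps[idx].append(last_update[node])
        last_update[node] = idx

# deps now encodes a DAG over events; unrolling the RNN along this DAG
# gives the time-varying computation graph.
```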
Much improved prediction on the Reddit dataset: next-item prediction and return-time prediction. 1,000 users, 1,403 groups, ~10K interactions. MAR: mean absolute rank difference; MAE: mean absolute error (hours).
Predicting the efficiency of organic solar panel materials

Dataset: Harvard Clean Energy Project
Data points: 2.3 million
Type: molecule (6 atom types; avg. 28 nodes and 33 edges per graph)
Prediction target: power conversion efficiency (PCE), 0–12%
Structure2Vec [Dai et al. ICML 16]

[Diagram: a molecule (atoms x_1, …, x_6, with hydrogens H_1, …, H_6) is treated as a graph; per-node embeddings \mu_i^{(0)} are initialized and updated over iterations 1, …, T to \mu_i^{(T)}; the final embeddings \mu_1^{(T)} + \mu_2^{(T)} + … are aggregated and fed to label classification/regression with learned parameters.]
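A hedged sketch of the iteration (the toy graph, the random parameters W1 and W2, and the tanh nonlinearity are my assumptions; Structure2Vec's actual parameterization is in [Dai et al. ICML 16]):

```python
import numpy as np

rng = np.random.default_rng(5)
# Toy molecular graph: node features and adjacency (invented example)
Xf = rng.normal(size=(4, 3))                 # 4 atoms, 3 raw features each
adj = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]} # bonds

dim, T = 8, 3
W1 = rng.normal(scale=0.3, size=(dim, 3))    # mixes raw node features
W2 = rng.normal(scale=0.3, size=(dim, dim))  # mixes neighbor embeddings

mu = np.zeros((4, dim))                      # mu_i^(0) = 0
for _ in range(T):                           # iterations 1..T
    new_mu = np.zeros_like(mu)
    for i in range(4):
        nbr_sum = mu[adj[i]].sum(axis=0)     # aggregate neighbor embeddings
        new_mu[i] = np.tanh(W1 @ Xf[i] + W2 @ nbr_sum)
    mu = new_mu

graph_emb = mu.sum(axis=0)                   # aggregate: sum of mu_i^(T)
# graph_emb then feeds a linear regressor/classifier, e.g. for PCE prediction
```

Each iteration propagates information one hop further along the graph, so after T iterations every embedding summarizes a T-hop neighborhood.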
Improved prediction with a small model: structure2vec gets ~4% relative error with a 10,000 times smaller model (10% of the data held out for testing).

Method          Test MAE   Test RMSE   # parameters
Mean predictor  1.986      2.406       1
WL level-3      0.143      0.204       1.6 m
WL level-6      0.096      0.137       1378 m
structure2vec   0.085      0.117       0.1 m
Take-home message:
Deep fully connected layers are not the key.
Exploit structure (CNN, co-evolution, structure2vec).
Train end-to-end.