NICTA Copyright 2012 From imagination to impact
Compacting ConvNets
for End-to-End Learning
Jose M. Alvarez
Joint work with Lars Petersson, Hao Zhou, Fatih Porikli.
Success of CNNs
Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton, ImageNet
Classification with Deep Convolutional Neural Networks, NIPS, 2012
Image Classification
Success of CNNs
from Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, Faster R-CNN:
Towards Real-Time Object Detection with Region Proposal Networks,
arXiv:1506.01497
Object Detection
Success of CNNs
Jifeng Dai, Kaiming He, Jian Sun, BoxSup: Exploiting Bounding Boxes to
Supervise Convolutional Networks for Semantic Segmentation, arXiv:1503.01640
Semantic Segmentation
Success of CNNs
Andrej Karpathy, Li Fei-Fei, Deep Visual-Semantic Alignments for Generating
Image Descriptions, CVPR, 2015
Image Captioning
Video classification …
Keys to success
• Better training algorithms
– Batch normalization
– Initializations
– Momentum
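As a concrete illustration of one of these ingredients, the momentum update can be sketched in a few lines; the learning rate and momentum coefficient below are illustrative defaults, not values from any specific paper:

```python
def sgd_momentum_step(w, grad, velocity, lr=0.01, mu=0.9):
    """One SGD-with-momentum update: v <- mu*v - lr*grad, then w <- w + v."""
    velocity = mu * velocity - lr * grad
    return w + velocity, velocity

# Toy run: minimize f(w) = w^2 (gradient 2*w), starting from w = 1.0.
w, v = 1.0, 0.0
for _ in range(100):
    w, v = sgd_momentum_step(w, 2 * w, v)
# w has decayed close to the minimum at 0
```

The velocity term accumulates past gradients, which damps oscillations and speeds convergence along consistent descent directions.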
Keys to success
• Better training algorithms
• Large amount of data / labels
Keys to success
• Better training algorithms
• Large amount of data / labels
• Hardware / Storage
– GPU, parallel systems
[Chart: GPU memory in GB for GTX-580, Titan Black ('14), and Titan X ('15)]
Keys to success
• Better training algorithms
• Large amount of data / labels
• Hardware / Storage
• Larger community of researchers
Keys to success
• Enabled larger networks
[Chart: number of parameters in millions for LeNet-5, AlexNet, and VGGNet-16]
Challenges
Embedded devices with limited resources / power
2014 – Jetson TK1; 2015/16 – Jetson TX1
Challenges
Embedded devices with limited resources / power
- Memory is a limiting factor
- Real-time operation
Computational Cost
The forward pass is time consuming (AlexNet)
Computational Cost
Memory bottleneck (AlexNet)
Computational Cost
Memory bottleneck
VGGNet parameters:
conv3-64 x 2  :      38,720
conv3-128 x 2 :     221,440
conv3-256 x 3 :   1,475,328
conv3-512 x 3 :   5,899,776
conv3-512 x 3 :   7,079,424
fc1           : 102,764,544
fc2           :  16,781,312
fc3           :   4,097,000
TOTAL         : 138,357,544
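The totals above can be reproduced directly from the layer shapes: 3x3 kernels with biases, and a 7x7x512 feature map entering fc1 (from a 224x224 input). A quick sanity check:

```python
# Reproduce the VGGNet totals above: 3x3 kernels, weights + biases per layer,
# and a 7x7x512 feature map entering fc1 (from a 224x224 input).
def conv_params(c_in, c_out, k=3):
    return k * k * c_in * c_out + c_out          # weights + biases

def block(c_in, c_out, n):
    total = conv_params(c_in, c_out)             # first layer changes width
    total += (n - 1) * conv_params(c_out, c_out)
    return total

def fc_params(d_in, d_out):
    return d_in * d_out + d_out

total = (block(3, 64, 2) + block(64, 128, 2) + block(128, 256, 3)
         + block(256, 512, 3) + block(512, 512, 3)
         + fc_params(512 * 7 * 7, 4096)          # fc1: 102,764,544
         + fc_params(4096, 4096)                 # fc2: 16,781,312
         + fc_params(4096, 1000))                # fc3: 4,097,000
print(total)                                     # 138357544, matching the slide
```

Note that the three fully connected layers alone account for roughly 90% of the 138M parameters.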
Do we need all these parameters?
Over-Parameterization
• ‘Needed for highly non-convex optimization’1
1 Anna Choromanska, Mikael Henaff, Michael Mathieu, Gérard Ben Arous, Yann LeCun.
The Loss Surfaces of Multilayer Networks. AISTATS 2015
Over-Parameterization
• ‘Needed for highly non-convex optimization’
• Deeper structures, larger learning capacity1
1 Guido Montúfar, Razvan Pascanu, Kyunghyun Cho, Yoshua Bengio. On the Number of
Linear Regions of Deep Neural Networks. NIPS 2014
Over-Parameterization
• ‘Needed for highly non-convex optimization’
• Deeper structures, larger learning capacity
• From images to video -> even larger nets?
A. Karpathy et al. Large-scale Video Classification with Convolutional
Neural Networks. CVPR 2014.
Compacting CNN
Compacting CNN
• Network distillation
• Network pruning
• Structured parameters
– Ours
Compacting CNN
• Network distillation
Compacting CNN
• Network distillation
– Large network learns from the data
– Generate labels using the trained network
– Train smaller nets on its soft outputs (soft targets)
Geoffrey Hinton, Oriol Vinyals, Jeff Dean. Distilling the Knowledge in a Neural Network.
NIPS Workshop 2015
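A minimal sketch of that training signal: cross-entropy on the hard labels mixed with a soft-target term computed at temperature T, following Hinton et al.'s idea. The function names and the values of T and alpha here are illustrative choices, not the paper's exact settings:

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    """Hard cross-entropy on true labels + cross-entropy against the teacher's
    softened outputs, scaled by T^2 as in Hinton et al."""
    p_s = softmax(student_logits)
    n = len(labels)
    hard = -np.log(p_s[np.arange(n), labels] + 1e-12).mean()
    soft_t = softmax(teacher_logits, T)
    soft_s = softmax(student_logits, T)
    soft = -(soft_t * np.log(soft_s + 1e-12)).sum(axis=-1).mean() * T * T
    return alpha * hard + (1 - alpha) * soft

rng = np.random.default_rng(0)
student = rng.normal(size=(8, 10))   # student logits for a batch of 8
teacher = rng.normal(size=(8, 10))   # teacher logits for the same batch
labels = rng.integers(0, 10, size=8)
loss = distillation_loss(student, teacher, labels)
print(loss)
```

The high temperature flattens the teacher's distribution so that the relative probabilities of wrong classes (the "dark knowledge") carry signal to the student.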
Compacting CNN
• Network distillation (II)
– Use intermediate layers to guide the training
Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo
Gatta and Yoshua Bengio. FitNets: Hints for Thin Deep Nets. ICLR 2015
Compacting CNN
• Pros
– In general, better generalization and faster inference.
– Equal or slightly better performance.
• Cons
– Requires a larger, fully trained network to learn from.
Compacting CNN
• Network distillation
• Network pruning
– Directly remove unimportant parameters during
training
• Requires second derivatives.
– Remove parameters + quantization1
• Good compression rates (orthogonal to other approaches)
1S. Han, H. Mao, and W. J. Dally. Deep compression: Compressing deep neural network
with pruning, trained quantization and huffman coding. CoRR, abs/1510.00149, 2015
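Magnitude-based threshold pruning, the first stage of the Deep Compression pipeline, can be sketched as follows. This is a simplified one-shot version: the full pipeline alternates pruning with retraining and follows up with quantization and Huffman coding:

```python
import numpy as np

def magnitude_prune(weights, sparsity=0.9):
    """Zero out the smallest-magnitude fraction of weights.
    One-shot threshold pruning; Deep Compression additionally retrains the
    surviving weights and then quantizes and Huffman-codes them."""
    flat = np.sort(np.abs(weights).ravel())
    k = int(sparsity * flat.size)
    threshold = flat[k] if k < flat.size else np.inf
    mask = np.abs(weights) >= threshold
    return weights * mask, mask

rng = np.random.default_rng(0)
w = rng.normal(size=(256, 256))             # stand-in for a layer's weights
pruned, mask = magnitude_prune(w, sparsity=0.9)
print(round(1 - mask.mean(), 3))            # fraction zeroed, approximately 0.9
```

The surviving weights can then be stored in a sparse format, which is where the memory savings come from.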
Compacting CNN
• Network distillation
• Network pruning
• Structured parameters
Compacting CNN: Structured parameters
Max Jaderberg, Andrea Vedaldi, Andrew Zisserman. Speeding up Convolutional Neural
Networks with Low Rank Expansions. BMVC 2014
• Low rank approximations
Compacting CNN: Structured parameters
Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, Rob Fergus. Exploiting
Linear Structure Within Convolutional Networks for Efficient Evaluation. NIPS 2014
• Low rank approximations (II)
Compacting CNN: Structured parameters
• Low rank approximations (III)
– Weights are approximated by a sum of rank-1
tensors.
Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, Rob Fergus. Exploiting
Linear Structure Within Convolutional Networks for Efficient Evaluation. NIPS 2014
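The rank-1 decomposition above is essentially an SVD: a 2-D filter f equals the sum of terms s_i * u_i v_i^T, and truncating the sum gives a low-rank approximation. A small numpy illustration (the 7x7 filter here is random, purely for demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)
f = rng.normal(size=(7, 7))          # stand-in for a learned 7x7 filter

# SVD writes f as a sum of rank-1 outer products s_i * u_i v_i^T.
u, s, vt = np.linalg.svd(f)
rank1_terms = [s[i] * np.outer(u[:, i], vt[i]) for i in range(len(s))]

approx = sum(rank1_terms[:3])        # keep only the 3 largest terms
err = np.linalg.norm(f - approx) / np.linalg.norm(f)
print(round(err, 3))                 # relative error; keeping all 7 terms recovers f exactly
```

A d x d filter kept at rank r costs 2*d*r parameters (plus the r scales) instead of d*d, and the convolution can be applied as cheaper separable passes.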
Compacting CNN: Structured parameters
• Weak points
– Needs a fully trained full-rank network.
– Not all filters can be approximated.
– Theoretical speed-ups come with a drop in performance.
Emily Denton, Wojciech Zaremba, Joan Bruna, Yann LeCun, Rob Fergus. Exploiting
Linear Structure Within Convolutional Networks for Efficient Evaluation. NIPS 2014
Compacting CNN: Structured parameters
• Weak points
– Needs a fully trained full-rank network.
– Not all filters can be approximated.
– Drop in performance.
• Strengths
– Can aid regularization during or after training.
– Parameter sharing within the layer.
Compacting CNN: Structured parameters
K. Simonyan, A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image
Recognition. ICLR, 2015
• Low rank approximations (IV)
– VGG nets restrict filters during training.
– Same ‘receptive field’ (three stacked 3x3 layers cover 7x7).
– Deeper networks (more nonlinearities).
– Fewer parameters (49C² for one 7x7 layer vs. 3x(3x3)C² = 27C² for three 3x3 layers).
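The parameter comparison in the last bullet is easy to verify; C below is set to 512 as in VGG's widest layers, and biases are omitted as on the slide:

```python
# Parameters for C input/output channels: one 7x7 layer vs. three stacked
# 3x3 layers with the same 7x7 receptive field (biases omitted).
C = 512
one_7x7 = 7 * 7 * C * C          # 49 * C^2
three_3x3 = 3 * (3 * 3 * C * C)  # 27 * C^2
print(one_7x7, three_3x3)        # three 3x3 layers use 27/49 of the parameters
```

So the stacked design is cheaper by a factor of 49/27, roughly 1.8x, while adding two extra nonlinearities.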
Compacting CNN: Structured parameters
• Low rank approximations (Ours1)
– Filter restriction during training.
– Larger receptive fields.
– Deeper networks (more nonlinearities).
– Parameter sharing.
– Fewer parameters.
1Joint work with Lars Petersson. Under review
Compacting CNN: Structured parameters
• Low rank approximations (Ours)
– ImageNet Results (AlexNet).
Baseline: Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton. ImageNet Classification with
Deep Convolutional Neural Networks. NIPS 2012
Compacting CNN: Structured parameters
• Low rank approximations (Ours)
– Stereo matching.
[Results: Ours-1 (32K), Ours-1 (48K), Ours-3 (32K)]
Baseline: Jure Zbontar, Yann LeCun. Computing the Stereo Matching Cost With a
Convolutional Neural Network. CVPR 2015
Memory?
Memory Bottleneck
• Sparse constraints during training (Ours2)
– Directly reduce the number of neurons.
– Select the optimal number of neurons.
– Significant memory reductions with a minor drop in
performance.
2Joint work with Hao Zhou, Fatih Porikli. Under review
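One common way to realize such a constraint (a sketch of the general group-sparsity idea, not necessarily the exact formulation of the work under review) is a group-lasso penalty over each neuron's weight vector, so the regularizer can drive entire neurons, not just individual weights, to zero:

```python
import numpy as np

def group_lasso_penalty(W, lam=1e-3):
    """Sum of l2 norms over output-neuron groups (rows of W): lam * sum_i ||W_i||_2.
    Unlike an elementwise l1 penalty, this drives whole rows (neurons) to zero."""
    return lam * np.sqrt((W ** 2).sum(axis=1)).sum()

def count_active_neurons(W, tol=1e-6):
    """Neurons whose weight vector has not been zeroed out."""
    return int((np.sqrt((W ** 2).sum(axis=1)) > tol).sum())

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 128))      # 64 neurons, 128 inputs each
W[10:20] = 0.0                      # rows the sparse regularizer has driven to zero
print(count_active_neurons(W))      # 54 active neurons remain
```

Removing a zeroed neuron shrinks both this layer's weight matrix and the next layer's input dimension, which is where the memory reduction comes from.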
Do we need all these parameters?
Compacting ConvNets
for End-to-End Learning
Jose M. Alvarez
Joint work with Lars Petersson, Hao Zhou, Fatih Porikli.