convolutional neural networks for image classification
Evidence from Kaggle National Data Science Bowl
Dmytro Mishkin, ducha.aiki at gmail com
March 25, 2015
Czech Technical University in Prague
kaggle national data science bowl overview
The image classification problem
• 130,400 test images
• 30,336 train images
• 1 channel (grayscale)
• 121 (biased) classes
• 90% of images ≤ 100x100 px
• logloss score $= -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{M} y_{ij} \log p_{ij}$ (see the sketch below)
• No external data
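The evaluation metric can be computed directly from the predicted class probabilities; a minimal numpy sketch (the clipping constant eps is an assumption to avoid log(0), not something stated on the slide):

import numpy as np

def multiclass_logloss(y_true, p, eps=1e-15):
    """Kaggle-style multiclass logloss.

    y_true : (N,) integer class labels in [0, M)
    p      : (N, M) predicted class probabilities
    """
    p = np.clip(p, eps, 1 - eps)             # avoid log(0)
    p = p / p.sum(axis=1, keepdims=True)     # re-normalize rows after clipping
    n = len(y_true)
    return -np.mean(np.log(p[np.arange(n), y_true]))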
classes diagram
1 url: http://npow.github.io/plankton/viewer/index.html
final leaderboard
Which approach to use?
lunch time chat at kth’s computer vision group
• A computer vision scientist: How long does it take to train these generic features on ImageNet?
• Hossein: 2 weeks
• Ali: almost 3 weeks, depending on the hardware
• The computer vision scientist: hmmmm...
• Stefan: Well, you have to compare the three weeks to the last 40 years of computer vision2
2 http://www.csc.kth.se/cvap/cvg/DL/ots/
convolutional networks
CNNs are the state of the art in image recognition tasks such as:3
• Object Image Classification
• Scene Image Classification
• Action Image Classification
• Object Detection
• Semantic Segmentation
• Fine-grained Recognition
• Attribute Detection
• Metric Learning
• Instance Retrieval (almost)
3 They beat classic computer vision methods on 19 datasets out of 20: http://www.csc.kth.se/cvap/cvg/DL/ots/
contents
1. Basics of convolutional networks
2. Image preprocessing
3. Network architectures
4. Ensembling
5. What (seemingly) does and does not work
6. Winner's solution highlights
..basics of convolutional networks
what is convolution
4 https://developer.apple.com/library/ios/documentation/Performance/Conceptual/vImage/ConvolutionOperations/ConvolutionOperations.html
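Since the slide only links to an illustration, here is a minimal numpy sketch of a single 2D convolution (strictly, the cross-correlation that CNN layers compute); the image and kernel below are illustrative placeholders:

import numpy as np

def conv2d(image, kernel):
    """Valid 2D cross-correlation of a grayscale image with one kernel."""
    kh, kw = kernel.shape
    oh = image.shape[0] - kh + 1
    ow = image.shape[1] - kw + 1
    out = np.empty((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# a 3x3 edge-detection kernel applied to a random 8x8 patch
img = np.random.rand(8, 8)
edge = np.array([[-1, -1, -1],
                 [-1,  8, -1],
                 [-1, -1, -1]], dtype=float)
print(conv2d(img, edge).shape)    # (6, 6)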
softmax classifier
Softmax (cross-entropy) loss: $L_i = -\log\left(\dfrac{e^{f_{y_i}}}{\sum_j e^{f_j}}\right)$

SVM (hinge) loss: $L_i = \sum_{j \neq y_i} \max\left(0,\; f(x_i, W)_j - f(x_i, W)_{y_i} + \Delta\right)$
5 http://vision.stanford.edu/teaching/cs231n/linear-classify-demo/
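Both losses for a linear classifier f(x, W) = Wx fit in a few lines of numpy; a sketch for a single example, with the usual margin Delta = 1 assumed:

import numpy as np

def softmax_loss(scores, y):
    """Cross-entropy loss L_i = -log(e^{f_y} / sum_j e^{f_j}) for one example."""
    shifted = scores - scores.max()            # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum())
    return -log_probs[y]

def hinge_loss(scores, y, delta=1.0):
    """Multiclass SVM loss L_i = sum_{j != y} max(0, f_j - f_y + delta)."""
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0                           # the correct class is not counted
    return margins.sum()

W = np.random.randn(121, 64 * 64) * 0.01       # 121 classes, flattened 64x64 input
x = np.random.randn(64 * 64)
scores = W @ x                                 # f(x, W)
print(softmax_loss(scores, y=3), hinge_loss(scores, y=3))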
lenet-5. no other layers are necessary
[Figure: LeNet-5 architecture6]
First proposed by LeCun7 in 1989, recently revived by Springenberg et al. in "All Convolutional Net"8.
6 http://eblearn.sourceforge.net/beginner_tutorial2_train.html
7 url: https://www.facebook.com/yann.lecun/posts/10152766574417143
8 J. T. Springenberg et al. "Striving for Simplicity: The All Convolutional Net". In: ArXiv e-prints (2014). arXiv: 1412.6806 [cs.LG].
non-linearities
[Plot: common non-linearities (TanH, Sigmoid, ReLU, maxout (sort of), LeakyReLU) — activation vs. input]
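The plotted non-linearities as code; a rough sketch, where the LeakyReLU slope 0.01 and the maxout pieces are illustrative values, not the ones used in the competition:

import numpy as np

def tanh(x):               return np.tanh(x)
def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))
def relu(x):               return np.maximum(0.0, x)
def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)

def maxout(x, W, b):
    """Maxout takes the max over k linear pieces: max_k (W_k x + b_k)."""
    return np.max(W @ x + b, axis=0)

x = np.linspace(-3, 3, 7)
print(relu(x), leaky_relu(x))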
regularization - dropout, weight decay
9 Nitish Srivastava et al. "Dropout: A Simple Way to Prevent Neural Networks from Overfitting". In: Journal of Machine Learning Research 15 (2014), pp. 1929–1958. url: http://jmlr.org/papers/v15/srivastava14a.html
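A minimal numpy sketch of (inverted) dropout and L2 weight decay at training time; the drop probability 0.5 and weight-decay coefficient 5e-4 are common defaults assumed here, not values quoted on the slide:

import numpy as np

def dropout_forward(activations, p_drop=0.5, train=True):
    """Inverted dropout: zero units with prob p_drop and rescale the survivors,
    so that no extra scaling is needed at test time."""
    if not train:
        return activations
    mask = (np.random.rand(*activations.shape) >= p_drop) / (1.0 - p_drop)
    return activations * mask

def weight_decay_grad(W, wd=5e-4):
    """L2 weight decay adds wd * W to the gradient of the loss w.r.t. W."""
    return wd * W

h = np.random.randn(4, 500)            # a batch of fc-layer activations
print(dropout_forward(h).mean(), dropout_forward(h, train=False).mean())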
deep learning libraries
Table 1: Popular deep learning GPU libraries
Name | url | languages | Notes
caffe | github.com/BVLC/caffe | C++/Python/no | largest community
cxxnet | github.com/dmlc/cxxnet | C++/no | good memory management
Theano | github.com/Theano/Theano | Python | huge flexibility
Torch | facebook/fbcunn | lua | LeCun/Facebook library
cuda-convnet2 | code.google.com/p/cuda-convnet2/ | C++/python |
SparseConvNet | http://tinyurl.com/pu65cfp | C++/CUDA | differs from others
..image preprocessing
basic network architecture
72x72x1 → crop to 64x64 → 20C5 → MP2 → 50C5 → MP2 → 500IP → clf
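In this notation, 20C5 is a convolution with 20 filters of size 5x5, MP2 is 2x2 max pooling, 500IP is a 500-unit inner-product (fully connected) layer, and clf is the 121-way softmax. A small sketch that traces the resulting feature-map sizes, assuming unpadded stride-1 convolutions:

def conv(size, k, pad=0, stride=1):
    return (size + 2 * pad - k) // stride + 1

def pool(size, k=2, stride=2):
    return (size - k) // stride + 1

s = 64                      # after cropping the 72x72 input to 64x64
s = conv(s, 5); print("20C5 :", 20, "x", s, "x", s)   # 20 x 60 x 60
s = pool(s);    print("MP2  :", 20, "x", s, "x", s)   # 20 x 30 x 30
s = conv(s, 5); print("50C5 :", 50, "x", s, "x", s)   # 50 x 26 x 26
s = pool(s);    print("MP2  :", 50, "x", s, "x", s)   # 50 x 13 x 13
print("500IP -> 121-way clf")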
basic data preprocessing
Table 2: 5-layer network experiments, 48x48 input image, no non-linearities, mean pixel extraction

Name, augmentation | Val logloss | Train logloss
No mean extraction, no scaling | – | –
mirror | 1.67 | 0.64
histeq, mirror | 1.74 | 0.64
mirror + ReLU | 1.61 | 0.44
mirror + scale | 1.42 | 0.937
mirror + scale, LeakyReLU | 1.34 | 0.83
mirror + rand rot | 1.53 | 1.31
basic data preprocessing
Table 3: 5-layer network experiments, 48x48 input image, LeakyReLU non-linearities, mean pixel extraction

Name, augmentation | Val logloss | Train logloss
mirror + scale | 1.34 | 0.83
invert, mirror + scale | 1.27 | 0.80
invert, norm, mirror + scale | 1.24 | 0.505
invert, norm, mirror + scale, salt-pepper | 1.15 | n/a
more geometric transformations
Table 4: 5-layer network experiments, 64x64 input image, LeakyReLU
Name, augmentation | Val logloss
mirror | 1.30
mirror + scale (resize modes) | 1.12
h+v mirror, scale | 1.10
h+v mirror, scale + rot | 1.08
mirror, less baselr | 1.04 :)
h+v mirror, scale + rot, depolar imgs | 1.28
regularization methods
Table 5: 5-layer network experiments, 64x64 input image, LeakyReLU
Name, augmentation | Val logloss
h+v mirror, scale + rot, vanilla | 1.08
h+v mirror, scale + rot, PReLU (but slows training down a lot)10 | 1.03
h+v mirror, scale + rot, BatchNorm11 | 1.10
h+v mirror, scale + rot, StochPool12 | 0.98

10 K. He et al. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". In: ArXiv e-prints (2015). arXiv: 1502.01852 [cs.CV].
11 S. Ioffe and C. Szegedy. "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift". In: ArXiv e-prints (2015). arXiv: 1502.03167 [cs.LG].
12 M. D. Zeiler and R. Fergus. "Stochastic Pooling for Regularization of Deep Convolutional Neural Networks". In: ArXiv e-prints (2013). arXiv: 1301.3557 [cs.LG].
data augmentation - don't forget about it during test time
for rotation in 0, 90, 180, 270 degrees:
    for each of 9 crops (N, NE, E, ...):
        get predictions for mirrored and non-mirrored versions
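A sketch of that loop in Python; predict and the crop functions are placeholders for the network forward pass and the nine crop positions:

import numpy as np

def tta_predict(img, predict, crops):
    """Average predictions over 4 rotations x len(crops) crops x 2 mirrorings."""
    preds = []
    for k in range(4):                          # 0, 90, 180, 270 degrees
        rotated = np.rot90(img, k)
        for crop_fn in crops:                   # N, NE, E, ... crop functions
            patch = crop_fn(rotated)
            preds.append(predict(patch))
            preds.append(predict(patch[:, ::-1]))   # horizontal mirror
    return np.mean(preds, axis=0)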
..network architectures
cifar/lenet for testing
Pros
• + Training time < 20 min
• + Can be done in parallel
• + Therefore, lots of experiments

Cons
• − Not complex enough to check some things (e.g. BatchNorm)
• − Therefore it might lead to wrong conclusions about "bad" things (e.g. random rotations hurt CifarNets, but help VGGNets)
• − Or about "good" things (e.g. stochastic pooling helps CifarNets, but does nothing for VGGNets)
We need to go deeper
googlenet
GoogLeNet architecture13
13 C. Szegedy et al. "Going Deeper with Convolutions". In: ArXiv e-prints (2014). arXiv: 1409.4842 [cs.CV].
googlenet
22 layers, but a simple base brick – the "Inception" module
internal ensemble
Take the mean of all auxiliary classifiers instead of just throwing them away.

Table 6: GoogLeNet, validation loss

Name | Public LB
clf on inc3 | 0.722
clf on inc4a | 0.754
clf on inc4b | 0.757
clf on inc5b | 0.855
average | 0.693

Table 7: VGGNet, validation loss

Name | Public LB
clf on pool4 | 0.762
clf on pool5 | 0.657
clf on fc7 | 0.707
average | 0.630
14 J. Xie, B. Xu, and Z. Chuang. "Horizontal and Vertical Ensemble with Deep Representation for Classification". In: ArXiv e-prints (2013). arXiv: 1306.2759 [cs.LG].
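Averaging the auxiliary heads amounts to a plain mean over their softmax outputs; a sketch where the per-head probability arrays are random placeholders standing in for the real classifier outputs:

import numpy as np

# per-head class probabilities for a small batch, shape (n_images, 121) each;
# placeholders standing in for the outputs of clf on inc3 ... clf on inc5b
head_probs = [np.random.dirichlet(np.ones(121), size=8) for _ in range(4)]

internal_ensemble = np.mean(head_probs, axis=0)   # (8, 121), rows still sum to 1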
googlenet-results
Table 8: GoogLeNet, 64x64 input image, Leaky ReLU (if not stated otherwise), AlexNet-oversample

Name | Public LB
No inv, scale, ReLU, last-clf | 0.910
No inv, scale, ReLU | 0.859
No inv, scale | 0.816
No inv, scale, maxout-clf | 0.785
Inv, scale, maxout-clf, retrain | 0.703
96x96, inv, scale, maxout-clf, retrained, no-aug-ft15 | 0.684
112x112, inv, scale, maxout-clf, retrained, no-aug-ft | 0.716
48x48, inv, scale, maxout-clf, retrained, no-aug-ft + test rot | 0.749
96x96, inv, scale, maxout-clf, retrained, no-aug-ft + test rot | 0.679
48x48+96x96+112x112, inv, scale, maxout-clf, retrained, no-aug-ft | 0.677
15 Ben Graham's trick: finetune the converged model for 1–5 epochs without data augmentation with a small lr: http://blog.kaggle.com/2015/01/02/cifar-10-competition-winners-interviews-with-dr-ben-graham-phil-culliton-zygmunt-zajac/
vggnet
VGGNet architectures16
Differences: Dropout in conv-layers (0.3), SPP-pooling for pool5, LeakyReLU, aux. clf.
16 K. Simonyan and A. Zisserman. "Very Deep Convolutional Networks for Large-Scale Image Recognition". In: ArXiv e-prints (Sept. 2014). arXiv: 1409.1556 [cs.CV].
spatial pyramid pooling
17 K. He et al. "Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition". In: ArXiv e-prints (2014). arXiv: 1406.4729 [cs.CV].
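SPP max-pools the last conv feature map over a fixed pyramid of grids and concatenates the results, so the fully connected layer always receives a vector of the same length regardless of input size. A minimal numpy sketch; the pyramid levels (1x1, 2x2, 4x4) and the feature-map size are illustrative choices:

import numpy as np

def spp(feature_map, levels=(1, 2, 4)):
    """feature_map: (C, H, W) conv activations -> fixed-length vector."""
    C, H, W = feature_map.shape
    pooled = []
    for n in levels:                              # n x n pooling grid
        hs = np.linspace(0, H, n + 1).astype(int)
        ws = np.linspace(0, W, n + 1).astype(int)
        for i in range(n):
            for j in range(n):
                cell = feature_map[:, hs[i]:hs[i + 1], ws[j]:ws[j + 1]]
                pooled.append(cell.max(axis=(1, 2)))
    return np.concatenate(pooled)                 # length C * (1 + 4 + 16)

print(spp(np.random.rand(512, 13, 13)).shape)     # (10752,), independent of 13x13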
vggnet-results
Table 9: VGGNet, 64x64 input image, Leaky ReLU (if not stated otherwise), AlexNet-oversample, no-SPP

Name | Public LB
No inv, scale, ReLU, fc-maxout | 0.752
Inv, scale, single random crop | 0.773
Inv, scale, 50 random crops | 0.751
Inv, scale | 0.729
Inv, scale, retrained | 0.720
Inv, scale, fc-maxout | 0.662
Inv, scale, fc-maxout, SPP | 0.654
All VGGNets mix | 0.650
sparseconvnet
• 0.79 LB score
• Unusual library
• 2x2 (C2) instead of 3x3 (C3) convolutions
• Padding only for the input image
• Kaggle CIFAR-10 winning architecture

320C2 - 320C2 - MP2 -
640C2 - 10% dropout - 640C2 - 10% dropout - MP2 -
960C2 - 20% dropout - 960C2 - 20% dropout - MP2 -
1280C2 - 30% dropout - 1280C2 - 30% dropout - MP2 -
1600C2 - 40% dropout - 1600C2 - 40% dropout - MP2 -
1920C2 - 50% dropout - 1920C1 - 50% dropout -
121C1 - Softmax output
ensemble-results
Table 10: Different mixes of all models (3 GoogLeNets, 4 VGGNets, 1 SparseConvNet)

Name | Public LB | Private LB
4 VGG | 0.650 | 0.651
3 VGG, 1 GLN | 0.625 | 0.629
4 VGG, 3 GLN | 0.617 | 0.618
4 VGG, 3 GLN, 1 Sparse | 0.611 | 0.616
4 VGG, 3 GLN, 1 Sparse, figure-skating | 0.609 | 0.613
..misc
batchnorm
Works for CIFAR
But it made no big difference for VGGNet in the KNDB for me. However, it works for other people, e.g. Jae Hyun Lim18, 22nd place.
18 https://github.com/lim0606/ndsb
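For reference, the batch-normalization forward pass at training time in a few lines of numpy (gamma and beta are the learnable scale and shift; the running statistics used at test time are omitted here):

import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D) batch of activations; normalize each feature over the batch."""
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 500) * 3.0 + 2.0
out = batchnorm_forward(x, gamma=np.ones(500), beta=np.zeros(500))
print(out.mean(), out.std())      # approximately 0 and 1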
what else seems to work here
• Retrain top layers with a different non-linearity (cheat diversity)
• Figure-skating average – throw away the max and min prediction (0.003 LB score); see the sketch below
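One reading of the figure-skating average: for every image and class, sort the per-model predictions, drop the highest and lowest, and average the rest. A numpy sketch with placeholder predictions:

import numpy as np

def figure_skating_average(preds):
    """preds: (n_models, n_images, n_classes) -> trimmed mean over models."""
    sorted_preds = np.sort(preds, axis=0)
    return sorted_preds[1:-1].mean(axis=0)      # drop min and max per entry

# e.g. 8 models (4 VGG, 3 GoogLeNet, 1 SparseConvNet), 10 images, 121 classes
preds = np.random.rand(8, 10, 121)
print(figure_skating_average(preds).shape)      # (10, 121)

The trimmed rows no longer sum exactly to one, so re-normalizing them before submission is a reasonable extra step.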
what seems not to work here
• Dense SIFT + BoW / Fisher Vector: ~60% accuracy
• Random forest on CNN features: ~65% accuracy
• Mix of hinge and cross-entropy losses
• Averaging with a mean other than the arithmetic one
• Image enhancement or preprocessing (histogram equalization, etc.)
..winner's solution highlights
team work
• Roll-pool
• Hand-engineered features
• RMS-Pool
• Knowledge distillation
19 http://benanne.github.io/2015/03/17/plankton.html
Questions?
thanks
This nice presentation theme is taken from
github.com/matze/mtheme
The theme itself is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.