Modern Convolutional Neural Network techniques for image segmentation
Deep Learning Journal Club
Gioele Ciaparrone
Michele Curci
November 30, 2016
University of Salerno
Index
1. Introduction
2. The Inception architecture
3. Fully convolutional networks
4. Hypercolumns
5. Conclusion
Introduction
CNN recap
Sequence of convolutional and pooling layers
Rectifier activation function
Fully connected layers at the end
Softmax function for classification
Convolution I
Convolution II
Valid padding (left) and same padding (right) convolutions
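As a rough sketch of the difference between the two padding modes, the output size along one spatial dimension can be computed as follows. The function name and the "same" convention (output size equal to the input size divided by the stride, rounded up, as used for instance by TensorFlow) are assumptions made for illustration.

```python
import math


def conv_output_size(input_size, kernel_size, stride=1, padding="valid"):
    """Output size of a convolution along one spatial dimension."""
    if padding == "valid":
        # No padding: the kernel must fit entirely inside the input.
        return (input_size - kernel_size) // stride + 1
    if padding == "same":
        # Zero padding is added so that, at stride 1, output size == input size.
        return math.ceil(input_size / stride)
    raise ValueError("padding must be 'valid' or 'same'")


print(conv_output_size(5, 3, padding="valid"))  # the output shrinks
print(conv_output_size(5, 3, padding="same"))   # the output keeps the input size
```

With a 5x5 input and a 3x3 kernel, valid padding shrinks each side to 3, while same padding keeps it at 5.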
LeNet-5 (1989-1998)
First CNN (1989) proven to work well, used for handwritten ZIP code recognition [1]
Refined through the years until the LeNet-5 version (1998) [2]
LeNet-5 interactive visualization [3]
It's possible to interact with the network in 3D: manually drawing a digit to be classified, clicking on the neurons to get info about the parameters and the connected units, or rotating and zooming the network:
http://scs.ryerson.ca/~aharley/vis/conv/
AlexNet (2012) [5]
After a long hiatus in which deep learning was largely ignored [4], neural networks received attention once again when Alex Krizhevsky won the ILSVRC in 2012 by a wide margin with AlexNet
Structure very similar to LeNet-5, but with some new key insights: very efficient GPU implementation, ReLU neurons and dropout
The Inception architecture
Motivations
Increasing model size tends to improve quality
More computational resources are needed
Computational efficiency and low parameter count are still important
Mobile vision and embedded systems
Big Data
Going Deeper with Convolutions [6]
The Inception module addresses this problem by making better use of the computing resources
Proposed in 2014 by Christian Szegedy and other Google researchers
Used in the GoogLeNet architecture, which won both the ILSVRC 2014 classification and detection challenges
Inception module I
Visual information is processed at various scales and then aggregated
Since pooling operations are beneficial in CNNs, a parallel pooling path has been added
Problems:
3x3 and 5x5 convolutions can be very expensive on top of a layer with lots of filters
The number of filters substantially increases for each Inception layer added, leading to a computational blow-up
Inception module II
Adding 1x1 convolutions before the bigger convolutions reduces the dimensionality (number of channels) of their input
The same is done after the pooling layer
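To see why the 1x1 reduction helps, the following sketch counts the multiply-adds of a 5x5 convolution with and without a preceding 1x1 convolution. The layer sizes (28x28 grid, 192 input channels, 16 reduced channels, 32 output channels) are hypothetical values picked for illustration, not figures taken from the slides.

```python
def conv_multiply_adds(height, width, in_channels, out_channels, kernel_size):
    """Multiply-adds of a same-padded, stride-1 convolution (bias ignored)."""
    return height * width * in_channels * out_channels * kernel_size * kernel_size


# Hypothetical layer sizes, chosen only for illustration.
H = W = 28
C_IN, C_REDUCED, C_OUT = 192, 16, 32

# 5x5 convolution applied directly to all input channels.
direct = conv_multiply_adds(H, W, C_IN, C_OUT, 5)

# 1x1 convolution to C_REDUCED channels, then the 5x5 convolution.
reduced = (conv_multiply_adds(H, W, C_IN, C_REDUCED, 1)
           + conv_multiply_adds(H, W, C_REDUCED, C_OUT, 5))

print(direct, reduced, round(direct / reduced, 1))
```

Under these assumed sizes the reduced path needs roughly a tenth of the operations of the direct one.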
GoogLeNet I
GoogLeNet is a particular incarnation of the Inception architecture
22 layers with parameters (27 including pooling)
9 Inception modules
2 auxiliary classifiers to mitigate the vanishing gradient problem and for regularization
Designed with computational efficiency in mind
Inference can be run on devices with limited computational resources, especially memory
7 of these networks were used in an ensemble for the ILSVRC 2014 classification task
GoogLeNet II
GoogLeNet III
GoogLeNet - Training
Trained with the DistBelief distributed machine learning system
Asynchronous stochastic gradient descent with 0.9 momentum
Image sampling methods changed many times before the competition
Already-converged models were trained further with other options
Models were trained on crops of different sizes
There is no definitive guidance on the single most effective way to train these networks
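For reference, a minimal sketch of the classical momentum update with the 0.9 momentum mentioned above. It runs on a single worker and a toy scalar objective, so the asynchronous, distributed part of the actual training setup is not modelled; the learning rate and the objective are assumptions.

```python
def sgd_momentum_step(weight, velocity, gradient, learning_rate=0.1, momentum=0.9):
    """One update of classical momentum SGD for a single scalar weight."""
    # The velocity accumulates an exponentially decaying sum of past gradients.
    velocity = momentum * velocity - learning_rate * gradient
    return weight + velocity, velocity


# Toy objective f(w) = w^2 with gradient 2w (hypothetical, for illustration only).
w, v = 1.0, 0.0
for _ in range(300):
    w, v = sgd_momentum_step(w, v, gradient=2.0 * w)

print(abs(w) < 1e-3)
```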
GoogLeNet - ILSVRC 2014 Results
Classification (above) and object detection (below) results
DeepDream
Google's DeepDream uses a GoogLeNet to produce machine dreams
Inception-v2 and Inception-v3
The Inception module authors later presented new optimized versions of the architecture, called Inception-v2 and Inception-v3 [7]
They managed to significantly improve on the GoogLeNet ILSVRC 2014 results
The improvements were based on various key principles:
Avoid representational bottlenecks
Spatial aggregation on lower dimensional embeddings doesn't usually induce relevant losses in representational power
Balance the width and depth of the network
Convolution factorization I
Factorizing convolutions reduces the number of parameters without losing much expressiveness
For example, a 5x5 convolution can be factorized into a pair of 3x3 convolutions
It is also possible to factorize an NxN convolution into a 1xN convolution followed by an Nx1 convolution
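The parameter savings of both factorizations can be checked with a short count. This is a sketch that assumes the same number of input and output channels in every layer; the channel count C = 64 is hypothetical.

```python
def conv_weights(kernel_h, kernel_w, channels):
    """Weights of a convolution with `channels` input and output channels (no bias)."""
    return kernel_h * kernel_w * channels * channels


C = 64  # hypothetical channel count

# One 5x5 convolution vs. two stacked 3x3 convolutions (same receptive field).
five_by_five = conv_weights(5, 5, C)
two_three_by_three = 2 * conv_weights(3, 3, C)

# One NxN convolution vs. a 1xN convolution followed by an Nx1 convolution.
N = 7
n_by_n = conv_weights(N, N, C)
asymmetric = conv_weights(1, N, C) + conv_weights(N, 1, C)

print(two_three_by_three / five_by_five)  # ratio 18/25
print(asymmetric / n_by_n)                # ratio 2N/N^2 = 2/7
```

The ratios do not depend on C: the 3x3 pair uses 18/25 of the weights of the 5x5 convolution, and the asymmetric pair uses 2/N of the weights of the NxN one.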
Convolution factorization II
The original Inception module (left) and the new factorized module (right)
Efficient grid size reduction - problem
Suppose we want to pass from a $d \times d$ grid with $k$ filters to a $\frac{d}{2} \times \frac{d}{2}$ grid with $2k$ filters
We need to compute a stride-1 convolution and then a pooling
Computational cost dominated by convolutions: $2d^2k^2$ operations
Inverting the order, the number of operations is reduced to $2\left(\frac{d}{2}\right)^2 k^2$, but we violate the bottleneck principle
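Plugging numbers into the two cost expressions above makes the gap concrete. The grid size and filter count used here are hypothetical.

```python
def grid_reduction_costs(d, k):
    """Operation counts for the two orderings, using the expressions from the slide."""
    conv_then_pool = 2 * d**2 * k**2         # convolution on the full d x d grid
    pool_then_conv = 2 * (d // 2)**2 * k**2  # convolution on the pooled d/2 x d/2 grid
    return conv_then_pool, pool_then_conv


# Hypothetical sizes (d is assumed even so that d/2 is an integer).
expensive, cheap = grid_reduction_costs(d=32, k=64)
print(expensive, cheap, expensive // cheap)
```

Pooling first is four times cheaper, which is exactly the saving that has to be traded against the representational bottleneck it introduces.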
Efficient grid size reduction - solution
The solution is an Inception module with parallel convolution and pooling blocks, both with stride 2, whose outputs are concatenated
Computationally efficient and no representational bottleneck introduced
The new architecture
Using various modified Inception modules, here is the new Inception-v2 architecture
Inception-v2: modules used
n = 7
Inception-v2: training and observations
The network was trained on the ILSVRC 2012 images using stochastic gradient descent and the TensorFlow library
Experiments showed the two auxiliary classifiers to have less impact on training convergence than expected
In the early training phases, the model performance was not affected by the presence of the auxiliary classifiers: they only improved the performance near the end of training
Removing the lower auxiliary classifier didn't have any effect
The main classifier performs better if batch normalization or dropout is added to the auxiliary ones
The model was also trained and tested on smaller receptive fields with only a small loss of top-1 accuracy (76.6% for 299x299 RF vs. 75.2% for 79x79 RF). Important for post-classification of detections
Inception-v2 to Inception-v3 results (single model)
Each row's Inception-v2 model adds a feature with respect to the previous row's model
The last row's model is referred to as the Inception-v3 model
Inception-v3 vs other models (single and ensemble)
Single model results and ensemble results
On the ILSVRC 2012 dataset, there is a significant improvement over state-of-the-art models, both with a single model and with an ensemble of models
Note that the ensemble errors here are validation errors (except for the one marked with *, which is a test error)
Fully convolutional networks
Semantic segmentation
Image segmentation is the process of partitioning an image into multiple segments (sets of pixels or super-pixels)
Semantic segmentation is the partitioning of an image into semantically meaningful parts, classifying each part into one of the pre-determined classes
It's possible to achieve the same result with pixel-wise classification, i.e. assigning a class to each pixel
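Pixel-wise classification can be sketched as taking, for every pixel, the class with the highest score. The score layout (`scores[c][i][j]`) and the toy values are assumptions for illustration.

```python
def segment(scores):
    """Turn per-class score maps into a label map.

    scores[c][i][j] is the score of class c at pixel (i, j); the result
    holds, for every pixel, the index of the highest-scoring class.
    """
    n_classes = len(scores)
    height, width = len(scores[0]), len(scores[0][0])
    return [[max(range(n_classes), key=lambda c: scores[c][i][j])
             for j in range(width)]
            for i in range(height)]


# Toy 2-class scores on a 2x2 image (hypothetical values).
scores = [
    [[0.9, 0.2], [0.4, 0.1]],  # class 0
    [[0.1, 0.8], [0.6, 0.9]],  # class 1
]
labels = segment(scores)
print(labels)
```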
Fully convolutional networks
Shelhamer et al. [8] showed that fully convolutional networks trained pixels-to-pixels exceed the previous state of the art in semantic segmentation
The fully convolutional networks they proposed take input of arbitrary size and produce correspondingly-sized output to make dense predictions
Convolutionalization of a classic net I
Typical recognition nets (AlexNet, GoogLeNet, etc.) take fixed-sized inputs and produce non-spatial outputs
The fully connected layers have fixed dimensions and drop the spatial coordinates
However, we can view these fully connected layers as convolutions with kernels that cover their entire input regions
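The equivalence can be illustrated on a tiny single-channel example: a fully connected unit on a 2x2 input gives the same value as a valid convolution with a 2x2 kernel, and the same kernel applied to a larger input yields one score per 2x2 patch. The weights, bias and inputs are hypothetical.

```python
def fully_connected(patch, weights, bias):
    """Dense layer with one output unit, applied to a flattened 2D patch."""
    flat_x = [v for row in patch for v in row]
    flat_w = [v for row in weights for v in row]
    return sum(x * w for x, w in zip(flat_x, flat_w)) + bias


def conv2d_valid(image, kernel, bias):
    """Single-channel valid cross-correlation with stride 1."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [[sum(image[i + a][j + b] * kernel[a][b]
                 for a in range(kh) for b in range(kw)) + bias
             for j in range(out_w)]
            for i in range(out_h)]


weights = [[1, 0], [0, -1]]  # hypothetical 2x2 "fully connected" weights
bias = 0.5

# On an input of the original size the convolution gives a 1x1 map
# holding exactly the dense layer's output.
small = [[1, 2], [3, 4]]
assert conv2d_valid(small, weights, bias) == [[fully_connected(small, weights, bias)]]

# On a larger input the same weights produce a spatial map of scores,
# one per 2x2 patch of the input.
large = [[1, 2, 3], [4, 5, 6], [7, 8, 10]]
score_map = conv2d_valid(large, weights, bias)
print(score_map)
```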
Convolutionalization of a classic net II
These fully convolutional networks take input of any size and output classification maps
The resulting maps are equivalent to the evaluation of the original network on particular input patches
The new network is more than 5 times faster than evaluating the original network patch by patch, both at learning time and at inference time (considering a 10x10 output grid)
Note that the output dimensions are typically reduced by subsampling
So output interpolation is needed to obtain dense predictions
The interpolation is obtained through backwards convolutions
Backwards strided convolution
Upsampling from 3x3 grid to 5x5
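A minimal sketch of a backwards (transposed) convolution that maps a 3x3 grid to a 5x5 one. The configuration used here (stride 2, a 3x3 kernel, one border pixel cropped on each side) is only one of the settings that give this size and is an assumption, as is the bilinear kernel; the output size follows (n - 1) * stride + kernel - 2 * padding.

```python
def transposed_conv2d(x, kernel, stride, padding=0):
    """Single-channel transposed convolution.

    Every input value scatters a scaled copy of the kernel into the
    output; `padding` pixels are then cropped from each border.
    """
    n_h, n_w = len(x), len(x[0])
    k_h, k_w = len(kernel), len(kernel[0])
    full_h = (n_h - 1) * stride + k_h
    full_w = (n_w - 1) * stride + k_w
    full = [[0.0] * full_w for _ in range(full_h)]
    for i in range(n_h):
        for j in range(n_w):
            for a in range(k_h):
                for b in range(k_w):
                    full[i * stride + a][j * stride + b] += x[i][j] * kernel[a][b]
    return [row[padding:full_w - padding] for row in full[padding:full_h - padding]]


# Bilinear interpolation kernel (an illustrative choice, not a learned filter).
bilinear = [[0.25, 0.5, 0.25],
            [0.5, 1.0, 0.5],
            [0.25, 0.5, 0.25]]

x = [[1, 2, 3],
     [4, 5, 6],
     [7, 8, 9]]
up = transposed_conv2d(x, bilinear, stride=2, padding=1)
for row in up:
    print(row)
```

With this kernel the original values are kept at the even positions of the 5x5 output and the positions in between are filled with averages of their neighbours.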
Architecture I
Coarse and local information is fused by combining lower and higher layers
3 network types, fusing different layers, were tested
Architecture II
3 proven classification architectures were transformed to fully convolutional: AlexNet, VGG16 and GoogLeNet
Each net's final classifier layer is discarded and all the fully connected layers are converted to convolutions
A 1x1 convolution with 21 channels (the number of classes in the PASCAL VOC 2011 dataset, background included) is added to the end, followed by a backwards convolution layer
Architecture III
The original nets were first pre-trained using image classification
Then they were transformed to fully convolutional for fine-tuning on whole images (using SGD with momentum)
The best results were obtained with FCN-VGG16
Training on whole images proved to be as effective as sampling patches
Architecture comparison
The first models (FCN-32s) didn't fuse different layers, and the resulting output is very coarse
They then fused lower layers with the last one (as shown earlier) to obtain better results (mean IU 62.7 for FCN-8s vs. 59.4 for FCN-32s)
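The mean IU (mean intersection over union) metric quoted above can be sketched from a pixel-level confusion matrix as follows. The confusion matrix is a toy example, and skipping classes that appear in neither the ground truth nor the prediction is an assumption of this sketch.

```python
def mean_iu(confusion):
    """Mean intersection over union from a confusion matrix.

    confusion[i][j] is the number of pixels of true class i predicted as class j.
    """
    n = len(confusion)
    ious = []
    for c in range(n):
        true_pos = confusion[c][c]
        actual = sum(confusion[c])                          # pixels of class c
        predicted = sum(confusion[r][c] for r in range(n))  # pixels predicted as c
        union = actual + predicted - true_pos
        if union > 0:  # skip classes absent from both truth and prediction
            ious.append(true_pos / union)
    return sum(ious) / len(ious)


# Toy 2-class confusion matrix (hypothetical pixel counts).
confusion = [[8, 2],
             [1, 9]]
print(mean_iu(confusion))
```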
Results comparison I
The model reaches state-of-the-art performance on semantic segmentation
The model is also much faster at inference time than previous architectures
Results comparison II
Hypercolumns
Hypercolumns I
The last layer of a CNN captures general features of the image, but is too coarse spatially to allow precise localization
Earlier layers instead may be precise in localization but will not capture semantics
Hariharan et al. [9] presented the hypercolumn concept, which puts together the information from both higher and lower layers to obtain better results on 3 fine-grained localization tasks:
Simultaneous detection and segmentation
Keypoint localization
Part labeling
Hypercolumns II
The hypercolumn corresponding to a given input location is defined as the outputs of all units above that location at all layers of the CNN, stacked into one vector
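The definition can be sketched by resizing every layer's feature maps to a common resolution and concatenating the values found at the chosen location. Nearest-neighbour resizing is used here as a simplification (an assumption; smoother interpolation such as bilinear is an alternative), and the three toy layers are hypothetical.

```python
def upsample_nearest(feature_map, out_h, out_w):
    """Resize a [channels][h][w] feature map to out_h x out_w (nearest neighbour)."""
    h, w = len(feature_map[0]), len(feature_map[0][0])
    return [[[channel[i * h // out_h][j * w // out_w] for j in range(out_w)]
             for i in range(out_h)]
            for channel in feature_map]


def hypercolumn(feature_maps, row, col, out_h, out_w):
    """Stack the activations of every layer above location (row, col)."""
    vector = []
    for fmap in feature_maps:
        resized = upsample_nearest(fmap, out_h, out_w)
        vector.extend(channel[row][col] for channel in resized)
    return vector


# Three toy layers with decreasing resolution (hypothetical activations).
layer1 = [[[r * 4 + c for c in range(4)] for r in range(4)]]  # 1 channel, 4x4
layer2 = [[[10, 20], [30, 40]], [[-1, -2], [-3, -4]]]         # 2 channels, 2x2
layer3 = [[[100]]]                                            # 1 channel, 1x1

column = hypercolumn([layer1, layer2, layer3], row=1, col=3, out_h=4, out_w=4)
print(column)
```

The resulting vector has one entry per channel across all layers, here 1 + 2 + 1 = 4 entries.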
Problem setting I
Input: a set of detections (subjected to non-maximum suppression), each with a bounding box, a category label and a score
Depending on the task, for each detection we want to:
segment out the object
segment its parts
predict its keypoints
Whichever the task, the bounding boxes are slightly expanded and a 50x50 heatmap is predicted on each of them
Problem setting II
The information encoded in each heatmap and the number of heatmaps depend on the chosen task:
For segmentation, the heatmap encodes the probability that a particular location is inside the object
For part labeling, a separate heatmap is predicted for each part, where each heatmap encodes the probability that a location belongs to that part
For keypoint localization, a separate heatmap is predicted for each keypoint, with each heatmap encoding the probability that the keypoint is at a particular location
The heatmaps are finally resized to the size of the expanded bounding boxes
So all the tasks are solved by assigning a probability to each of the 50x50 locations
Problem setting III
For each of the 50x50 locations and for each category a classifier should be trained
But doing so has 3 problems:
The amount of data that each classifier sees during training is heavily reduced
Training so many classifiers is computationally expensive
While the classifier should vary according to the location, adjacent pixels should be classified similarly
The solution is to train a coarse $K \times K$ (usually $K = 5$ or $K = 10$) grid of classifiers and interpolate between them
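One way to picture the interpolation is to take the scores that the K x K grid classifiers give at one of the 50x50 locations and blend them with weights that depend on how close the location is to each grid cell. The bilinear weighting scheme and the cell-centre convention below are assumptions of this sketch, not necessarily the exact scheme used in [9].

```python
def interpolation_weights(pos, size, k):
    """1D linear interpolation weights of pixel `pos` in [0, size) over k grid cells."""
    # Map the pixel centre to grid coordinates, with cell centres at 0..k-1.
    g = (pos + 0.5) * k / size - 0.5
    g = min(max(g, 0.0), k - 1.0)
    lo = int(g)
    hi = min(lo + 1, k - 1)
    frac = g - lo
    weights = [0.0] * k
    weights[lo] += 1.0 - frac
    weights[hi] += frac
    return weights


def interpolated_score(grid_scores, row, col, size=50):
    """Blend the scores of a K x K grid of classifiers at location (row, col)."""
    k = len(grid_scores)
    w_row = interpolation_weights(row, size, k)
    w_col = interpolation_weights(col, size, k)
    return sum(w_row[a] * w_col[b] * grid_scores[a][b]
               for a in range(k) for b in range(k))


# Hypothetical scores of a 5x5 grid of classifiers at one location.
grid = [[float(a * 5 + b) for b in range(5)] for a in range(5)]
print(interpolated_score(grid, 0, 0))
print(interpolated_score(grid, 25, 25))
```

Locations near a grid cell's centre are dominated by that cell's classifier, and the blend changes smoothly in between, so adjacent pixels are classified similarly.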
Network architecture
[Figure: network diagram with three parallel conv and upsample branches, followed by a sigmoid and classifier interpolation]
Note: inverting the order of upsampling and convolutions (which calculate the $K \times K$ grids) and computing them separately for each of the 3 combined layers reduces the computational cost
Bounding box refining
A special technique, called rescoring, is used to improve the box selection
SDS results
Keypoint prediction results
Part labeling results
Conclusion
Conclusion
We have seen how Inception modules make it possible to train deeper and better networks in a computationally efficient manner
We have then observed how to transform a classification CNN into a fully convolutional network for pixel-wise classification
We have learned the hypercolumn technique to combine high and low level information to improve the accuracy on various fine-grained localization tasks
Thank you for your patience! :)
References I
[1] Y. LeCun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, "Backpropagation applied to handwritten zip code recognition," Neural Computation, vol. 1(4), pp. 541-551, 1989.
[2] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proc. IEEE, vol. 86, pp. 2278-2324, 1998.
[3] A. W. Harley, "An interactive node-link visualization of convolutional neural networks," in ISVC, pp. 867-877, 2015.
[4] A. Kurenkov, "A brief history of neural nets and deep learning, part 4." http://www.andreykurenkov.com/writing/a-brief-history-of-neural-nets-and-deep-learning-part-4/
References II
[5] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," Advances in Neural Information Processing Systems, vol. 25, pp. 1106-1114, 2012.
[6] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. E. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," CoRR, vol. abs/1409.4842, 2014.
[7] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, "Rethinking the inception architecture for computer vision," CoRR, vol. abs/1512.00567, 2015.
[8] E. Shelhamer, J. Long, and T. Darrell, "Fully convolutional networks for semantic segmentation," CoRR, vol. abs/1605.06211, 2016.
References III
[9] B. Hariharan, P. A. Arbelaez, R. B. Girshick, and J. Malik, "Hypercolumns for object segmentation and fine-grained localization," CoRR, vol. abs/1411.5752, 2014.