Li Zhang, Google
Prevalent in computer vision in the last 5 years:
● Image classification
● Object detection
● Image segmentation
● Image captioning
● Visual question answering
● Image synthesis
● ...
More data, more compute => bigger models, better results.
Okay as a cloud service; more challenging on mobile or even embedded devices.
● Reduce the number of channels in each layer
● Low-rank decomposition
○ Speeding up convolutional neural networks with low rank expansions. BMVC, 2014.
○ Efficient and accurate approximations of nonlinear convolutional networks. CVPR, 2015.
○ ResNet, Inception
● Reduce connections
○ Deep compression: Compressing deep neural networks with pruning, trained quantization and Huffman coding. ICLR, 2016.
● Cascade classifier
○ A convolutional neural network cascade for face detection. CVPR, 2015.
○ Exploit all the layers: Fast and accurate CNN object detector with scale dependent pooling and cascaded rejection classifiers. CVPR, 2016.
● Glimpse-based attention model
○ Learning to combine foveal glimpses with a third-order Boltzmann machine. NIPS, 2010.
○ Recurrent models of visual attention. NIPS, 2014.
○ Multiple object recognition with visual attention. ICLR, 2015.
○ Spatial transformer networks. NIPS, 2015.
● Switch on/off subnets
○ Dynamic capacity networks. ICML, 2016.
○ PerforatedCNNs: Acceleration through elimination of redundant convolutions. NIPS, 2016.
○ Conditional computation in neural networks for faster models. ICLR Workshop, 2016.
○ Branchynet: Fast inference via early exiting from deep neural networks. ICPR, 2016.
[Example images: http://mscoco.org/explore/?id=19431, https://en.wikipedia.org/wiki/Westphalian_horse]
Consider an RNN: output = state.
[Figure: the RNN unrolled over t = 1…6, applying the transition function F to produce states s1…s6. Output: s6.]
https://arxiv.org/abs/1603.08983
[Figure: RNN-ACT. At each step a halting unit H emits a halting probability (0.01, 0.1, 0.7, 0.5 at t = 1…4). A cumulative sum of these probabilities is tracked, and computation halts once it exceeds 1 - ε: here at t = 4, since 0.01 + 0.1 + 0.7 + 0.5 > 1 - ε. Remainder: 1 - 0.01 - 0.1 - 0.7 = 0.19. Output: 0.01 s1 + 0.1 s2 + 0.7 s3 + 0.19 s4. Ponder cost ρ = N + R = 4 + 0.19. Differentiable w.r.t. the halting probabilities!]
https://arxiv.org/abs/1603.08983
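To make the rule concrete, here is a minimal Python sketch of the ACT combination step. The halting probabilities are the example values from the figure, not learned quantities, and a real implementation also caps the number of steps; because the output is a linear combination of the states, it is differentiable w.r.t. the halting probabilities.

```python
def act_combine(states, halt_probs, eps=0.01):
    """Combine RNN states under the ACT halting rule.

    states:      s_1, s_2, ... (scalars here for simplicity)
    halt_probs:  halting probability h_t emitted by H at each step
    Returns (output, ponder_cost).
    """
    cum, output = 0.0, 0.0
    for n, (s, h) in enumerate(zip(states, halt_probs), start=1):
        # Halt once the cumulative sum would exceed 1 - eps
        # (halting is forced at the last step).
        if cum + h > 1 - eps or n == len(states):
            remainder = 1.0 - cum            # R = 1 - sum of earlier h_t
            return output + remainder * s, n + remainder  # rho = N + R
        cum += h
        output += h * s

# The example from the figure: halts at t = 4,
# output = 0.01*s1 + 0.1*s2 + 0.7*s3 + 0.19*s4, ponder cost = 4 + 0.19.
out, rho = act_combine([1.0, 2.0, 3.0, 4.0], [0.01, 0.1, 0.7, 0.5])
print(out, rho)  # ≈ 3.07, 4.19
```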
Residual block: y = x + F(x)
http://arxiv.org/abs/1512.03385
[Figure: ResNet architecture (image → groups of residual blocks → avg. pool + fc), with residual blocks randomly dropped during training as in stochastic depth.]
https://arxiv.org/abs/1603.09382
Stochastic depth is a powerful regularizer: it makes the representations of the layers compatible with each other.
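For context, a minimal sketch of stochastic depth from the paper linked above (the function name and keep probability are illustrative): each residual block is skipped at random during training, which is what makes the representations of neighbouring layers interchangeable.

```python
import random

def stochastic_depth_block(x, f, keep_prob=0.8, training=True):
    """Residual block y = x + F(x) trained with stochastic depth.

    During training the block is dropped (pure identity) with
    probability 1 - keep_prob; at test time F is always applied,
    scaled by keep_prob to match the training-time expectation.
    """
    if training:
        return x + f(x) if random.random() < keep_prob else x
    return x + keep_prob * f(x)
```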
[Figure: ACT applied to a group of residual blocks (si = ResNet block activation). Blocks F1…F4 with halting units H1…H4 emit halting probabilities 0.1, 0.1, 0.1, 0.9; halting occurs at the fourth block, with remainder 1 - 0.1 - 0.1 - 0.1 = 0.7. Output of the group: 0.1 s1 + 0.1 s2 + 0.1 s3 + 0.7 s4, which is fed to the next block F5. Ponder cost ρ = 4 + 0.7 = 4.7.]
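A sketch of the same loop over a group of residual blocks (the callables `blocks` and `halting_units` stand in for the Fk and Hk in the figure; an illustration, not the paper's implementation). Blocks after the halting point are never evaluated, which is where the computation savings come from.

```python
def act_group(x, blocks, halting_units, eps=0.01):
    """Run one group of residual blocks under ACT.

    blocks:        callables s -> s + F_k(s)    (residual blocks F_k)
    halting_units: callables s -> h_k in [0, 1] (halting units H_k)
    Returns the group output (fed to the next group) and the ponder cost.
    """
    s, cum, output = x, 0.0, 0.0
    for n, (block, halting_unit) in enumerate(zip(blocks, halting_units),
                                              start=1):
        s = block(s)              # evaluate the next residual block
        h = halting_unit(s)
        if cum + h > 1 - eps or n == len(blocks):
            remainder = 1.0 - cum
            return output + remainder * s, n + remainder  # rho = N + R
        cum += h
        output += h * s
```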
[Figures: ponder cost maps (high vs. low ponder cost) for ResNet-110, τ = 0.01, and ResNet-101, τ = 0.001.]
[Figure: Spatially Adaptive Computation Time (SACT) over a group of residual blocks F1…F3 with halting units H1, H2, …: each spatial position accumulates its own halting probability. Once a position halts, its activation is copied from the previous layer; still-active positions are updated by the next residual block.]
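A rough numpy sketch of the per-position bookkeeping (shapes, names, and the averaging of the ponder cost over positions are assumptions for illustration; the real implementation evaluates only the active positions rather than masking afterwards).

```python
import numpy as np

def sact_group(x, blocks, halting_units, eps=0.01):
    """SACT over one group of residual blocks.

    x:             feature map of shape (H, W, C)
    blocks:        callables (H, W, C) -> (H, W, C), s -> s + F_k(s)
    halting_units: callables (H, W, C) -> (H, W) halting probabilities
    Every spatial position halts independently.
    """
    s = x
    hw = x.shape[:2]
    cum = np.zeros(hw)                     # per-position cumulative sum
    ponder = np.zeros(hw)                  # per-position ponder cost
    output = np.zeros_like(x)
    active = np.ones(hw, dtype=bool)
    for n, (block, halting_unit) in enumerate(zip(blocks, halting_units),
                                              start=1):
        prev = s
        s = block(s)
        # halted positions are copied from the previous layer ("copy"),
        # active positions take the new value ("update")
        s = np.where(active[..., None], s, prev)
        h = halting_unit(s)
        halt_now = active & ((cum + h > 1 - eps) | (n == len(blocks)))
        still = active & ~halt_now
        # weights: remainder for newly halted positions, h for the rest
        w = np.where(halt_now, 1.0 - cum, np.where(still, h, 0.0))
        output += w[..., None] * s
        ponder += active + np.where(halt_now, 1.0 - cum, 0.0)  # N + R
        cum += np.where(still, h, 0.0)
        active = still
        if not active.any():               # everything halted: stop early
            break
    return output, ponder.mean()           # mean ponder cost over positions
```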
SACT is a strict generalization of ACT (consider zero weights for the 3x3 conv).
[Figure: the halting unit maps an activation si to a halting probability hi: a 3x3 conv of si and a linear model of the globally average-pooled si are added and passed through a sigmoid σ.]
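A numpy sketch of that halting unit; the parameter shapes (conv_w of shape (3, 3, C), lin_w of shape (C,)) and the use of scipy's convolve are stand-ins for a learned 3x3 convolution and linear layer.

```python
import numpy as np
from scipy.ndimage import convolve

def halting_unit(s, conv_w, conv_b, lin_w, lin_b):
    """Per-position halting probabilities for an activation s of shape
    (H, W, C): sigmoid of a 3x3 conv of s plus a linear model of the
    globally average-pooled s, broadcast over positions.
    """
    # 3x3 convolution collapsing the C channels into one logit map
    local = sum(convolve(s[..., c], conv_w[..., c], mode="nearest")
                for c in range(s.shape[-1])) + conv_b        # (H, W)
    # global average pooling followed by a linear model -> one scalar
    glob = s.mean(axis=(0, 1)) @ lin_w + lin_b               # scalar
    return 1.0 / (1.0 + np.exp(-(local + glob)))             # sigmoid
```

With the 3x3 conv weights set to zero, every position gets the same halting probability and plain ACT is recovered, which is the "strict generalization" noted above.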
Two ways to train the model:
● From scratch
● Warm-up with a pretrained model (the following results use this)
Important trick: initialize the biases of the halting units with negative values, so the halting probabilities start small and all blocks are used early in training.
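A sketch of how the ponder cost enters the objective (τ is the trade-off constant quoted in the result slides; the bias value below is an illustrative choice, not from the talk):

```python
HALTING_BIAS_INIT = -3.0  # negative bias => sigmoid starts near 0, so
                          # halting probabilities are small and (almost)
                          # all blocks are used early in training

def total_loss(task_loss, ponder_costs, tau=0.005):
    """Joint objective: the task loss plus tau times the ponder cost
    summed over all ACT/SACT groups (and spatial positions)."""
    return task_loss + tau * sum(ponder_costs)
```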
[Plots: results for ResNet-110, τ = 0.01, and ResNet-101, τ = 0.005.]
Suppose that the average number of blocks used in the four groups is 3 - 3.9 - 13.7 - 3.
Baseline: train a ResNet with 3 - 4 - 14 - 3 blocks from scratch, "warming up" from the ResNet-101 network.
Apply the models to images of higher resolution than the training set: SACT improves scale invariance.
● Train on ImageNet classification, fine-tune on COCO detection
● Apply the ponder cost penalty to the feature extractor
Model            | mAP   | Feature extractor FLOPs
ResNet v2 101    | 29.24 | 100%
SACT, τ = 0.001  | 29.04 | 72.44%
SACT, τ = 0.005  | 27.61 | 55.98%
Saliency on the CAT2000 dataset:
● No explicit supervision for attention!
● No center prior
Model                    | AUC-Judd
ImageNet SACT, τ = 0.005 | 77%
COCO SACT, τ = 0.005     | 80%
One human                | 65%
Center prior             | 83%
State of the art         | 87%
Middle of the leaderboard. Kudos to Maxwell for evaluating!
http://saliency.mit.edu/home.html
“Spatially Adaptive Computation Time for Residual Networks” to appear in CVPR 2017, https://arxiv.org/pdf/1612.02297.pdf
● The idea of Adaptive Computation Time can be successfully used for computer vision.
● Adaptive Computation Time
○ Dynamic number of layers in ResNet
● Spatially Adaptive Computation Time
○ Dynamic number of layers for different parts of the image
○ Attention maps for free :)
● Both models
○ Reduce the amount of computation
○ Can be implemented efficiently
○ Work on ImageNet classification (first attention models with this property?)
○ Work on COCO detection