Post on 03-Feb-2021
transcript
AttentionNet for Accurate Localization and Detection of Objects. (To appear in ICCV 2015)
Donggeun Yoo, Sunggyun Park, Joon-Young Lee, Anthony Paek, In So Kweon.
State-of-the-art frameworks for object detection.
1. Region-CNN framework. [Gkioxari et al., CVPR’14]
(−) The maximally scored region is prone to focus on discriminative part (e.g. face)
rather than entire object (e.g. human body).
[Figure: Object proposal → CNN → SVM → NMS → BB Reg.]
2. Detection by CNN-regression. [Szegedy et al., NIPS’13]
(−) Direct mapping from an image to an exact bounding box is relatively difficult for a CNN.
[Figure: the CNN outputs four values (x1, y1, x2, y2), i.e. the two box corners (X1,Y1) and (X2,Y2).]
Idea: Ensemble of weak predictions.
[Figure: a sequence of weak predictions, each ending in a stop signal.]
Model: Rather than a CNN regression model, use a CNN classification model.
Layers: Convolution, Normalization, Pooling, Convolution, Normalization, Pooling, Convolution, Convolution, Convolution, Fully connected, Fully connected.
Two outputs, each [ 3 directions, stop signal, no object ] ∈ ℜ^5:
Top-left direction prediction: → ↘ ↓ • F
Bottom-right direction prediction: ← ↖ ↑ • F
Iterative test: Ensemble of weak directions.
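The iterative test can be sketched roughly as below. This is only an illustration under stated assumptions: the class names, the fixed pixel `step`, the `max_iters` cap, and the toy predictor that stands in for the trained CNN are all hypothetical and do not come from the slides.

```python
from typing import Callable, Optional, Tuple

Box = Tuple[int, int, int, int]  # (x1, y1, x2, y2), image y grows downward

# Corner displacement per predicted direction class (hypothetical names).
TL_MOVES = {"right": (1, 0), "down_right": (1, 1), "down": (0, 1)}
BR_MOVES = {"left": (-1, 0), "up_left": (-1, -1), "up": (0, -1)}


def iterative_test(box: Box,
                   predict: Callable[[Box], Tuple[str, str]],
                   step: int = 8,
                   max_iters: int = 100) -> Optional[Box]:
    """Move both corners by weak direction predictions until both say stop."""
    x1, y1, x2, y2 = box
    for _ in range(max_iters):
        tl, br = predict((x1, y1, x2, y2))
        if tl == "none" or br == "none":
            return None  # "no object" (F) on either corner: give up
        if tl == "stop" and br == "stop":
            return (x1, y1, x2, y2)  # both corners emitted the stop signal
        if tl in TL_MOVES:
            dx, dy = TL_MOVES[tl]
            x1, y1 = x1 + dx * step, y1 + dy * step
        if br in BR_MOVES:
            dx, dy = BR_MOVES[br]
            x2, y2 = x2 + dx * step, y2 + dy * step
    return None  # assumption: treat non-convergence as a failed detection


def make_toy_predictor(target: Box, step: int) -> Callable[[Box], Tuple[str, str]]:
    """Stand-in for the CNN: an oracle that knows the target box."""
    tx1, ty1, tx2, ty2 = target

    def predict(box: Box) -> Tuple[str, str]:
        x1, y1, x2, y2 = box
        right, down = tx1 - x1 >= step, ty1 - y1 >= step
        left, up = x2 - tx2 >= step, y2 - ty2 >= step
        tl = ("down_right" if right and down else
              "right" if right else "down" if down else "stop")
        br = ("up_left" if left and up else
              "left" if left else "up" if up else "stop")
        return tl, br

    return predict
```

With the toy predictor, `iterative_test((0, 0, 200, 200), make_toy_predictor((40, 24, 160, 120), 8))` shrinks the initial box onto the target corner by corner; in the real system the two predictions would come from the two output layers of the network.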
Training AttentionNet.
1. Generating training samples.
2. Minimizing the loss function by back-propagation and stochastic gradient descent.
$$L = \frac{1}{2} L_{\text{softmax}}(y_{TL}, t_{TL}) + \frac{1}{2} L_{\text{softmax}}(y_{BR}, t_{BR}).$$
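A minimal NumPy sketch of this loss for a single training sample follows. It assumes that each y is a vector of five pre-softmax scores and each t is an integer class index; the function names are hypothetical and the slides do not show the actual training code.

```python
import numpy as np


def softmax_cross_entropy(scores: np.ndarray, target: int) -> float:
    """Softmax loss for one sample: -log softmax(scores)[target]."""
    shifted = scores - np.max(scores)  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[target])


def attentionnet_loss(y_tl: np.ndarray, t_tl: int,
                      y_br: np.ndarray, t_br: int) -> float:
    """Equal-weight average of the top-left and bottom-right softmax losses."""
    return (0.5 * softmax_cross_entropy(y_tl, t_tl)
            + 0.5 * softmax_cross_entropy(y_br, t_br))
```

With uniform scores on both outputs, each term equals log 5, so the total is also log 5; scores that favor the target classes give a smaller value.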
Result. (Good examples.)
Result. (Bad examples.)
How to detect multiple instances?
Extension to multiple-instance:
1. Fast multi-scale sliding window search using a fully convolutional network.
*Fast extraction of multi-scale dense activations.
Idea: A fully connected layer can be equivalently implemented as a convolutional layer.
[Figure: two copies of the network (Conv. 1 to Conv. 5, FC 6 to FC 8), with inputs 227×227×3 and 322×322×3; in the second copy FC 6 and FC 7 become Conv. 6 and Conv. 7.]
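The equivalence can be checked numerically with a small sketch. The shapes below are toy values chosen for illustration, not the real layer sizes, and both function names are hypothetical. A fully connected layer over a C×k×k feature map is the same computation as a k×k convolution, so a larger feature map yields a grid of outputs, one per k×k patch.

```python
import numpy as np


def fc_forward(weights: np.ndarray, feature_map: np.ndarray) -> np.ndarray:
    """Fully connected layer: flatten the (C, k, k) map and multiply."""
    return weights @ feature_map.reshape(-1)


def fc_as_conv(weights: np.ndarray, feature_map: np.ndarray, k: int) -> np.ndarray:
    """Same weights applied as a k x k 'valid' convolution with stride 1."""
    channels, height, width = feature_map.shape
    kernels = weights.reshape(-1, channels, k, k)  # one kernel per FC output unit
    out = np.empty((kernels.shape[0], height - k + 1, width - k + 1))
    for i in range(height - k + 1):
        for j in range(width - k + 1):
            patch = feature_map[:, i:i + k, j:j + k]
            # Contract over (channels, k, k): one FC evaluation per patch.
            out[:, i, j] = np.tensordot(kernels, patch, axes=3)
    return out
```

On a feature map of exactly k×k the convolution returns a single column identical to the fully connected output; on a larger map every spatial position holds the activation vector of the corresponding patch, which is what allows dense extraction without cropping each window separately.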
*Fast extraction of multi-scale dense activations.
[Figure: multi-scale dense activations, each of dimension 4,096.]
Each activation vector comes from one patch.
Extension to multiple-instance:
2. Early rejection with {↘TL, ↖BR} constraint.
Satisfying {↘TL, ↖BR}: Start iterative test.
Not satisfying {↘TL, ↖BR}: Reject.
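The constraint can be written as a simple filter over sliding windows. In this sketch the string labels and function names are hypothetical, and the predictions are assumed to be the top-scoring class of each output layer for each window.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]


def passes_early_rejection(tl_pred: str, br_pred: str) -> bool:
    """Keep a window only if TL predicts down-right and BR predicts up-left."""
    return tl_pred == "down_right" and br_pred == "up_left"


def filter_windows(windows: Sequence[Box],
                   predictions: Sequence[Tuple[str, str]]) -> List[Box]:
    """Return the windows on which the iterative test should be started."""
    return [w for w, (tl, br) in zip(windows, predictions)
            if passes_early_rejection(tl, br)]
```

Both corners pointing inward means the window encloses the instance, so only those windows go on to the iterative test; every other window is rejected after a single forward pass.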
Extension to multiple-instance: Overall architecture for sliding window search.
Extension to multiple-instance: Merging multiple bounding boxes.
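The slides do not state the merging rule, so the sketch below is purely an assumption: it greedily groups boxes whose intersection-over-union with a group's first box reaches a threshold, then averages each group. The threshold value and all names are hypothetical.

```python
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def merge_boxes(boxes: Sequence[Box], threshold: float = 0.5) -> List[Box]:
    """Assumed rule: greedy IoU grouping, then coordinate-wise averaging."""
    groups: List[List[Box]] = []
    for box in boxes:
        for group in groups:
            if iou(box, group[0]) >= threshold:
                group.append(box)
                break
        else:
            groups.append([box])  # no existing group overlaps enough
    return [tuple(sum(coord) / len(group) for coord in zip(*group))
            for group in groups]
```

Averaging keeps one box per cluster of nearby detections; other choices, such as keeping the highest-scoring box in each group, would fit the same outline.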
Evaluation on PASCAL VOC Series.
[Table: PASCAL VOC 2007 “Person” and PASCAL VOC 2012 “Person”. Entries: 58.7 RCNN; RCNN-based; AttentionNet; AttentionNet+RCNN.]
[Figure: Precision-recall curve on PASCAL VOC 2007 “Person”.]