Jifeng Dai
With Haozhi Qi*, Zheng Zhang, Bin Xiao, Han Hu, Bowen Cheng*, Yichen Wei
Visual Computing Group
Microsoft Research Asia
(*interns at MSRA)
Deformable Convolutional Networks: MSRA COCO Detection & Segmentation Challenge 2017 Entry
Outline
• Deformable ConvNets idea
• Deformable ConvNets for COCO challenge
Highlights
• Enabling effective modeling of spatial transformation in ConvNets
• No additional supervision for learning spatial transformation
• Significant accuracy improvements on sophisticated vision tasks
Code is available at https://github.com/msracver/Deformable-ConvNets
Modeling Spatial Transformations
• A long-standing problem in computer vision
(Figure panels: deformation; scale; viewpoint variation; intra-class variation)
(Some examples are taken from Li Fei-fei’s course CS223B, 2009-2010)
Traditional Approaches
• 1) To build training datasets with sufficient desired variations
• 2) To use transformation-invariant features and algorithms
• Drawbacks: geometric transformations are assumed to be fixed and known; invariant features and algorithms must be hand-crafted
Examples: Scale Invariant Feature Transform (SIFT), Deformable Part-based Model (DPM)
Spatial Transformations in CNNs
• Regular CNNs are inherently limited in modeling large, unknown transformations
• The limitation originates from the fixed geometric structure of CNN modules
(Figure panels: regular convolution; regular RoI pooling; 2 layers of regular convolution)
Spatial Transformer Networks
• Learning a global, parametric transformation on feature maps
• The transformation family is fixed in advance, which is infeasible for complex vision tasks
Deformable Convolution
• Local, dense, non-parametric transformation
• Learning to deform the sampling locations in the convolution/RoI pooling modules
(Figure panels: regular; deformed; scale & aspect ratio; rotation)
Deformable Convolution
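Regular convolution:
$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n)$$
Deformable convolution:
$$y(p_0) = \sum_{p_n \in \mathcal{R}} w(p_n) \cdot x(p_0 + p_n + \Delta p_n)$$
where $p_0$ is an output location, $p_n$ enumerates the regular sampling grid $\mathcal{R}$, and the offset $\Delta p_n$ is generated by a sibling branch of regular convolution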
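The sampling rule on this slide can be sketched in a few lines of NumPy. This is a minimal illustration, not the released implementation: it assumes a single-channel feature map, stride 1, zero padding, and bilinear interpolation for fractional sampling locations. The function names `bilinear_sample` and `deformable_conv2d` and the `offsets` layout are hypothetical, and the offsets are passed in directly rather than predicted by a sibling branch.

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Sample the 2-D array x at a fractional location (py, px); zero outside x."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1.0 - abs(py - yy)) * (1.0 - abs(px - xx))
                value += weight * x[yy, xx]
    return value

def deformable_conv2d(x, w, offsets):
    """Deformable convolution on one channel (assumed layout, for illustration).

    x:       (H, W) input feature map
    w:       (k, k) kernel weights, k odd
    offsets: (H, W, k*k, 2) offset (dy, dx) for every output location and kernel tap
    Returns an (H, W) output (stride 1, zero padding).
    """
    h, wd = x.shape
    k = w.shape[0]
    r = k // 2
    # Regular sampling grid, e.g. (-1,-1) ... (1,1) for a 3x3 kernel.
    grid = [(dy, dx) for dy in range(-r, r + 1) for dx in range(-r, r + 1)]
    y = np.zeros((h, wd))
    for i in range(h):
        for j in range(wd):
            acc = 0.0
            for n, (dy, dx) in enumerate(grid):
                off_y, off_x = offsets[i, j, n]
                # Sample at the regular grid location plus its learned offset.
                acc += w[dy + r, dx + r] * bilinear_sample(
                    x, i + dy + off_y, j + dx + off_x)
            y[i, j] = acc
    return y
```

With all offsets set to zero the function reduces to a regular convolution (cross-correlation form), which is why the deformable module can be a drop-in replacement for the regular one.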
Deformable RoI Pooling
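Regular RoI pooling:
$$y(i, j) = \sum_{p \in \mathrm{bin}(i, j)} x(p_0 + p) / n_{ij}$$
Deformable RoI pooling:
$$y(i, j) = \sum_{p \in \mathrm{bin}(i, j)} x(p_0 + p + \Delta p_{ij}) / n_{ij}$$
where $p_0$ is the top-left corner of the RoI, $n_{ij}$ is the number of pixels in bin $(i, j)$, and the offset $\Delta p_{ij}$ is generated by a sibling fc branch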
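A similar sketch for deformable RoI pooling: each of the k x k bins is averaged after being shifted by its own offset. The assumptions are loud ones. The map has a single channel, each bin is averaged over a fixed grid of bilinear samples rather than over its integer pixels, offsets are given directly in pixels instead of being predicted by an fc branch, and the names `deformable_roi_pool`, `samples`, and the `roi` tuple layout are hypothetical.

```python
import numpy as np

def bilinear_sample(x, py, px):
    """Sample the 2-D array x at a fractional location (py, px); zero outside x."""
    h, w = x.shape
    y0, x0 = int(np.floor(py)), int(np.floor(px))
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1.0 - abs(py - yy)) * (1.0 - abs(px - xx))
                value += weight * x[yy, xx]
    return value

def deformable_roi_pool(x, roi, k, offsets, samples=2):
    """Average-pool an RoI into k x k bins, shifting each bin by its own offset.

    x:       (H, W) feature map
    roi:     (top, left, height, width) in feature-map coordinates
    offsets: (k, k, 2) per-bin offset (dy, dx), in pixels (assumed layout)
    samples: sampling points per bin side
    """
    top, left, roi_h, roi_w = roi
    bin_h, bin_w = roi_h / k, roi_w / k
    y = np.zeros((k, k))
    for i in range(k):
        for j in range(k):
            dy, dx = offsets[i, j]
            total = 0.0
            for a in range(samples):
                for b in range(samples):
                    # Regular sample position inside bin (i, j), plus the bin offset.
                    py = top + (i + (a + 0.5) / samples) * bin_h + dy
                    px = left + (j + (b + 0.5) / samples) * bin_w + dx
                    total += bilinear_sample(x, py, px)
            y[i, j] = total / (samples * samples)
    return y
```

Passing an all-zero `offsets` array gives plain average RoI pooling; a non-zero offset moves only the bin it belongs to, which is what lets individual parts of the RoI follow the object.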
Deformable ConvNets
• Same input & output as the plain versions
• Regular convolution -> deformable convolution
• Regular RoI pooling -> deformable RoI pooling
• End-to-end trainable without additional supervision
Sampling Locations of Deformable Convolution
(a) standard convolution (b) deformable convolution
Part Offsets in Deformable RoI Pooling
Deformable ConvNets for Object Detection
• Regular object detectors: Fast(er) R-CNN, R-FCN, FPN
(Figure: the three detector pipelines. Diagram labels: Conv3, Conv4, Conv5; pyramid levels P3, P4, P5; position-sensitive score map; RoI pooling (Fast(er) R-CNN, FPN); position-sensitive RoI pooling (R-FCN); example classes Car and Person.)
Deformable ConvNets for Object Detection
• Deformable object detectors: Fast(er) R-CNN, R-FCN, FPN
(Figure: the three detector pipelines with deformable modules; a legend marks deformable convolution / RoI pooling. Diagram labels: Conv3, Conv4, Conv5; pyramid levels P3, P4, P5; position-sensitive score map; deformable RoI pooling (Fast(er) R-CNN, FPN); deformable PS RoI pooling (R-FCN); example classes Car and Person.)
XCeption -> Aligned XCeption
(Figure: Aligned XCeption architecture, drawn as entry flow, middle flow, and exit flow. Layer specifications recovered from the diagram:
Entry flow: 224x224x3 images; conv 32, 3x3, stride 2, pad 1; conv 64, 3x3, pad 1; blocks of separable conv 3x3, pad 1 with 128, 256, and 728 channels; max pool 3x3, stride 2, pad 1; 1x1 convolutions with 128, 256, and 728 channels (stride 2, pad 0 where marked); output 14x14x728 feature maps.
Middle flow: three separable conv 728, 3x3, pad 1 layers on 14x14x728 feature maps, repeated 16 times.
Exit flow: 14x14x728 feature maps; separable conv 728, 1024, 1536, 1536, and 2048 (3x3, pad 1); max pool 3x3, stride 2, pad 1; conv 1024 1x1, stride 2, pad 0; output 7x7x2048 feature maps.)
• Proper feature alignment in XCeption
• Efficient: 9.5 GFLOPs on a 224x224 image (ResNet-101: 7.6 GFLOPs)
• Accurate: 2.8% higher mAP than ResNet-101 using FPN on COCO (detection, test-dev)
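The architecture above is built almost entirely from 3x3 separable convolutions. As a rough illustration of why they are cheap, the sketch below counts multiply-accumulates for one such layer on the 14x14x728 middle-flow feature maps. It assumes a separable convolution is a 3x3 depthwise convolution followed by a 1x1 pointwise convolution and ignores biases; the helper names are hypothetical, and the numbers are not an attempt to reproduce the 9.5 GFLOPs total.

```python
def conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates of a regular k x k convolution on an h x w map."""
    return h * w * c_in * c_out * k * k

def separable_conv_macs(h, w, c_in, c_out, k):
    """Multiply-accumulates of a k x k depthwise conv plus a 1x1 pointwise conv."""
    depthwise = h * w * c_in * k * k
    pointwise = h * w * c_in * c_out
    return depthwise + pointwise

# One middle-flow layer: 728 -> 728 channels, 3x3, on a 14x14 map.
regular = conv_macs(14, 14, 728, 728, 3)
separable = separable_conv_macs(14, 14, 728, 728, 3)
ratio = regular / separable  # how many times cheaper the separable layer is
```

Under these assumptions the separable layer needs roughly a ninth of the multiply-accumulates of a regular 3x3 convolution with the same channel counts.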
Object Detection on COCO (Test-dev)
• MSRA 2017 Entry
• ~3% mAP improvement from Deformable ConvNets
• Best single model performance: 48.5%
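(Bar chart: mAP (%) on COCO test-dev, deformable vs. regular, axis from 30 to 50)
Configuration                    Deformable mAP (%)
FPN+OHEM (ResNet-101)            40.5
FPN+OHEM (Aligned XCeption)      43.3
+ Mask                           44.2
+ Soft NMS                       45.3
+ Multi-scale testing            46.9
+ Iterative testing              47.9
+ Horizontal flip                48.5
+ Ensemble (6 models)            50.7
Regular ConvNet bars (four values, as extracted from the chart): 37.4, 40.7, 41.6, 45.2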
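One of the steps listed in the chart is Soft NMS. The slides do not say which variant or parameters were used, so the following is only a sketch of the Gaussian form: instead of discarding boxes that overlap a higher-scoring box, it decays their scores according to the overlap. The function names, `sigma=0.5`, and `score_thresh=0.001` are assumptions for illustration.

```python
import numpy as np

def iou(box, boxes):
    """IoU of one box against an array of boxes, all as (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = (box[2] - box[0]) * (box[3] - box[1])
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area + areas - inter)

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS sketch.

    Returns (indices in selection order, array of decayed scores).
    """
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    remaining = list(range(len(scores)))
    keep = []
    while remaining:
        # Pick the highest-scoring box that has not been selected yet.
        best = max(remaining, key=lambda i: scores[i])
        if scores[best] < score_thresh:
            break
        keep.append(best)
        remaining.remove(best)
        if remaining:
            rest = np.array(remaining)
            overlaps = iou(boxes[best], boxes[rest])
            # Decay, rather than remove, boxes that overlap the selected one.
            scores[rest] *= np.exp(-(overlaps ** 2) / sigma)
    return keep, scores
```

A box with no overlap keeps its score unchanged, while a near-duplicate of a selected box is pushed down the ranking instead of being deleted outright.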
Object Detection on COCO (Test-dev)
• Deformable ConvNets vs. regular ConvNets
• Noticeable improvements for various baselines
• Marginal parameter & computation overhead
(Bar chart: mAP (%) on COCO test-dev, axis from 20 to 50)
Baseline                              Regular   Deformable
Class-aware RPN (ResNet-101)          23.2      25.8
Faster R-CNN, 2fc (ResNet-101)        30.3      35.0
R-FCN (ResNet-101)                    32.1      35.7
R-FCN (Aligned-Inception-ResNet)      34.5      37.5
FPN+OHEM (ResNet-101)                 37.4      40.5
FPN+OHEM (Aligned-XCeption)           40.2      43.3
FPN++ (Aligned-XCeption)              45.2      48.5
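The "marginal parameter overhead" claim can be made concrete with a little arithmetic. The sketch below assumes the offsets of a 3x3 deformable convolution are predicted by a sibling 3x3 convolution with 2 x 9 = 18 output channels (one (dy, dx) pair per kernel tap), and compares its size with the main convolution. The channel count of 512 and the helper names are hypothetical, chosen only for illustration.

```python
def conv_params(c_in, c_out, k):
    """Weights plus biases of a k x k convolution."""
    return c_in * c_out * k * k + c_out

def offset_branch_params(c_in, k):
    """Sibling convolution predicting 2 offsets (dy, dx) per kernel tap."""
    return conv_params(c_in, 2 * k * k, k)

# Hypothetical layer: 512 -> 512 channels, 3x3 kernel.
main = conv_params(512, 512, 3)
extra = offset_branch_params(512, 3)
overhead = extra / main  # fraction of extra parameters added by the offset branch
```

Under these assumptions the offset branch adds about 3.5% to the parameters of the layer it accompanies, and only the layers that are made deformable pay this cost.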
Conclusion
• Deformable ConvNets for dense spatial modeling
• Simple, efficient, deep, and end-to-end
• No additional supervision
• Feasible and effective on sophisticated vision tasks for the first time
• Our team
Jifeng Dai, Haozhi Qi*, Zheng Zhang, Bin Xiao, Han Hu, Bowen Cheng*, Yichen Wei
(* interns at MSRA)