Trimps at ILSVRC2015
Jie SHAO, Xiaoteng ZHANG, Jianying ZHOU, Zhengyan DING,
Wenfei WANG, Lin MEI, Chuanping HU
17 December 2015
The Third Research Institute of the Ministry of Public Security, P.R. China.
Summary of Trimps Submission
• Object localization
─ 2nd place, 12.29% error (1st place with extra data)
• Object detection from video (VID)
─ 4th place, 0.461 mAP (3rd place with extra data)
• Scene classification
─ 4th place, 17.98% error
• Object detection
─ 7th place, 0.446 mAP (4th place with extra data)
Object Localization — CLS
• Training
– Multiple CNN models with large diversity
• 7 * BN-Inception (32 Layers)
• 2 * MSRA-Net (22 Layers)
– Data augmentation
• Random crops, multi-scale, contrast and color jittering
Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift, Ioffe S, Szegedy C. 2015
Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification, He K, Zhang X, Ren S, et al. 2015
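The augmentation recipe above can be sketched as follows; the crop size, scale set, and jitter ranges here are illustrative defaults, not the submission's actual settings:

```python
import random

import numpy as np

def augment(img, crop_size=224, scales=(256, 288, 320)):
    """Illustrative random-crop + jitter augmentation (a sketch, not
    the exact Trimps pipeline). `img` is an HxWx3 float array in [0,1]."""
    # Multi-scale: resize the shorter side to a randomly chosen scale
    # (nearest-pixel resampling to stay dependency-free).
    scale = random.choice(scales)
    h, w, _ = img.shape
    ratio = scale / min(h, w)
    new_h, new_w = int(round(h * ratio)), int(round(w * ratio))
    rows = (np.arange(new_h) / ratio).astype(int).clip(0, h - 1)
    cols = (np.arange(new_w) / ratio).astype(int).clip(0, w - 1)
    img = img[rows][:, cols]
    # Random crop.
    top = random.randint(0, new_h - crop_size)
    left = random.randint(0, new_w - crop_size)
    crop = img[top:top + crop_size, left:left + crop_size]
    # Contrast jitter (scale around the mean) and per-channel color jitter.
    contrast = random.uniform(0.8, 1.2)
    color = np.random.uniform(0.9, 1.1, size=3)
    crop = (crop - crop.mean()) * contrast + crop.mean()
    return np.clip(crop * color, 0.0, 1.0)
```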
Object Localization — CLS
• Testing for single model
– Multi-scale dense crops
– Overfeat-style augmentation
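A minimal sketch of Overfeat-style multi-scale dense testing, assuming a `model` callable that maps a crop to a class-score vector; scales, crop size, and stride are illustrative:

```python
import numpy as np

def dense_crop_scores(img, model, crop=224, scales=(256, 320), stride=32):
    """Overfeat-style multi-scale dense testing (illustrative sketch):
    score crops on a dense grid at several scales, plus horizontal
    flips, and average the class scores."""
    scores = []
    h, w, _ = img.shape
    for s in scales:
        # Resize the shorter side to s (nearest-pixel resampling).
        ratio = s / min(h, w)
        rows = (np.arange(int(h * ratio)) / ratio).astype(int).clip(0, h - 1)
        cols = (np.arange(int(w * ratio)) / ratio).astype(int).clip(0, w - 1)
        scaled = img[rows][:, cols]
        sh, sw, _ = scaled.shape
        for top in range(0, sh - crop + 1, stride):
            for left in range(0, sw - crop + 1, stride):
                patch = scaled[top:top + crop, left:left + crop]
                scores.append(model(patch))
                scores.append(model(patch[:, ::-1]))  # horizontal flip
    return np.mean(scores, axis=0)
```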
Object Localization — CLS
• Testing for multi-model
– Scores Fusion (+1.07% accuracy)
$S = \sum_{i=1}^{N} W_i \cdot \mathrm{Score}_i, \quad \mathrm{Labels} = f_{\mathrm{top5}}(S)$
– Labels Fusion (+1.17% accuracy)
• Keep the top M labels per model; N models yield N*M labels
• Select the 5 most frequent labels among the N*M
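Both fusion schemes can be sketched as follows; the `weights` and `m` values are illustrative placeholders, not the tuned values from the submission:

```python
from collections import Counter

import numpy as np

def fuse_scores(scores, weights):
    """Scores fusion: weighted sum of the per-model class scores,
    then the top-5 labels of the fused score vector."""
    s = sum(w * np.asarray(sc) for w, sc in zip(weights, scores))
    return list(np.argsort(s)[::-1][:5])

def fuse_labels(scores, m=10):
    """Labels fusion: keep the top-M labels per model (N models give
    N*M labels), then select the 5 most frequent among them."""
    votes = Counter()
    for sc in scores:
        votes.update(int(i) for i in np.argsort(sc)[::-1][:m])
    return [label for label, _ in votes.most_common(5)]
```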
Object Localization — CLS
• Top-5 classification error (test set)
[Bar chart: top-5 classification error on the test set; values 0.153, 0.112, 0.067, 0.036, 0.046; Classification (rank #3)]
Object Localization — LOC
• Based on Fast R-CNN
– Pre-trained models: VGG16, VGG19, GoogLeNet
– Region proposals: EdgeBoxes + filtering (~500 proposals per image)
Object Localization — LOC
• Single model improvements
– Objectness loss
– Negative categories
– Bounding box voting
• Ensemble
[Bar chart: top-5 error on the val set; Baseline 14.25, Improved 13.58, Ensemble 12.29]
Object Localization — LOC
• Negative categories (training)
– Positive: IoU >= 0.5; Negative: 0.2 <= IoU < 0.5; Background: otherwise
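A minimal sketch of label assignment with the thresholds above; the `neg_` prefix used to name the negative categories is an assumption for illustration:

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def assign_label(proposal, gt_box, gt_class):
    """Training-label assignment with negative categories (sketch of
    the slide's thresholds): IoU >= 0.5 -> positive class,
    0.2 <= IoU < 0.5 -> that class's negative category, else background."""
    overlap = iou(proposal, gt_box)
    if overlap >= 0.5:
        return gt_class               # positive
    if overlap >= 0.2:
        return "neg_" + gt_class      # per-class negative category
    return "background"
```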
Object Localization — LOC
• Bounding box voting (testing)
For each category:
– Select the region b with the highest score
– Select regions R such that
$\mathrm{IoU}(b, R_i) \ge 0.5$ and $\mathrm{score}(R_i) \ge th$
– Vote using R together with b:
$\mathrm{Box} = \dfrac{\sum_{i=1}^{k} \mathrm{score}_i \cdot \mathrm{bbox}_i}{\sum_{i=1}^{k} \mathrm{score}_i}$
Object detection via a multi-region & semantic segmentation-aware CNN model, Gidaris S, Komodakis N. 2015
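The voting procedure can be sketched for a single category as follows; the score threshold value `th` is illustrative:

```python
import numpy as np

def box_vote(boxes, scores, th=0.01):
    """Bounding box voting (sketch of the slide's procedure, one class):
    take the highest-scored box b, gather the boxes R with
    IoU(b, R_i) >= 0.5 and score >= th, and return their
    score-weighted average box."""
    def iou(a, b):
        ix1, iy1 = np.maximum(a[:2], b[:2])
        ix2, iy2 = np.minimum(a[2:], b[2:])
        inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
        area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
        return inter / (area(a) + area(b) - inter)

    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    b = boxes[np.argmax(scores)]
    keep = [i for i in range(len(boxes))
            if scores[i] >= th and iou(b, boxes[i]) >= 0.5]
    w = scores[keep]
    return (w[:, None] * boxes[keep]).sum(axis=0) / w.sum()
```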
Object Localization — LOC
• Multi-model ensemble (testing)
– Bounding box voting (+0.3% vs best single model)
– Pick the most crowded box cluster, not the highest-scored box (+1.4%)
Object Localization — LOC
• Top-5 localization error (test set)
[Bar chart: top-5 localization error on the test set; values 0.335, 0.299, 0.253, 0.09, 0.123; Object Localization (rank #2)]
Scene Classification
• Dataset
– 8.1M training images, class-unbalanced
– Larger images: minimum dimension is 512
– Both background and foreground are important
Scene Classification
• Design
– Data sweeping
– Larger input size, deeper and wider network
– Multi-branch: whole image and part
Scene Classification
• Data sweeping
– Randomly sweep (subsample) the training data at each epoch
– Speeds up training without accuracy decline
$s(n) = \begin{cases} \cos(\lambda n), & n \in [0, l] \\ c, & n \in [l+1, K] \end{cases}$
Stochastic Data Sweeping for Fast DNN Training, Deng W, Qian Y, et al. 2013
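A minimal sketch of per-epoch data sweeping with the schedule above; `l`, `lam`, and `c` are illustrative values, not the submission's:

```python
import math
import random

def sweep_epoch(dataset, epoch, l=10, lam=0.05, c=0.6):
    """Stochastic data sweeping sketch: each epoch trains on a random
    subset of the data; the keep rate follows the slide's schedule
    s(n) = cos(lambda * n) for n <= l, and the constant c afterwards."""
    rate = math.cos(lam * epoch) if epoch <= l else c
    return [x for x in dataset if random.random() < rate]
```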
Scene Classification
• Top-5 error (test set)
[Bar chart: top-5 error on the test set; values 0.169, 0.174, 0.176, 0.179, 0.193; Scene Classification (rank #4)]
Object Detection
• Pre-train model
– VGG16 and VGG19, with pooling layers replaced by convolutions
– COCO data used in some models
• Negative categories
– Largest improvement on the val set: +3.2% mAP
• Objectness
– Largest improvement on the val set: +2.2% mAP
• Bounding box voting
Object Detection
• Results
[Bar chart: mAP; values 0.226, 0.439, 0.621, 0.446; Object Detection (rank #7)]
* Larger test set this year
Object Detection from Video
• From 200 object classes down to 30
– Reuse models from the object detection task
– Fine-tune on the video data
Object Detection from Video
• Results
[Bar chart: mAP; values 0.678, 0.515, 0.487, 0.461, 0.421; Object Detection from Video (rank #4)]