
Stitcher: Feedback-driven Data Provider for Object Detection

Yukang Chen1∗, Peizhen Zhang2∗, Zeming Li2, Yanwei Li1, Xiangyu Zhang2, Gaofeng Meng1, Shiming Xiang1, Jian Sun2, Jiaya Jia3

1 NLPR, Institute of Automation, Chinese Academy of Sciences  2 Megvii Technology  3 The Chinese University of Hong Kong

Abstract. Object detectors commonly vary in quality according to scales, where the performance on small objects is the least satisfying. In this paper, we investigate this phenomenon and discover that in the majority of training iterations, small objects contribute barely to the total loss, causing poor performance with imbalanced optimization. Inspired by this finding, we present Stitcher, a feedback-driven data provider, which aims to train object detectors in a balanced way. In Stitcher, images are resized into smaller components and then stitched into the same size as regular images. Stitched images inevitably contain smaller objects, which is beneficial with our core idea: to exploit the loss statistics as feedback to guide the next-iteration update. Experiments have been conducted on various detectors, backbones, training periods, datasets, and even on instance segmentation. Stitcher steadily improves performance by a large margin in all settings, especially for small objects, with nearly no additional computation in both training and testing stages. Code and models will be publicly available.

Keywords: Loss Feedback, Scale Balance, Small Objects

1 Introduction

Deep object detector performance varies in a complicated way. A key challenge in deep networks for object detection is the large scale variation, which often manifests as difficulty in detecting small objects. For instance, in the result of FPN with ResNet-50 [11,13] released in [1] (AP: 36.7%, APs: 21.1%, APm: 39.9%, APl: 48.1%), accuracy on small objects is nearly half of that on middle-size or large objects, due to imbalanced optimization on various scales. This problem depresses the overall performance and hinders the generality of object detectors in diverse scenes. A reasonable explanation is that supervisory signals on small objects are insufficient.

Supervisory signals can be naturally reflected by training loss. We study the distributions of loss for different scales over iterations and show them in Fig. 1. The statistics are computed on a common baseline, i.e., Faster R-CNN with

∗ Equal contribution. Email: [email protected]

arXiv:2004.12432v1 [cs.CV] 26 Apr 2020


[Fig. 1 chart data: proportion of iterations (y-axis) versus ratio of small object loss (x-axis), for Baseline and Stitcher.]

[Fig. 2 chart data (training hours / AP), for Baseline, Multi-scale and Stitcher:

                  Baseline        Multi-scale     Stitcher
ResNet-50 (1x)    8.7 / 36.71     10.5 / 37.25    9.0 / 38.64
ResNet-101 (1x)   11.5 / 39.10    14.2 / 39.70    11.7 / 40.78
ResNet-50 (2x)    17.2 / 37.70    20.5 / 39.12    17.5 / 39.90
ResNet-101 (2x)   23.4 / 39.77    28.5 / 41.65    23.5 / 42.11]

Fig. 1: Ratio of loss from small objects across training iterations. For the Faster R-CNN baseline, in more than 50% of iterations, small objects contribute less than 10% to the overall loss. The loss distribution gets balanced when Stitcher is adopted.


Fig. 2: Accuracy versus training hours on COCO with Faster R-CNN. In various settings, Stitcher improves performance by about 2% AP with nearly no extra training time, while multi-scale training is inferior and requires more cost.

ResNet-50 [11] and FPN [13] as backbones on the Microsoft COCO dataset [14]. Objects in small, middle and large scales are defined according to their sizes. Specifically, in iteration t, the loss for small objects, L^t_s, accounts for ground-truth boxes whose sizes are smaller than 1024. r^t_s denotes the ratio of L^t_s against the total loss L^t in the current iteration. It is noticeable that in more than 50% of iterations, r^t_s is negligible (less than 0.1), as shown in Fig. 1. Lack of knowledge of small objects leads to the imbalance and the corresponding poor performance.

In this paper, we propose Stitcher, a feedback-driven data provider that enhances object detection performance by utilizing training loss in a feedback manner. In Stitcher, we introduce stitched images that have the same size as the regular ones and consist of resized smaller components. The core idea is to leverage loss statistics in the current iteration as feedback to adaptively determine the input choice for the next. As illustrated in Fig. 3, if the ratio of loss for small objects r^t_s is negligible in the current iteration t, the input to iteration t+1 is the stitched images, where smaller objects are inevitably more abundant. Otherwise, the input remains regular images under the default setting. Image stitching mitigates the image-level imbalance in the input feature space over the primitive data distribution. Simultaneously, the feedback paradigm alleviates the unfair optimization. Both contribute to more balanced object detection.

In experiments, we verify the effectiveness of Stitcher on various detection frameworks (Faster R-CNN, RetinaNet), backbones (ResNets, ResNeXts), training schedules (1×, 2×), datasets (COCO, VOC) and even on instance segmentation. In all these settings, our method improves accuracy by a large margin, as shown in Fig. 2, especially for small objects. As Stitcher also involves images


Fig. 3: The pipeline illustration. Whether to use stitched images in the next iteration is adaptively determined by the current feedback: stitched images are fed when r^t_s < τ (Y), regular images otherwise (N).

in multiple scales, we also compare Stitcher with multi-scale training. The latter requires longer training time but its performance is inferior.

Stitcher can be easily incorporated into any detector. It imposes nearly no extra burden during both training and inference. Additional costs are only due to loss statistics computation and image stitching, which are almost negligible compared to the much heavier forward and backward propagation.

In the following, we first analyze existing problems in Section 2 and then introduce our method in Section 3. In Section 4, the relation of Stitcher to previous literature is discussed. Experimental results are presented in Section 5.

2 Problem Analysis

Object detector performance varies dramatically across scales. In this section, we provide explanations for this phenomenon with experimental analysis.

2.1 Image Level Analysis

Quantitative Analysis. Small objects are very common in natural images, while their distributions are not predictable across different images. As illustrated in Table 1, 41.4% of objects in the COCO training set are small, much more than those of the other two scales. However, only 52.3% of images contain small objects. In contrast, the proportions of images containing medium and large objects are 70.7% and 83.0% respectively. Put differently, in some images most objects are small, while nearly half of the images contain no small objects at all. Such severe imbalance hampers the training process.

Qualitative Analysis. In regular images, objects may be blurred due to photographic issues, e.g., out of focus or motion blur. If regular images are resized to be smaller, medium-sized or large objects inside would also become smaller ones, whose contours and details, however, remain clearer than those of originally small objects. In Fig. 4, the ball, which is resized from a larger scale, is clearer than the kite, although they have similar sizes of 29 × 31 and 30 × 30 respectively.


Fig. 4: Qualitative comparison between natural small objects (kite) and those (ball) within smaller images obtained by re-scaling. The resized ball is visually clearer than the kite in texture, although they share similar sizes of 29 × 31 and 30 × 30 respectively.

Table 1: Distribution of objects in different scales on the MS-COCO training set

                              Small   Mid    Large
Ratio of total boxes (%)       41.4   34.3   24.3
Ratio of images included (%)   52.3   70.7   83.0

The above analysis inspires the component stitching in Section 3.1.

2.2 Training Level Analysis

In this section, the scale issue at the training level is analyzed through loss statistics. We use the trainval35k split of the COCO dataset for training and minival for evaluation. ImageNet pre-trained ResNet-50 with FPN serves as the backbone. We train Faster R-CNN with the 1× training period (90k iterations). All training settings, including learning rate, momentum, weight decay and batch size, directly follow the default values. During training, we record the loss over three scales in each training iteration. According to these statistics, Fig. 1 illustrates the loss distributions over various scales.

Small objects have uneven distributions over images, which consequently makes the training suffer from a further imbalance problem. Even if small objects are included in some images, they may still be ignored during training. Fig. 1 illustrates that, in more than 50% of iterations, small objects account for less than 10% of the total loss. Training losses are dominated by large and medium-sized objects. Thus, the supervisory signals for small objects are insufficient, which severely harms the small object accuracy and even the overall performance. This phenomenon motivates the selection paradigm in Stitcher, which is introduced in Section 3.2.
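The per-scale bookkeeping described above can be sketched as follows. The helper names and the per-box loss interface are illustrative assumptions, not the authors' code; boxes are bucketed by approximate area using the COCO protocol (small < 32², medium < 96²).

```python
# Sketch of per-scale loss bookkeeping (illustrative interface).
# Boxes are bucketed by approximate area w * h following the COCO protocol.

SMALL_MAX = 32 ** 2    # 1024
MEDIUM_MAX = 96 ** 2   # 9216

def scale_of(w, h):
    """Bucket a ground-truth box by its approximate area."""
    area = w * h
    if area < SMALL_MAX:
        return "small"
    if area < MEDIUM_MAX:
        return "medium"
    return "large"

def loss_ratios(per_box_losses, box_sizes):
    """Fraction of the total loss contributed by each scale in one iteration."""
    totals = {"small": 0.0, "medium": 0.0, "large": 0.0}
    for loss, (w, h) in zip(per_box_losses, box_sizes):
        totals[scale_of(w, h)] += loss
    grand = sum(totals.values()) or 1.0
    return {k: v / grand for k, v in totals.items()}
```

Recording these ratios every iteration is what produces the distribution plotted in Fig. 1.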


Fig. 5: Regular images and stitched images. (a) A batch of regular images as training inputs, with shape (n, c, h, w); (b) a batch of stitched images, with shape (n, c, h, w), in one of which quadruple small images are stitched along the spatial dimension; (c) a batch of stitched images, with shape (kn, c, h/√k, w/√k), where images are concatenated along the batch dimension n. We set k = 4 for visualization.

3 Stitcher

According to the previous analysis, imbalance on different scales stems from the distributions at the image level and gets worse in the training stage. Inspired by this finding, we delve into this ubiquitous issue in object detection: how to relieve the scale imbalance, especially for small objects. To this end, a novel training strategy, called Stitcher, is presented in this section. It consists of two different stages at the image and training levels, corresponding to the analysis in Section 2.

3.1 Image Level Operations - Component Stitching

Referring to the statistics exhibited in Table 1, nearly half of the images in the training set contain no small objects. Such an image-granularity imbalance severely disturbs the mini-batch level optimization process. To resolve this, we propose Stitcher, a self-adaptive data generator which produces either stitched images or regular images dynamically, guided by the penalization signals.

Given a handful of images with resolutions resized to be unified, a stitched image is constructed by scaling and collaging k (k = 1, 2², 3², ...) natural images together such that the aspect ratio of each component is preserved, i.e., each component has size (h/√k, w/√k). Keeping the aspect ratio is acknowledged to retain original object properties. Trivially, a natural image reduces to a stitched image when taking k as 1. Specifying a stitching order k of 4, we visualize an example in Fig. 5 (b). The scale imbalance of an image batch (acting as a minimal training entity) gets alleviated with the assistance of image stitching, which manufactures more small objects. Since stitched images have identical size to regular ones, no additional computation is introduced in network propagation. Unless specified, experiments of Stitcher are conducted with images as in Fig. 5 (b).
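A minimal pure-Python sketch of spatial stitching for k = 4: each component is nearest-neighbor downscaled to (h/2, w/2), preserving its aspect ratio, and the four results are tiled into one (h, w) image. Single-channel images are represented as nested lists here for illustration; a real implementation would operate on image tensors with a library resize.

```python
# Spatial stitching sketch for k = 4 (single-channel images as nested lists).

def downscale_half(img):
    """Nearest-neighbor downscale a 2D image by a factor of 2 per dimension."""
    return [[img[2 * i][2 * j] for j in range(len(img[0]) // 2)]
            for i in range(len(img) // 2)]

def stitch4(imgs):
    """Stitch four equally sized images into one image of the original size."""
    a, b, c, d = (downscale_half(im) for im in imgs)
    top = [ra + rb for ra, rb in zip(a, b)]        # left/right halves, top row
    bottom = [rc + rd for rc, rd in zip(c, d)]     # left/right halves, bottom row
    return top + bottom
```

The stitched output has the same spatial size as each input, so the network sees the usual input shape while every object inside has been shrunk by a factor of 2 per side.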

The implicit square constraint on the stitching number (k = 1², 2², 3², ...) remains a concern. To make it more flexible, we provide another version


of the implementation where images are stitched along the batch dimension and the batch tensor shape becomes (kn, c, h/√k, w/√k). Overall, the tensor pixel amount remains unchanged, whereas the stitching number is no longer restricted to square numbers. A corresponding example is illustrated in Fig. 5 (c). For the detailed choice of stitching order k, please refer to Section 5.4. Stitcher provides data with a consistent tensor volume but a dynamic batch size, which generalizes conventional multi-scale training (fixed batch size).

[Fig. 6 chart data: per-scale loss proportions (Small / Mid / Large) sampled every 10k iterations from 0k to 90k; with Stitcher the small-object proportion stays around 0.15 to 0.19, versus roughly 0.06 to 0.10 for the baseline.]

Fig. 6: Loss distribution over different scales for Stitcher and Faster R-CNN Res-50-FPN.

[Fig. 7 chart data, AP and APs (%) over training iterations:

Iter   AP (Stitcher)   APs (Stitcher)   AP (Baseline)   APs (Baseline)
10k    24.02           14.39            22.12           11.86
20k    27.27           16.30            26.11           13.79
30k    29.44           16.95            27.62           15.13
40k    29.20           17.48            28.61           15.59
50k    30.69           19.74            29.84           17.69
60k    31.41           19.06            30.26           17.92
70k    37.53           22.86            36.05           20.37
80k    38.01           23.64            36.57           20.74
90k    38.64           24.40            36.71           21.11]

Fig. 7: Performance curve over training iterations.

3.2 Training Level Module - Selection Paradigm

Stitched images potentially contain more small objects, but the right timing to exploit them varies. Recalling Fig. 1, in more than 50% of iterations, the loss from small objects accounts for less than 10%. To avoid such an undesirable trend, we propose a rectified paradigm, determining the input of the next iteration based upon feedback from the current pass. If the loss from small objects is negligible (below a threshold τ) in iteration t, we assume knowledge of small objects is far from enough. To compensate for the lack of information, we adopt the stitched images as input to iteration t + 1. Otherwise, regular images are picked.

To calculate the ratio of small-object loss among all scales, we use the following procedure. Strictly speaking, the scale of an object is determined by its mask area, which is only available in segmentation tasks. However, for generic object detection, ground-truth masks are unavailable. Thus, we use the box area instead. As in Eq. (1), for object o, its area a_o is approximately represented as the box area h_o × w_o. L^t_s denotes the loss from small objects, whose area a_o is no higher than A_s (1,024 in the protocol of COCO). The proportion of small objects is obtained with Eq. (3) as

a_o ≈ h_o × w_o    (1)

L^t_s = Σ_{a_o < A_s} L^t_o    (2)


r^t_s = L^t_s / L^t    (3)

The ratio r^t_s serves as a latent feedback to guide the learning of the next iteration. Such a strategy balances the loss distribution for better optimization. We visualize the comparison of loss distributions in Fig. 6 and the performance difference in Fig. 7. We measure the statistics every 10k iterations and illustrate them with smoothing. It reveals that loss distributions over various scales get more balanced with our Stitcher, which leads to better accuracy.
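Eqs. (1) to (3) and the selection rule can be condensed into a short sketch. The per-box loss interface and the default threshold value are illustrative assumptions for this example, not a specification from the paper.

```python
# Feedback rule sketch: the small-object loss ratio r_s^t computed in
# iteration t decides the input type for iteration t + 1.

A_S = 32 ** 2  # small-object area threshold in the COCO protocol (1024)

def small_loss_ratio(per_box_losses, box_areas):
    """r_s = L_s / L: share of the total loss from boxes with area < A_S."""
    total = sum(per_box_losses)
    if total == 0:
        return 0.0
    small = sum(l for l, a in zip(per_box_losses, box_areas) if a < A_S)
    return small / total

def use_stitched_next(per_box_losses, box_areas, tau=0.1):
    """Feedback rule: stitch the next batch when the small-object share < tau."""
    return small_loss_ratio(per_box_losses, box_areas) < tau
```

In a training loop, `use_stitched_next` would be evaluated once per iteration and passed to the data provider, which then emits either a stitched or a regular batch.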

3.3 Time Complexity

Stitcher modifies only the training process and thus imposes no burden on inference time. Here we elaborate on its time complexity during training.

Stitcher is composed of component stitching and the selection paradigm. In the stitching part, nearest neighbor interpolation is utilized to down-scale images. As the maximum side of original images in COCO is 640 pixels, the interpolation operation requires no more than 2 × 640² ≈ 0.8M multiplications. This is negligible compared to the forward/backward propagation of detection networks. For example, it costs ResNet-50 3.8G FLOPs (multiplications and additions) to process a 224 × 224 image. In the selection paradigm, we need to calculate the area of each selected ground-truth box. However, only a few boxes remain each time. This step costs negligible computation.
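The figure quoted above follows from a simple per-pixel cost model, sketched here as a back-of-envelope helper (not a benchmark): nearest-neighbor interpolation touches each output pixel roughly once.

```python
def interp_ops_upper_bound(side=640, images=2):
    """Rough multiplication count for nearest-neighbor resizing `images`
    square images of size side x side: about one lookup per output pixel."""
    return images * side * side
```

With the defaults, this gives 2 × 640² = 819,200 ≈ 0.8M operations, dwarfed by the ~3.8 GFLOPs a ResNet-50 forward pass spends on a single 224 × 224 image.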

In addition to the theoretical analysis, we also measure the running time of Stitcher in practice. If stitched images are demanded, an iteration costs approximately 0.02 seconds extra beyond regular training. In terms of total training time, it takes about 8.7 hours to train the baseline, Faster R-CNN with a ResNet-50-FPN backbone. When timed on the same GPUs, Stitcher spends about a quarter of an hour longer. This gap shrinks when larger backbones are applied.

4 In Context of Literature

Our work is related to previous ones in several aspects. We discuss the relations and, mainly, the differences in the following.

Multi-scale Image Pyramid [2]. This is a traditional and intuitive way to remedy large scale variation. It has been popular since the era of hand-crafted features, e.g., SIFT [16] and HOG [6]. Nowadays, deep CNN-based object detectors can also benefit from multi-scale training and testing, where images are randomly resized into different resolutions. Features learned in this way are thus more robust to scale variation.

Similar to multi-scale training, Stitcher is also devised to render features more robust to scale variation. However, there are two essential differences. (1) Stitcher demands neither image pyramid construction nor input size adjustment. Stitched images still have the same size as regular ones, which hugely relieves the computation burden that is otherwise inevitable in image pyramids. (2) Object scales in Stitcher are adaptively determined by the loss distribution over training iterations; in image pyramids, by contrast, images of different sizes are randomly chosen in each iteration. This leads to notably better performance than multi-scale training, as demonstrated by the experimental results in Section 5.2.

SNIP and SNIPER [17,18]. These methods are advanced versions of the image pyramid strategy. SNIP [17] was proposed to normalize the scales of objects in multi-scale training: Regions of Interest (RoIs) falling outside the specified range at each scale are considered invalid. SNIPER [18] uses patches as training data instead of regular images; it crops selected regions around ground-truth instances as positive chips and samples background as negative chips.

The operation of Stitcher is essentially different from that of SNIPER. The crop operation in SNIPER is much more complicated, requiring the overlaps (IoU) between ground-truth boxes and crops to be calculated for label assignment, whereas the stitch operation in Stitcher only involves interpolation and concatenation. Besides, as SNIP and SNIPER rely on multi-scale testing, they still suffer from increased inference time. For an ablation on the effect of testing strategies, we compare their performance with Stitcher in Section 5.2.

Mixup [20]. The mixup technique was first introduced in image classification [20] to alleviate adversarial perturbation. Afterward, it was evaluated in object detection [21]. It blends image pixels and merges ground-truth labels with an empirical mixup ratio for adjustment. In terms of operations, stitching and mixup are both concise.

In terms of performance, Stitcher is superior to mixup. As shown in [21], mixup improves the baseline (Faster R-CNN/ResNet-101/2×) by 0.2% AP without mixup during pre-training, and by 1.2% AP with mixup in both the pre-training and fine-tuning phases. In contrast, Stitcher improves the same baseline by 2.3% AP without being applied to the pre-training stage.

Auto Augmentation [4]. This method learns augmentation strategies from data [4,22]. The search space contains several pre-defined augmentation policies, and search algorithms, e.g., reinforcement learning, are used to optimize them. Afterward, the top-performing policies are equipped to re-train networks.

Auto Augmentation [22] costs thousands of GPU days in an offline search process, which completely diverges from our goal. The performance of Stitcher, nevertheless, is similar to that of Auto Augmentation [22] (+1.6% by Auto Augmentation vs. +1.7% by Stitcher on Faster R-CNN with a ResNet-50 backbone). Also, auto augmentation methods usually involve much more complicated transformations, e.g., color operations (equalization, brightness) and geometric operations (rotation, shearing).


Scale-aware Network. Instead of manipulating images, another line of effort on handling scale variation is to design scale-aware neural networks, which usually fall into one of two categories: feature-pyramid-based and dilation-based methods. In terms of feature pyramid methods, SSD [15] detects objects of different scales, taking as input the feature maps at the corresponding layers. FPN [13] introduces lateral connections to build high-level semantic feature maps at all scales. On the other hand, dilation-based methods adapt the receptive fields to objects. Deformable convolutional networks (DCN) [5] generalize dilated convolution to adaptive receptive field learning. Trident network [12] constructs multiple branches with various dilations to generate scale-specific features.

5 Experiments

5.1 Implementation Details

Experiments are performed on the COCO detection dataset [14], which involves 80 object categories. We train networks on the union of the original training set (80k images) and a 35k-image subset of the original validation set (trainval35k), and evaluate on a 5k subset of the validation images (minival).

Following the common practice [10], backbone networks are first pre-trained for image classification on the ImageNet dataset [7]. Subsequently, these models, further equipped with head sub-networks, are fine-tuned on COCO with object detection supervision. For convenience, we adopt publicly available pre-trained models^1. Our implementation is based on maskrcnn-benchmark^2.

In the training stage, input images are resized such that the shorter side has 800 pixels. We train networks on 8 GPUs (RTX 2080 Ti) using synchronized SGD. Unless otherwise specified, there are 2 images per GPU in each mini-batch, for a whole batch size of 16. Training settings directly follow the commonly used configurations: weight decay 0.0001, momentum 0.9, and initial learning rate 0.02. In the 1× training period, there are 90k iterations in total, and the learning rate is divided by 10 at 60k and 80k. In the 2× training period, we double the total iteration number to 180k, and the learning rate decay points move correspondingly to 120k and 160k.
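The step schedule above can be written out explicitly. The helper below is our own sketch of the described configuration (base LR 0.02, divided by 10 at the stated milestones), not code from the paper's release.

```python
BASE_LR = 0.02

def lr_at(iteration, period=1):
    """Learning rate under the 1x/2x schedules described above.

    1x: 90k iterations, decay at 60k and 80k.
    2x: 180k iterations, decay at 120k and 160k
    (milestones scale linearly with the period).
    """
    m1, m2 = 60_000 * period, 80_000 * period
    if iteration < m1:
        return BASE_LR
    if iteration < m2:
        return BASE_LR / 10
    return BASE_LR / 100
```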

For evaluation, we adopt the standard mean average precision (mAP) metric. Average precisions for objects of small, medium, and large sizes (APs, APm, and APl) [14] are also reported.

5.2 Evaluation on Object Detection

In this section, we compare Stitcher with common baselines and its competitors, including multi-scale training and SNIP/SNIPER.

Comparison with common baselines. We evaluate the effect of Stitcher with different detectors (Faster R-CNN and RetinaNet) and various training periods

1 https://dl.fbaipublicfiles.com/detectron/ImageNetPretrained/
2 https://github.com/facebookresearch/maskrcnn-benchmark


Table 2: Comparison with common baselines on Faster R-CNN

period  backbone     method    AP           APs          APm          APl
1×      Res-50-FPN   Baseline  36.7         21.1         39.9         48.1
1×      Res-50-FPN   Stitcher  38.6 (+1.9)  24.4 (+3.3)  41.9 (+2.0)  49.3 (+1.2)
1×      Res-101-FPN  Baseline  39.1         22.6         42.9         51.4
1×      Res-101-FPN  Stitcher  40.8 (+1.7)  25.8 (+3.3)  44.1 (+1.2)  51.9 (+0.5)
2×      Res-50-FPN   Baseline  37.7         21.6         40.6         49.6
2×      Res-50-FPN   Stitcher  39.9 (+2.2)  25.1 (+3.5)  43.1 (+2.5)  51.0 (+1.4)
2×      Res-101-FPN  Baseline  39.8         22.9         43.3         52.6
2×      Res-101-FPN  Stitcher  42.1 (+2.3)  26.9 (+4.0)  45.5 (+2.2)  54.1 (+1.5)

Table 3: Comparison with common baselines on RetinaNet

period  backbone     method    AP           APs          APm          APl
1×      Res-50-FPN   Baseline  35.7         19.5         39.9         47.5
1×      Res-50-FPN   Stitcher  37.8 (+2.1)  22.1 (+2.6)  41.6 (+1.7)  48.9 (+1.4)
1×      Res-101-FPN  Baseline  37.7         20.6         41.8         50.8
1×      Res-101-FPN  Stitcher  39.9 (+2.2)  24.7 (+4.1)  44.1 (+2.3)  51.8 (+1.0)
2×      Res-50-FPN   Baseline  36.8         20.2         40.0         49.7
2×      Res-50-FPN   Stitcher  39.0 (+2.2)  23.4 (+3.2)  42.9 (+2.9)  51.0 (+1.2)
2×      Res-101-FPN  Baseline  38.8         21.1         42.1         52.4
2×      Res-101-FPN  Stitcher  41.3 (+2.5)  25.4 (+4.3)  45.1 (+3.0)  54.0 (+1.6)

(1× and 2×), and show the results in Tables 2 and 3. In all these cases, accuracy steadily improves when Stitcher is leveraged, especially for small objects. In detail, we observe the following.
(1) 2× training yields a larger increase than 1× training (+1.9% vs. +2.2% AP on 1× and 2× training respectively) on Faster R-CNN with ResNet-50.
(2) In most cases, the performance gain does not decay as backbones are enlarged from ResNet-50 [11] to ResNet-101, except for Faster R-CNN with 1× training.
In short, the experimental results demonstrate that Stitcher brings consistent improvement over the baselines and is robust to various settings (backbones, detection heads, and training periods).

Comparison with Multi-scale Training. Table 4 presents the comparison with the multi-scale training technique, i.e., randomly selecting a scale from [600, 700, 800, 900, 1000] to resize the shorter side of images. Stitcher yields better performance than multi-scale training. Apart from the performance gain, the following observations also deserve discussion.
(1) In terms of accuracy, the advantages of Stitcher over multi-scale training derive largely from small scales; the two have approximately equal ability in detecting large objects. This contrast validates our design purpose of mainly benefiting small-object detection by image stitching.


Table 4: Comparison with multi-scale training on Faster R-CNN

period  backbone     method       hours  AP           APs          APm          APl
1×      Res-50-FPN   Multi-scale  10.5   37.2         21.6         40.3         48.6
1×      Res-50-FPN   Stitcher      9.0   38.6 (+1.4)  24.4 (+2.8)  41.9 (+1.6)  49.3 (+0.7)
1×      Res-101-FPN  Multi-scale  14.2   39.7         23.6         43.3         51.3
1×      Res-101-FPN  Stitcher     11.7   40.8 (+1.1)  25.8 (+2.2)  44.1 (+0.8)  51.9 (+0.6)
2×      Res-50-FPN   Multi-scale  20.5   39.1         23.5         42.2         50.8
2×      Res-50-FPN   Stitcher     17.5   39.9 (+0.8)  25.1 (+1.6)  43.1 (+0.9)  51.0 (+0.2)
2×      Res-101-FPN  Multi-scale  28.5   41.6         25.5         45.3         54.1
2×      Res-101-FPN  Stitcher     23.5   42.1 (+0.5)  26.9 (+1.4)  45.5 (+0.2)  54.1 (+0.0)

Table 5: Comparison with SNIP and SNIPER on Faster R-CNN

backbone    method    AP    AP50  AP75  APs   APm   APl
Res-50-C4   SNIP      43.6  65.2  48.8  26.4  46.5  55.8
Res-50-C4   SNIPER    43.5  65.0  48.6  26.1  46.3  56.0
Res-50-C4   Stitcher  44.2  64.6  48.4  28.7  47.2  58.3
Res-101-C4  SNIP      44.4  66.2  49.9  27.3  47.4  56.9
Res-101-C4  SNIPER    46.1  67.0  51.6  29.6  48.9  58.1
Res-101-C4  Stitcher  46.9  67.5  51.4  30.9  50.5  60.9

(2) Stitcher is computationally more economical than multi-scale training. Both are trained on the same GPUs (RTX 2080 Ti), yet for the same training period, multi-scale training spends more time than Stitcher.

Comparison with SNIP and SNIPER. As shown in Table 5, we compare Stitcher with SNIP and SNIPER^3 on Faster R-CNN with ResNet-50/101. Stitcher performs slightly better. Both SNIPER and Stitcher can be viewed as forms of multi-scale training, yet there are distinct differences. First, Stitcher is simpler to implement: SNIPER requires label assignment, valid-range tuning, and positive/negative chip selection. Second, Stitcher is feedback-driven, so the optimization process focuses more on its shortcomings.

Evaluation on Large Backbones. Table 6 shows the improvement from Stitcher on large backbones, e.g., ResNeXt-101 [19], ResNet-101 with DCN [5], and ResNeXt-32×8d-101 with DCN [5]. Experiments are conducted on Faster R-CNN with the 1× training period. Even upon these higher baselines, Stitcher still increases performance by 1.0% to 1.5% AP, which demonstrates its robustness in complicated cases.

3 For fair comparison, we use the same augmentation strategies as SNIP and SNIPER, which include deformable convolution, multi-scale testing, and soft-NMS [3].

Table 6: Evaluation on large backbones on Faster R-CNN

Backbone           method    AP           AP50  AP75  APs   APm   APl
ResNeXt-101        Baseline  41.6         63.8  45.3  24.8  45.1  53.3
ResNeXt-101        Stitcher  43.1 (+1.5)  65.6  47.4  28.0  46.7  54.2
ResNet-101 + DCN   Baseline  42.3         64.3  46.3  24.8  46.1  55.7
ResNet-101 + DCN   Stitcher  43.3 (+1.0)  65.6  47.2  27.1  47.0  56.0
ResNeXt-101 + DCN  Baseline  44.1         66.5  48.4  26.8  47.5  57.8
ResNeXt-101 + DCN  Stitcher  45.4 (+1.3)  68.0  49.7  29.4  48.8  58.5

Table 7: Evaluation on longer training periods on Faster R-CNN

method    period  AP    AP50  AP75  APs   APm   APl
Baseline  1×      36.7  58.4  39.6  21.1  39.8  48.1
Baseline  2×      37.7  59.2  41.0  21.6  40.6  49.6
Baseline  4×      37.3  58.1  40.1  20.3  39.6  50.1
Baseline  6×      35.6  55.9  38.4  19.8  37.7  47.6
Stitcher  1×      38.6  60.5  41.8  24.4  41.9  49.3
Stitcher  6×      40.4  62.5  44.2  26.1  43.1  51.5

Evaluation on Longer Training Periods. Table 7 shows the evaluation on longer training periods on Faster R-CNN with a ResNet-50-FPN backbone. For 6× training, the performance of the baseline is degraded by over-fitting, while Stitcher still maintains a promising accuracy. The composition of stitched images is not fixed, which enriches data patterns and prevents over-fitting.

In addition, a reasonable question is whether the improvement from Stitcher is caused by an increased number of trained instances. The maximum number of instances involved in 1× Stitcher is no more than that in the 4× baseline, yet 1× Stitcher outperforms the baseline under any training period. This verifies that this factor is not the main reason for the improvement.

Evaluation on PASCAL VOC. Although Stitcher is inspired by findings on COCO, it is also effective on other datasets. We evaluate Stitcher on PASCAL VOC [8]. Following the protocol in [9], models are trained on the union of VOC 2007 trainval and VOC 2012 trainval, and evaluated on VOC 2007 test. A total of 24k iterations are performed on 8 GPUs. The learning rate is set to 0.01 for the first two-thirds of the iterations and 0.001 for the remaining one-third. Experiments are conducted on Faster R-CNN with ResNet-50 and FPN. As shown in Table 8, Stitcher brings a 1.1% mAP improvement.

Table 8: Evaluation on the PASCAL VOC dataset on Faster R-CNN

          mAP   plane  bike  bird  boat  bottle  bus   car   cat   chair  cow
Baseline  80.5  80.6   85.8  79.0  74.0  71.7    86.6  88.7  88.6  62.6   87.7
Stitcher  81.6  87.8   87.1  78.3  70.7  71.5    87.2  88.7  88.9  64.5   87.9

          table  dog   horse  mbike  person  plant  sheep  sofa  train  tv
Baseline  71.9   88.1  88.7   86.8   86.1    56.8   85.0   78.6  84.8   78.3
Stitcher  78.2   87.8  87.8   87.3   86.0    58.7   85.1   78.4  87.6   81.7

Evaluation on Instance Segmentation. Beyond object detection, we also apply Stitcher to instance segmentation. Experiments are conducted on the COCO instance segmentation track [14]. We report COCO mask AP on the minival split. Models are trained for the 1× training period, i.e., 90k iterations, with the learning rate divided by 10 at the 60k-th and 80k-th iterations. Training settings, including learning rate, weight decay, momentum, and pre-trained weights, directly follow the default configuration. As shown in Table 9, with the assistance of Stitcher, mask AP increases by 0.8% on ResNet-50 and by 1.3% on ResNet-101.

Table 9: Evaluation on Mask R-CNN

backbone     method    AP           APs          APm          APl
Res-50-FPN   Baseline  34.3         15.8         36.7         50.5
Res-50-FPN   Stitcher  35.1 (+0.8)  17.0 (+1.2)  37.8 (+1.1)  51.4 (+0.9)
Res-101-FPN  Baseline  35.9         15.9         38.9         53.2
Res-101-FPN  Stitcher  37.2 (+1.3)  19.0 (+3.1)  40.3 (+1.4)  53.7 (+0.5)

5.3 Ablation Studies

In this section, we provide an empirical analysis of each component of Stitcher: the selection paradigm and the threshold τ. Ablation studies are conducted on Faster R-CNN with ResNet-50 and FPN in the 1× training period.

Selection Paradigm. To evaluate the selection paradigm in Stitcher, severalexperiments are set up for comparison as in Table 10.

- All stitched: stitched images are utilized in all iterations;

- All regular: regular images are always used (the common baseline);

- Random sample: stitched or regular images are randomly sampled;

- Input feedback: a simplified version of Stitcher, where the feedback ratio is calculated from the number of small objects in the input batch;

- Classification/Regression/Total loss feedback: selection is guided by the corresponding loss as feedback.

As shown in Table 10, if stitched images are exploited in every iteration, performance is unacceptable: mere image stitching does not work and brings no benefit. This reflects that the selection paradigm is indispensable in Stitcher. Random sampling can be viewed as a special version of multi-scale training; it performs better than the common baseline, but not by much. If the feedback ratio is based on the input instead of the loss, accuracy is still higher than the baseline but slightly inferior to Stitcher: input data cannot reflect the optimization process as well as the loss can, because small objects may still be ignored. These comparisons confirm the necessity of the selection paradigm in Stitcher. In addition, Stitcher achieves stable performance no matter which loss serves as feedback, around 38.5% to 38.6% AP. For convenience, we pick the regression loss as the common setting.

Table 10: Ablation study on the selection paradigm

Selection paradigm           AP    AP50  AP75  APs   APm   APl
No feedback
  All stitched               32.1  53.0  33.8  21.9  36.4  36.8
  All regular                36.7  58.4  39.6  21.1  39.8  48.1
  Random sample              37.8  59.5  41.2  23.6  40.7  46.7
With feedback
  Input feedback             38.1  60.0  41.2  23.1  41.3  49.1
  Classification feedback    38.5  60.6  41.9  23.9  41.6  48.8
  Regression feedback        38.6  60.5  41.8  24.4  41.9  49.3
  Total loss feedback        38.5  60.6  42.0  23.7  41.6  49.3

Fig. 8: Ablation study on threshold τ. (Plot data recovered from the source: Stitcher AP at τ = 0, 0.1, ..., 1.0 is 36.71, 38.64, 38.43, 38.08, 36.94, 34.82, 32.96, 32.10, 32.09, 31.82, 32.06; the baseline stays at 36.71 throughout.)

Table 11: Concatenation dimensions

Dimension  k  AP    AP50  AP75  APs   APm   APl
Spatial    4  38.6  60.5  41.8  24.4  41.9  49.3
Batch      2  38.3  60.2  41.7  22.9  41.3  49.5
Batch      3  38.5  60.5  42.0  22.9  41.8  49.1
Batch      4  38.6  60.6  42.1  23.4  41.5  50.3
Batch      5  38.7  60.8  41.9  23.7  41.6  50.1
Batch      6  38.6  60.7  42.1  23.5  41.5  49.5
Batch      7  38.4  60.5  41.8  23.6  41.5  48.6
Batch      8  38.3  60.6  41.6  24.3  41.3  49.0

Threshold Value. Only one hyper-parameter is introduced in Stitcher: the threshold value τ for selection. We study its impact in Fig. 8. When the threshold is set below 0.4, Stitcher performs better than the common baseline; otherwise, performance rapidly decays toward the 'All stitched' baseline. This observation verifies that setting the threshold to 0.1 strikes a good balance.

5.4 Concatenation along the Batch Dimension

To make Stitcher flexible, we propose aggregating resized images along the batch dimension instead of the original spatial dimension. Table 11 compares the results of these two implementations for various numbers k of images in one stitched batch. Two observations can be made.
(1) When k is 4, Stitcher achieves the same 38.6% AP with both stitching methods, which suggests that the two implementations are equivalent.
(2) With other k values, Stitcher still achieves similar performance.


The top performance is achieved when k is 5. Stitching along the batch dimension is robust to the variation of k, which equips Stitcher to fit different devices.
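As a rough illustration of why batch-dimension stitching adapts to arbitrary k: each of the k images can be resized so that the total pixel count matches one regular image, i.e., each side is scaled by 1/√k. The helper below is our own sketch under that assumption, not the paper's code.

```python
import math

def batch_stitch_shapes(height, width, k):
    """Target (h, w) for each of the k images stitched along the batch
    dimension: side scale 1/sqrt(k) keeps the total pixels of the k
    down-scaled images roughly equal to one regular image."""
    scale = 1.0 / math.sqrt(k)
    return [(round(height * scale), round(width * scale)) for _ in range(k)]
```

For instance, `batch_stitch_shapes(800, 800, 4)` gives four 400 × 400 targets, matching the spatial 2 × 2 stitching, while k = 5 gives five 358 × 358 targets, a configuration that spatial tiling cannot realize.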

6 Conclusion

In this paper, we have proposed a simple yet effective data provider for object detection, termed Stitcher, which steadily enhances performance by a significant margin. It can be easily applied to various detectors, backbones, training periods, and datasets, and even to other vision tasks such as instance segmentation. Moreover, it requires negligible additional computation during training and does not affect inference time. Extensive experiments have been conducted to verify its effectiveness. We hope Stitcher can serve as a common configuration in the future.

References

1. maskrcnn-benchmark. github.com/facebookresearch/maskrcnn-benchmark
2. Adelson, E.H., Anderson, C.H., Bergen, J.R., Burt, P.J., Ogden, J.M.: Pyramid methods in image processing. RCA Engineer 29(6), 33–41 (1984)
3. Bodla, N., Singh, B., Chellappa, R., Davis, L.S.: Soft-NMS - improving object detection with one line of code. In: ICCV. pp. 5562–5570 (2017)
4. Cubuk, E.D., Zoph, B., Mane, D., Vasudevan, V., Le, Q.V.: AutoAugment: Learning augmentation policies from data. In: CVPR. pp. 113–123 (2019)
5. Dai, J., Qi, H., Xiong, Y., Li, Y., Zhang, G., Hu, H., Wei, Y.: Deformable convolutional networks. In: ICCV. pp. 764–773 (2017)
6. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR. pp. 886–893 (2005)
7. Deng, J., Dong, W., Socher, R., Li, L., Li, K., Li, F.: ImageNet: A large-scale hierarchical image database. In: CVPR. pp. 248–255 (2009)
8. Everingham, M., Eslami, S.M.A., Gool, L.J.V., Williams, C.K.I., Winn, J.M., Zisserman, A.: The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision 111(1), 98–136 (2015)
9. Girshick, R.B.: Fast R-CNN. In: ICCV. pp. 1440–1448 (2015)
10. Girshick, R.B., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR. pp. 580–587 (2014)
11. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR. pp. 770–778 (2016)
12. Li, Y., Chen, Y., Wang, N., Zhang, Z.: Scale-aware trident networks for object detection (2019)
13. Lin, T., Dollar, P., Girshick, R.B., He, K., Hariharan, B., Belongie, S.J.: Feature pyramid networks for object detection. In: CVPR. pp. 936–944 (2017)
14. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollar, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV. pp. 740–755 (2014)
15. Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S.E., Fu, C., Berg, A.C.: SSD: Single shot multibox detector. In: ECCV. pp. 21–37 (2016)
16. Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91–110 (2004)
17. Singh, B., Davis, L.S.: An analysis of scale invariance in object detection - SNIP. In: CVPR. pp. 3578–3587 (2018)
18. Singh, B., Najibi, M., Davis, L.S.: SNIPER: Efficient multi-scale training. In: NeurIPS. pp. 9333–9343 (2018)
19. Xie, S., Girshick, R.B., Dollar, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: CVPR. pp. 5987–5995 (2017)
20. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: Beyond empirical risk minimization. In: ICLR (2018)
21. Zhang, Z., He, T., Zhang, H., Zhang, Z., Xie, J., Li, M.: Bag of freebies for training object detection neural networks. CoRR abs/1902.04103 (2019)
22. Zoph, B., Cubuk, E.D., Ghiasi, G., Lin, T., Shlens, J., Le, Q.V.: Learning data augmentation strategies for object detection. CoRR abs/1906.11172 (2019)

