
Deep Residual Networks: Deep Learning Gets Way Deeper

8:30-10:30am, June 19, ICML 2016 tutorial

Kaiming He, Facebook AI Research*

*as of July 2016. Formerly affiliated with Microsoft Research Asia

[Background figure: the ResNet-152 architecture, from a 7x7 conv (64, /2) stem with pooling, through stacked 1x1/3x3/1x1 bottleneck blocks (64/64/256, then 128/128/512, 256/256/1024, 512/512/2048 filters, downsampling by /2 between stages), to average pooling and a 1000-way fc layer.]

Overview

• Introduction
• Background: from shallow to deep
• Deep Residual Networks: from 10 layers to 100 layers; from 100 layers to 1000 layers
• Applications
• Q&A

Introduction


Deep Residual Networks (ResNets)
• "Deep Residual Learning for Image Recognition". CVPR 2016 (next week)

• A simple and clean framework for training "very" deep nets

• State-of-the-art performance for
  • image classification
  • object detection
  • semantic segmentation
  • and more…

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

ResNets @ ILSVRC & COCO 2015 Competitions

• 1st places in all five main tracks
  • ImageNet Classification: "ultra-deep" 152-layer nets
  • ImageNet Detection: 16% better than 2nd
  • ImageNet Localization: 27% better than 2nd
  • COCO Detection: 11% better than 2nd
  • COCO Segmentation: 12% better than 2nd

*improvements are relative numbers

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Revolution of Depth

[Chart: ImageNet Classification top-5 error (%) by year. ILSVRC'10 (shallow): 28.2; ILSVRC'11 (shallow): 25.8; ILSVRC'12 AlexNet (8 layers): 16.4; ILSVRC'13 (8 layers): 11.7; ILSVRC'14 VGG (19 layers): 7.3; ILSVRC'14 GoogleNet (22 layers): 6.7; ILSVRC'15 ResNet (152 layers): 3.57]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Revolution of Depth

[Diagram: AlexNet, 8 layers (ILSVRC 2012): 11x11 conv, 96, /4, pool/2 → 5x5 conv, 256, pool/2 → 3x3 conv, 384 → 3x3 conv, 384 → 3x3 conv, 256, pool/2 → fc 4096 → fc 4096 → fc 1000]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Revolution of Depth

[Diagram: AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014): stacked 3x3 convolutions (64, 128, 256, 512 filters) with pooling, followed by fc 4096, fc 4096, fc 1000; GoogleNet, 22 layers (ILSVRC 2014): stem convolutions followed by stacked Inception modules (parallel 1x1/3x3/5x5 convolutions and max pooling, depth-concatenated), with auxiliary softmax classifiers]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Revolution of Depth

[Diagram: ResNet, 152 layers (ILSVRC 2015): a 7x7 conv (64, /2) stem with pooling, stacked 1x1/3x3/1x1 bottleneck blocks (64/64/256, 128/128/512, 256/256/1024, 512/512/2048 filters, downsampling by /2 between stages), average pooling, and fc 1000; shown to scale next to AlexNet, 8 layers (ILSVRC 2012) and VGG, 19 layers (ILSVRC 2014)]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Revolution of Depth

[Chart: PASCAL VOC 2007 Object Detection mAP (%). HOG, DPM (shallow): 34; AlexNet (RCNN, 8 layers): 58; VGG (RCNN, 16 layers): 66; ResNet (Faster RCNN*, 101 layers): 86]

*w/ other improvements & more data

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Engines of visual recognition

[Image: ResNet's object detection result on COCO. *The original image is from the COCO dataset.]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Very simple, easy to follow

• Many third-party implementations (list in https://github.com/KaimingHe/deep-residual-networks)
  • Facebook AI Research's Torch ResNet
  • Torch, CIFAR-10, with ResNet-20 to ResNet-110, training code, and curves: code
  • Lasagne, CIFAR-10, with ResNet-32 and ResNet-56 and training code: code
  • Neon, CIFAR-10, with pre-trained ResNet-32 to ResNet-110 models, training code, and curves: code
  • Torch, MNIST, 100 layers: blog, code
  • A winning entry in Kaggle's right whale recognition challenge: blog, code
  • Neon, Place2 (mini), 40 layers: blog, code
  • …

• Easily reproduced results (e.g. Torch ResNet: https://github.com/facebook/fb.resnet.torch)

• A series of extensions and follow-ups
  • >200 citations in 6 months after being posted on arXiv (Dec. 2015)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Background

From shallow to deep

Traditional recognition

[Diagram: recognition pipelines of increasing depth, each ending in a classifier answering "bus"?
pixels → classifier
pixels → edges (SIFT/HOG) → histogram → classifier
pixels → edges → K-means / sparse code → histogram → classifier]

shallower → deeper

But what's next?

Deep Learning

[Diagram: the hand-engineered pipeline (edges → K-means / sparse code → histogram → classifier → "bus"?) vs. a deep network of repeated generic layers → "bus"?]

• Specialized components, domain knowledge required, vs. generic components ("layers"), less domain knowledge

• Repeat elementary layers => going deeper
  • end-to-end learning
  • richer solution space

Spectrum of Depth

shallower → deeper

• 5 layers: easy
• >10 layers: initialization, Batch Normalization
• >30 layers: skip connections
• >100 layers: identity skip connections
• >1000 layers: ?

Initialization

LeCun et al 1998 "Efficient Backprop"; Glorot & Bengio 2010 "Understanding the difficulty of training deep feedforward neural networks"

[Diagram: input $x$ → weight $W$ → output $y = Wx$]

If:
• linear activation
• $x$, $y$, $w$ independent
Then:

1-layer: $Var[y] = (n_{in} Var[w]) \, Var[x]$

Multi-layer: $Var[y] = \left( \prod_d n_d^{in} Var[w_d] \right) Var[x]$

where $n_{in}$, $n_{out}$ denote the number of input/output connections of a layer.

Initialization

LeCun et al 1998 "Efficient Backprop"; Glorot & Bengio 2010 "Understanding the difficulty of training deep feedforward neural networks"

[Plot: signal magnitude vs. depth (1 to 15), with "exploding", "ideal", and "vanishing" regimes]

Forward: $Var[y] = \left( \prod_d n_d^{in} Var[w_d] \right) Var[x]$

Backward: $Var\!\left[\frac{\partial E}{\partial x}\right] = \left( \prod_d n_d^{out} Var[w_d] \right) Var\!\left[\frac{\partial E}{\partial y}\right]$

Both the forward (response) and backward (gradient) signals can vanish or explode.
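As a quick illustration (not part of the tutorial), here is a NumPy sketch of this effect for deep linear nets; the width 512, depth 15, and the three std values are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 15            # illustrative width and depth
x = rng.standard_normal(n)

for std in [0.01, (1.0 / n) ** 0.5, 0.1]:  # too small, "ideal" (n*Var[w]=1), too large
    h = x.copy()
    for _ in range(depth):
        W = rng.normal(0.0, std, size=(n, n))  # i.i.d. Gaussian weights, linear layers
        h = W @ h
    print(f"std={std:.4f}  Var[y] after {depth} layers: {h.var():.3e}")

# Var[y] scales like (n * Var[w])^depth: it vanishes for std=0.01,
# stays O(1) for std=sqrt(1/n), and explodes for std=0.1.
```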

Initialization

• Initialization under the linear assumption

LeCun et al 1998 "Efficient Backprop"; Glorot & Bengio 2010 "Understanding the difficulty of training deep feedforward neural networks"

$\prod_d n_d^{in} Var[w_d] = const_{fwd}$ (healthy forward), and
$\prod_d n_d^{out} Var[w_d] = const_{bwd}$ (healthy backward)

$n_d^{in} Var[w_d] = 1$ or* $n_d^{out} Var[w_d] = 1$

*: $n_d^{out} = n_{d+1}^{in}$, so $\frac{const_{fwd}}{const_{bwd}} = \frac{n_1^{in}}{n_D^{out}} < \infty$. It is sufficient to use either form.

"Xavier" init in Caffe

Initialization

• Initialization under ReLU activation

$\prod_d \frac{1}{2} n_d^{in} Var[w_d] = const_{fwd}$ (healthy forward), and
$\prod_d \frac{1}{2} n_d^{out} Var[w_d] = const_{bwd}$ (healthy backward)

$\frac{1}{2} n_d^{in} Var[w_d] = 1$ or $\frac{1}{2} n_d^{out} Var[w_d] = 1$

With $D$ layers, a factor of 2 per layer has an exponential impact of $2^D$.

"MSRA" init in Caffe

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". ICCV 2015.
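A minimal sketch of the two rules for fully-connected layers (NumPy; the helper names `xavier_init`/`msra_init` and the 30-layer, 512-wide net are our illustrative choices, not Caffe's API):

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    # Glorot & Bengio: n_in * Var[w] = 1 (derived under a linear-activation assumption)
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))

def msra_init(n_in, n_out, rng):
    # He et al.: (1/2) * n_in * Var[w] = 1, compensating for ReLU halving the variance
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

rng = np.random.default_rng(0)
h = rng.standard_normal(512)
for init in (xavier_init, msra_init):
    x = h.copy()
    for _ in range(30):                       # 30 ReLU layers
        x = np.maximum(0.0, init(512, 512, rng) @ x)
    print(init.__name__, f"Var after 30 ReLU layers: {x.var():.3e}")

# Under ReLU, Xavier decays roughly like 2^-30; MSRA keeps the variance O(1).
```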

Initialization

[Plots: training error at the beginning of training. A 22-layer ReLU net: the good init ($\frac{1}{2} n Var[w] = 1$, "ours") converges faster than Xavier ($n Var[w] = 1$). A 30-layer ReLU net: only the good init is able to converge.]

*Figures show the beginning of training

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". ICCV 2015.

Batch Normalization (BN)

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.

• Normalizing the input (LeCun et al 1998 "Efficient Backprop")

• BN: normalizing each layer, for each mini-batch

• Greatly accelerates training

• Less sensitive to initialization

• Improves regularization

Batch Normalization (BN)

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.

[Diagram: layer output $x$ → $\hat{x} = \frac{x - \mu}{\sigma}$ → $y = \gamma \hat{x} + \beta$]

• $\mu$: mean of $x$ in the mini-batch
• $\sigma$: std of $x$ in the mini-batch
• $\gamma$: scale
• $\beta$: shift

• $\mu$, $\sigma$: functions of $x$, analogous to responses
• $\gamma$, $\beta$: parameters to be learned, analogous to weights

Batch Normalization (BN)

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.

[Diagram: layer output $x$ → $\hat{x} = \frac{x - \mu}{\sigma}$ → $y = \gamma \hat{x} + \beta$]

2 modes of BN:
• Train mode: $\mu$, $\sigma$ are functions of $x$; backprop gradients
• Test mode: $\mu$, $\sigma$ are pre-computed* on the training set

*: by running average, or by post-processing after training

Caution: make sure your BN is in the correct mode
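A minimal sketch of the two modes for one feature dimension (NumPy; the class name, the momentum 0.9, and the epsilon are illustrative assumptions, not the paper's notation):

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)   # learned scale/shift
        self.run_mu, self.run_var = np.zeros(dim), np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, train):
        if train:   # train mode: mu, sigma computed from the current mini-batch
            mu, var = x.mean(axis=0), x.var(axis=0)
            self.run_mu = self.momentum * self.run_mu + (1 - self.momentum) * mu
            self.run_var = self.momentum * self.run_var + (1 - self.momentum) * var
        else:       # test mode: mu, sigma are pre-computed running averages
            mu, var = self.run_mu, self.run_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D(4)
batch = np.random.default_rng(0).normal(3.0, 2.0, size=(32, 4))
y_train = bn(batch, train=True)    # normalized with batch statistics
y_test = bn(batch, train=False)    # normalized with running statistics
```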

Batch Normalization (BN)

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.

[Figure taken from S. Ioffe & C. Szegedy: accuracy vs. iteration; the BN variants reach the best accuracy of the net without BN in far fewer iterations]

Deep Residual Networks

From 10 layers to 100 layers

Going Deeper

• Initialization algorithms ✓
• Batch Normalization ✓

• Is learning better networks as simple as stacking more layers?

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Simply stacking layers?

[Plots: CIFAR-10 train error (%) and test error (%) vs. iter. (1e4); the 56-layer plain net sits above the 20-layer plain net on both curves]

• Plain nets: stacking 3x3 conv layers…
• The 56-layer net has higher training error and test error than the 20-layer net

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Simply stacking layers?

[Plots: CIFAR-10 error (%) vs. iter. (1e4) for plain-20/32/44/56, and ImageNet-1000 error (%) vs. iter. (1e4) for plain-18/34; solid: test/val, dashed: train; deeper plain nets have higher error throughout]

• "Overly deep" plain nets have higher training error
• A general phenomenon, observed in many datasets

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

[Diagram: a shallower model (18 layers) next to a deeper counterpart (34 layers); the deeper model contains the original layers plus "extra" layers]

• Richer solution space

• A deeper model should not have higher training error

• A solution by construction:
  • original layers: copied from a learned shallower model
  • extra layers: set as identity
  • at least the same training error

• Optimization difficulties: solvers cannot find the solution when going deeper…

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Deep Residual Learning

• Plain net

[Diagram: any two stacked layers; $x$ → weight layer → relu → weight layer → relu → $H(x)$]

$H(x)$ is any desired mapping;
hope the 2 weight layers fit $H(x)$

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Deep Residual Learning

• Residual net

[Diagram: $x$ → weight layer → relu → weight layer → $F(x)$, added to the identity shortcut $x$, then relu; output $H(x) = F(x) + x$]

$H(x)$ is any desired mapping;
instead of hoping the 2 weight layers fit $H(x)$,
hope the 2 weight layers fit $F(x)$,
and let $H(x) = F(x) + x$

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Deep Residual Learning

• $F(x)$ is a residual mapping w.r.t. identity

• If identity were optimal, it is easy to set the weights to 0

• If the optimal mapping is closer to identity, it is easier to find the small fluctuations

[Diagram: the residual block; $x$ → weight layer → relu → weight layer → $F(x)$; identity shortcut $x$; output $H(x) = F(x) + x$ → relu]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
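A minimal sketch of this block under these definitions (NumPy, with fully-connected weight layers standing in for 3x3 convolutions; the shapes and init are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

def residual_block(x, W1, W2):
    # F(x): two weight layers with a ReLU in between
    F = W2 @ relu(W1 @ x)
    # H(x) = F(x) + x: add the identity shortcut, then the after-add ReLU
    return relu(F + x)

n = 64
W1 = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))  # MSRA init
W2 = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))
x = rng.standard_normal(n)
y = residual_block(x, W1, W2)

# If the optimal mapping is the identity, the solver only needs to drive
# W1, W2 (hence F) toward zero, rather than fit an identity through two layers:
print(np.allclose(residual_block(x, 0 * W1, 0 * W2), relu(x)))  # True
```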

Related Works – Residual Representations

• VLAD & Fisher Vector [Jegou et al 2010], [Perronnin et al 2007]
  • Encoding residual vectors; powerful shallower representations.

• Product Quantization (IVF-ADC) [Jegou et al 2011]
  • Quantizing residual vectors; efficient nearest-neighbor search.

• MultiGrid & Hierarchical Precondition [Briggs et al 2000], [Szeliski 1990, 2006]
  • Solving residual sub-problems; efficient PDE solvers.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

[Diagram: a 34-layer plain net and its ResNet counterpart, side by side; 7x7 conv, 64, /2 → pool, /2 → stacked 3x3 conv stages (64, 128, 256, 512 filters, downsampling by /2 between stages) → avg pool → fc 1000, with shortcut connections added in the ResNet]

Network "Design"

• Keep it simple

• Our basic design (VGG-style):
  • all 3x3 conv (almost)
  • spatial size /2 => # filters x2 (~same complexity per layer)
  • simple design; just deep!

• Other remarks:
  • no hidden fc
  • no dropout

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Training

• All plain/residual nets are trained from scratch

• All plain/residual nets use Batch Normalization

• Standard hyper-parameters & augmentation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

CIFAR-10 experiments

[Plots: CIFAR-10 error (%) vs. iter. (1e4); plain nets (plain-20/32/44/56: deeper is worse) vs. ResNets (ResNet-20/32/44/56/110: deeper is better); solid: test, dashed: train]

• Deep ResNets can be trained without difficulties
• Deeper ResNets have lower training error, and also lower test error

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

ImageNet experiments

[Plots: ImageNet error (%) vs. iter. (1e4); plain nets (plain-18/34: the 34-layer is worse) vs. ResNets (ResNet-18/34: the 34-layer is better); solid: test, dashed: train]

• Deep ResNets can be trained without difficulties
• Deeper ResNets have lower training error, and also lower test error

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

ImageNet experiments

• A practical design for going deeper

[Diagram: two block designs of similar complexity; all-3x3 (64-d input: 3x3, 64 → relu → 3x3, 64) vs. the bottleneck used for ResNet-50/101/152 (256-d input: 1x1, 64 → relu → 3x3, 64 → relu → 1x1, 256)]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
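A quick back-of-the-envelope check of the "similar complexity" claim (plain Python, counting conv weights only and ignoring biases/BN):

```python
# all-3x3 block on a 64-d feature: two 3x3 convs, 64 -> 64 -> 64
basic = 3 * 3 * 64 * 64 + 3 * 3 * 64 * 64

# bottleneck block on a 256-d feature: 1x1 reduces 256 -> 64,
# the 3x3 conv operates at 64-d, and a 1x1 restores 64 -> 256
bottleneck = 1 * 1 * 256 * 64 + 3 * 3 * 64 * 64 + 1 * 1 * 64 * 256

print(basic, bottleneck)  # 73728 vs 69632: roughly the same cost per block,
# but the bottleneck block sits on a 4x wider (256-d) feature map.
```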

ImageNet experiments

[Chart: 10-crop testing, top-5 val error (%). ResNet-34: 7.4; ResNet-50: 6.7; ResNet-101: 6.1; ResNet-152: 5.7. The ResNet-152 model has lower time complexity than VGG-16/19.]

• Deeper ResNets have lower error

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

ImageNet experiments

[Chart: ImageNet Classification top-5 error (%) by year. ILSVRC'10 (shallow): 28.2; ILSVRC'11 (shallow): 25.8; ILSVRC'12 AlexNet (8 layers): 16.4; ILSVRC'13 (8 layers): 11.7; ILSVRC'14 VGG (19 layers): 7.3; ILSVRC'14 GoogleNet (22 layers): 6.7; ILSVRC'15 ResNet (152 layers): 3.57]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Discussions: Representation, Optimization, Generalization

Issues on learning deep models

• Representation ability
  • ability of the model to fit the training data, if the optimum could be found
  • if model A's solution space is a superset of B's, A should be better

• Optimization ability
  • feasibility of finding an optimum
  • not all models are equally easy to optimize

• Generalization ability
  • once the training data is fit, how good is the test performance

How do ResNets address these issues?

• Representation ability
  • no explicit advantage on representation (only re-parameterization), but
  • allows models to go deeper

• Optimization ability
  • enables very smooth forward/backward propagation
  • greatly eases optimizing deeper models

• Generalization ability
  • not explicitly addressed, but
  • deeper + thinner is good for generalization

On the Importance of Identity Mapping

From 100 layers to 1000 layers

On identity mappings for optimization

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

$x_{l+1} = f(h(x_l) + F(x_l))$

[Diagram: $x_l$ feeds both the shortcut $h(x_l)$ and the two-layer residual branch $F(x_l)$; their sum passes through $f$]

• shortcut mapping: $h$ = identity
• after-add mapping: $f$ = ReLU

On identity mappings for optimization

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

$x_{l+1} = f(h(x_l) + F(x_l))$

• shortcut mapping: $h$ = identity
• after-add mapping: $f$ = ReLU
• What if $f$ = identity?

Very smooth forward propagation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

With $h$ and $f$ both identity:

$x_{l+1} = x_l + F(x_l)$

$x_{l+2} = x_{l+1} + F(x_{l+1}) = x_l + F(x_l) + F(x_{l+1})$

$x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$

[Diagram: the 34-layer ResNet, with $x_l$ at a lower layer connected directly to $x_L$ at any higher layer through the chain of identity shortcuts]

• Any $x_l$ is directly forward-propagated to any $x_L$, plus residual: $x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$
• Any $x_L$ is an additive outcome
• in contrast to multiplicative: $x_L = \prod_{i=l}^{L-1} W_i \, x_l$
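A tiny numerical check of this unrolled form (NumPy; the linear residual branches and the widths are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 8, 20
Ws = [rng.normal(0.0, 0.1, size=(n, n)) for _ in range(L)]
F = lambda i, x: Ws[i] @ x            # residual branch of unit i (linear here)

x = rng.standard_normal(n)
xs = [x]
for i in range(L):                    # x_{l+1} = x_l + F(x_l), with h = f = identity
    xs.append(xs[-1] + F(i, xs[-1]))

# x_L equals x_0 plus the sum of all residual-branch outputs:
unrolled = xs[0] + sum(F(i, xs[i]) for i in range(L))
print(np.allclose(xs[-1], unrolled))  # True
```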

Very smooth backward propagation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

$x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$

[Diagram: the 34-layer ResNet, with the gradient $\frac{\partial E}{\partial x_L}$ flowing directly back to $\frac{\partial E}{\partial x_l}$ through the identity shortcuts]

$\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \frac{\partial x_L}{\partial x_l} = \frac{\partial E}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i) \right)$

Very smooth backward propagation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

$\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i) \right)$

• Any $\frac{\partial E}{\partial x_L}$ is directly back-propagated to any $\frac{\partial E}{\partial x_l}$, plus residual
• Any $\frac{\partial E}{\partial x_l}$ is additive; unlikely to vanish
• in contrast to multiplicative: $\frac{\partial E}{\partial x_l} = \prod_{i=l}^{L-1} W_i \, \frac{\partial E}{\partial x_L}$

Residual for every layer

forward: $x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$

backward: $\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i) \right)$

Enabled by:
• shortcut mapping: $h$ = identity
• after-add mapping: $f$ = identity

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

Experiments

• Set 1: what if the shortcut mapping $h \neq$ identity?

• Set 2: what if the after-add mapping $f$ is identity?

• Experiments on ResNets with more than 100 layers
  • deeper models suffer more from optimization difficulty

Experiment Set 1: what if the shortcut mapping $h \neq$ identity?

[Diagrams: six variants of the shortcut in a 3x3-conv residual unit, with CIFAR-10 errors for ResNet-110:
(a) original: $h(x) = x$, error 6.6%
(b) constant scaling (0.5 on both paths): $h(x) = 0.5x$, error 12.4%
(c) exclusive gating* (1x1 conv + sigmoid gate on both paths): $h(x) =$ gate $\cdot x$, error 8.7%
(d) shortcut-only gating: $h(x) =$ gate $\cdot x$, error 12.9%
(e) conv shortcut: $h(x) =$ conv$(x)$, error 12.2%
(f) dropout shortcut: $h(x) =$ dropout$(x)$, error >20%]

In (b)–(f), the shortcuts are blocked by multiplications.

*similar to "Highway Network"
*ResNet-110 on CIFAR-10

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.


If $h$ is multiplicative, e.g. $h(x) = \lambda x$:

forward: $x_L = \lambda^{L-l} x_l + \sum_{i=l}^{L-1} \hat{F}(x_i)$

backward: $\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \left( \lambda^{L-l} + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \hat{F}(x_i) \right)$

• if $h$ is multiplicative, shortcuts are blocked
• direct propagation is decayed

*assuming $f$ = identity

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.
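The decay of the direct path is easy to see numerically (plain Python; $\lambda = 0.9$ and the unit counts are illustrative):

```python
# Direct-path coefficient lambda^(L - l) for a scaled shortcut h(x) = 0.9 * x.
lam = 0.9
for depth in (10, 54, 100):     # number of units between layer l and layer L
    print(depth, lam ** depth)  # ~0.35, ~0.0034, ~2.7e-05: the shortcut signal decays

# With h = identity (lambda = 1), the coefficient stays exactly 1 at any depth.
```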

[Diagrams: a residual unit where $h$ is identity vs. one where $h$ is gating (1x1 conv + sigmoid); plot of error vs. iteration, solid: test, dashed: train; the gated version is worse]

• gating should have better representation ability (identity is a special case), but
• optimization difficulty dominates the results

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

Experiment Set 2: what if the after-add mapping $f$ is identity?

[Diagrams: three residual-unit designs;
$f$ is ReLU (original ResNet): weight → BN → ReLU → weight → BN → addition → ReLU;
$f$ is BN + ReLU: weight → BN → ReLU → weight → addition → BN → ReLU;
$f$ is identity (pre-activation ResNet): BN → ReLU → weight → BN → ReLU → weight → addition]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.
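A minimal sketch contrasting the original and pre-activation orderings (NumPy; `bn` here is a toy stand-in for batch normalization, simplified to per-feature standardization without learned scale/shift):

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)
bn = lambda z: (z - z.mean(axis=0)) / (z.std(axis=0) + 1e-5)  # toy BN

def original_unit(x, W1, W2):
    # f = ReLU: weight -> BN -> ReLU -> weight -> BN -> addition -> ReLU
    out = bn(x @ W1.T)
    out = bn(relu(out) @ W2.T)
    return relu(x + out)          # the after-add ReLU sits on the shortcut path

def preact_unit(x, W1, W2):
    # f = identity: BN -> ReLU -> weight -> BN -> ReLU -> weight -> addition
    out = relu(bn(x)) @ W1.T
    out = relu(bn(out)) @ W2.T
    return x + out                # the shortcut path stays a pure identity

rng = np.random.default_rng(0)
W1, W2 = rng.normal(0, 0.1, (64, 64)), rng.normal(0, 0.1, (64, 64))
x = rng.standard_normal((32, 64))
print(original_unit(x, W1, W2).shape, preact_unit(x, W1, W2).shape)
```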

[Diagrams: $f$ = ReLU vs. $f$ = BN + ReLU residual units; plot of error vs. iteration, solid: test, dashed: train; the BN-after-addition variant is worse]

• BN could block propagation
• Keep the shortest path as smooth as possible

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

[Diagrams: $f$ = ReLU vs. $f$ = identity (pre-activation) residual units; plot for 1001-layer ResNets on CIFAR-10, solid: test, dashed: train; the pre-activation version trains and generalizes better]

• ReLU could block propagation when there are 1000 layers
• the pre-activation design eases optimization (and improves generalization; see paper)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

Comparisons on CIFAR-10/100

CIFAR-10:
method | error (%)
NIN | 8.81
DSN | 8.22
FitNet | 8.39
Highway | 7.72
ResNet-110 (1.7M) | 6.61
ResNet-1202 (19.4M) | 7.93
ResNet-164, pre-activation (1.7M) | 5.46
ResNet-1001, pre-activation (10.2M) | 4.92 (4.89±0.14)

CIFAR-100:
method | error (%)
NIN | 35.68
DSN | 34.57
FitNet | 35.04
Highway | 32.39
ResNet-164 (1.7M) | 25.16
ResNet-1001 (10.2M) | 27.82
ResNet-164, pre-activation (1.7M) | 24.33
ResNet-1001, pre-activation (10.2M) | 22.71 (22.68±0.22)

*all based on moderate augmentation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

ImageNet Experiments

ImageNet single-crop (320x320) val error:
method | data augmentation | top-1 error (%) | top-5 error (%)
ResNet-152, original | scale | 21.3 | 5.5
ResNet-152, pre-activation | scale | 21.1 | 5.5
ResNet-200, original | scale | 21.8 | 6.0
ResNet-200, pre-activation | scale | 20.7 | 5.3
ResNet-200, pre-activation | scale + aspect ratio | 20.1* | 4.8*

*independently reproduced by: https://github.com/facebook/fb.resnet.torch/tree/master/pretrained#notes

Training code and models available.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

Summary of observations

• Keep the shortest path as smooth as possible
  • by making $h$ and $f$ identity
  • forward/backward signals directly flow through this path

• Features of any layers are additive outcomes

• 1000-layer ResNets can be easily trained and have better accuracy

[Diagram: the pre-activation residual unit; BN → ReLU → weight → BN → ReLU → weight → addition]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

Future Works

• Representation
  • skipping 1 layer vs. multiple layers?
  • flat vs. bottleneck?
  • Inception-ResNet [Szegedy et al 2016]
  • ResNet in ResNet [Targ et al 2016]
  • width vs. depth [Zagoruyko & Komodakis 2016]

• Generalization
  • DropOut, MaxOut, DropConnect, …
  • DropLayer (Stochastic Depth) [Huang et al 2016]

• Optimization
  • without residual/shortcut?

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

Applications

"Features matter."

"Features matter." (quote from [Girshick et al. 2014], the R-CNN paper)

task | 2nd-place winner | ResNets | margin (relative)
ImageNet Localization (top-5 error) | 12.0 | 9.0 | 27%
ImageNet Detection (mAP@.5) | 53.6 | 62.1 | 16%
COCO Detection (mAP@.5:.95) | 33.5 | 37.3 | 11%
COCO Segmentation (mAP@.5:.95) | 25.1 | 28.2 | 12%

absolute 8.5% better!

• Our results are all based on ResNet-101
• Deeper features are well transferrable

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Revolution of Depth

[Chart: PASCAL VOC 2007 Object Detection mAP (%). HOG, DPM (shallow): 34; AlexNet (RCNN, 8 layers): 58; VGG (RCNN, 16 layers): 66; ResNet (Faster RCNN*, 101 layers): 86]

*w/ other improvements & more data

Engines of visual recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Deep Learning for Computer Vision

[Diagram: a backbone structure is pre-trained as a classification network on ImageNet data; its features are then fine-tuned on target data as a detection network (e.g. R-CNN), a segmentation network (e.g. FCN), a human pose estimation network, a depth estimation network, …]

Example: Object Detection

[Images: Image Classification (what?) vs. Object Detection (what + where?), e.g. boat, person]

Object Detection: R-CNN

[Diagram: region-based CNN pipeline; input image → ~2,000 region proposals → 1 CNN for each warped region → classify regions (aeroplane? no. … person? yes. … tv monitor? no.); figure credit: R. Girshick et al.]

Girshick, Donahue, Darrell, Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014.

Object Detection: R-CNN

• R-CNN

[Diagram: pre-computed Regions-of-Interest (RoIs); each RoI of the image is fed through its own CNN to produce a feature]

Girshick, Donahue, Darrell, Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014.

Object Detection: Fast R-CNN

• Fast R-CNN

[Diagram: the image passes once through shared conv layers to a feature map; pre-computed RoIs are pooled from it (RoI pooling) into per-region features; end-to-end training]

Girshick. Fast R-CNN. ICCV 2015.

Object Detection: Faster R-CNN

• Faster R-CNN
  • solely based on CNNs
  • no external modules
  • each step is end-to-end

[Diagram: image → CNN → feature map → Region Proposal Net → proposals → RoI pooling → features; end-to-end training]

Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks". NIPS 2015.

Object Detection

[Diagram: a backbone structure (classification network pre-trained on ImageNet data) is fine-tuned on detection data as a detection network]

"plug-in" features:
• AlexNet
• VGG-16
• GoogleNet
• ResNet-101
• …

detectors (independently developed):
• R-CNN
• Fast R-CNN
• Faster R-CNN
• MultiBox
• SSD
• …

Object Detection

• Simply "Faster R-CNN + ResNet"

[Diagram: image → CNN → feature map → Region Proposal Net → proposals → RoI pooling → classifier]

COCO detection results:
Faster R-CNN baseline | mAP@.5 | mAP@.5:.95
VGG-16 | 41.5 | 21.5
ResNet-101 | 48.4 | 27.2

ResNet-101 has a 28% relative gain vs. VGG-16

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks". NIPS 2015.

Object Detection

• RPN learns proposals by extremely deep nets
  • we use only 300 proposals (no hand-designed proposals)

• Added components:
  • iterative localization
  • context modeling
  • multi-scale testing

• All are based on CNN features; all are end-to-end

• All benefit more from deeper features – cumulative gains!

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks". NIPS 2015.

[Images: ResNet's object detection results on COCO. *The original images are from the COCO dataset.]

[Video: results on real video; models trained on MS COCO (80 categories); frame-by-frame, no temporal processing. This video is available online: https://youtu.be/WZmSMkK9VuA]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". arXiv 2015.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks". NIPS 2015.

More Visual Recognition Tasks

ResNet-based methods lead on these benchmarks (incomplete list):
• ImageNet classification, detection, localization
• MS COCO detection, segmentation
• PASCAL VOC detection, segmentation

• Human pose estimation [Newell et al 2016]
• Depth estimation [Laina et al 2016]
• Segment proposal [Pinheiro et al 2016]
• …

[Screenshots: PASCAL detection and segmentation leaderboards, topped by ResNet-101 entries]

Potential Applications

ResNets have shown outstanding or promising results on:

• Visual Recognition
• Image Generation (Pixel RNN, Neural Art, etc.)
• Natural Language Processing (very deep CNN)
• Speech Recognition (preliminary results)
• Advertising, user prediction (preliminary results)

Conclusions of the Tutorial

• Deep Residual Learning:
  • ultra-deep networks can be easy to train
  • ultra-deep networks can gain accuracy from depth
  • ultra-deep representations are well transferrable
  • now 200 layers on ImageNet and 1000 layers on CIFAR!

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

Resources

• Models and Code
  • Our ImageNet models in Caffe: https://github.com/KaimingHe/deep-residual-networks
  • Many available implementations (list in https://github.com/KaimingHe/deep-residual-networks)
  • Facebook AI Research's Torch ResNet: https://github.com/facebook/fb.resnet.torch
  • Torch, CIFAR-10, with ResNet-20 to ResNet-110, training code, and curves: code
  • Lasagne, CIFAR-10, with ResNet-32 and ResNet-56 and training code: code
  • Neon, CIFAR-10, with pre-trained ResNet-32 to ResNet-110 models, training code, and curves: code
  • Torch, MNIST, 100 layers: blog, code
  • A winning entry in Kaggle's right whale recognition challenge: blog, code
  • Neon, Place2 (mini), 40 layers: blog, code
  • …

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.