
Deep Residual Networks: Deep Learning Gets Way Deeper

8:30-10:30am, June 19, ICML 2016 tutorial

Kaiming He, Facebook AI Research*

*as of July 2016. Formerly affiliated with Microsoft Research Asia

[Background figure: the ResNet-152 architecture, from a 7x7 conv (64, /2) stem with pooling, through stacked 1x1/3x3/1x1 bottleneck blocks (64/64/256, then 128/128/512, 256/256/1024, 512/512/2048 filters, downsampling by /2 between stages), to average pooling and a 1000-way fc layer.]

Overview

• Introduction
• Background: from shallow to deep
• Deep Residual Networks: from 10 layers to 100 layers; from 100 layers to 1000 layers
• Applications
• Q&A

Introduction


Deep Residual Networks (ResNets)
• "Deep Residual Learning for Image Recognition". CVPR 2016 (next week)

• A simple and clean framework for training "very" deep nets

• State-of-the-art performance for
  • image classification
  • object detection
  • semantic segmentation
  • and more…

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

ResNets @ ILSVRC & COCO 2015 Competitions

• 1st places in all five main tracks
  • ImageNet Classification: "ultra-deep" 152-layer nets
  • ImageNet Detection: 16% better than 2nd
  • ImageNet Localization: 27% better than 2nd
  • COCO Detection: 11% better than 2nd
  • COCO Segmentation: 12% better than 2nd

*improvements are relative numbers

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Revolution of Depth

[Chart: ImageNet Classification top-5 error (%) by year. ILSVRC'10 (shallow): 28.2; ILSVRC'11 (shallow): 25.8; ILSVRC'12 AlexNet (8 layers): 16.4; ILSVRC'13 (8 layers): 11.7; ILSVRC'14 VGG (19 layers): 7.3; ILSVRC'14 GoogleNet (22 layers): 6.7; ILSVRC'15 ResNet (152 layers): 3.57]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Revolution of Depth

[Diagram: AlexNet, 8 layers (ILSVRC 2012): 11x11 conv, 96, /4, pool/2 → 5x5 conv, 256, pool/2 → 3x3 conv, 384 → 3x3 conv, 384 → 3x3 conv, 256, pool/2 → fc 4096 → fc 4096 → fc 1000]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Revolution of Depth

[Diagram: AlexNet, 8 layers (ILSVRC 2012); VGG, 19 layers (ILSVRC 2014): stacked 3x3 convolutions (64, 128, 256, 512 filters) with pooling, followed by fc 4096, fc 4096, fc 1000; GoogleNet, 22 layers (ILSVRC 2014): stem convolutions followed by stacked Inception modules (parallel 1x1/3x3/5x5 convolutions and max pooling, depth-concatenated), with auxiliary softmax classifiers]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Revolution of Depth

[Diagram: ResNet, 152 layers (ILSVRC 2015): a 7x7 conv (64, /2) stem with pooling, stacked 1x1/3x3/1x1 bottleneck blocks (64/64/256, 128/128/512, 256/256/1024, 512/512/2048 filters, downsampling by /2 between stages), average pooling, and fc 1000; shown to scale next to AlexNet, 8 layers (ILSVRC 2012) and VGG, 19 layers (ILSVRC 2014)]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Revolution of Depth

[Chart: PASCAL VOC 2007 Object Detection mAP (%). HOG, DPM (shallow): 34; AlexNet (RCNN, 8 layers): 58; VGG (RCNN, 16 layers): 66; ResNet (Faster RCNN*, 101 layers): 86]

*w/ other improvements & more data

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Engines of visual recognition

[Image: ResNet's object detection result on COCO. *The original image is from the COCO dataset.]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Very simple, easy to follow

• Many third-party implementations (list in https://github.com/KaimingHe/deep-residual-networks)
  • Facebook AI Research's Torch ResNet
  • Torch, CIFAR-10, with ResNet-20 to ResNet-110, training code, and curves: code
  • Lasagne, CIFAR-10, with ResNet-32 and ResNet-56 and training code: code
  • Neon, CIFAR-10, with pre-trained ResNet-32 to ResNet-110 models, training code, and curves: code
  • Torch, MNIST, 100 layers: blog, code
  • A winning entry in Kaggle's right whale recognition challenge: blog, code
  • Neon, Place2 (mini), 40 layers: blog, code
  • …

• Easily reproduced results (e.g. Torch ResNet: https://github.com/facebook/fb.resnet.torch)

• A series of extensions and follow-ups
  • >200 citations in 6 months after being posted on arXiv (Dec. 2015)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Background

From shallow to deep

Traditional recognition

[Diagram: recognition pipelines of increasing depth, each ending in a classifier answering "bus"?
pixels → classifier
pixels → edges (SIFT/HOG) → histogram → classifier
pixels → edges → K-means / sparse code → histogram → classifier]

shallower → deeper

But what's next?

Deep Learning

[Diagram: the hand-engineered pipeline (edges → K-means / sparse code → histogram → classifier → "bus"?) vs. a deep network of repeated generic layers → "bus"?]

• Specialized components, domain knowledge required, vs. generic components ("layers"), less domain knowledge

• Repeat elementary layers => going deeper
  • end-to-end learning
  • richer solution space

Spectrum of Depth

shallower → deeper

• 5 layers: easy
• >10 layers: initialization, Batch Normalization
• >30 layers: skip connections
• >100 layers: identity skip connections
• >1000 layers: ?

Initialization

LeCun et al 1998 "Efficient Backprop"; Glorot & Bengio 2010 "Understanding the difficulty of training deep feedforward neural networks"

[Diagram: input $x$ → weight $W$ → output $y = Wx$]

If:
• linear activation
• $x$, $y$, $w$ independent
Then:

1-layer: $Var[y] = (n_{in} Var[w]) \, Var[x]$

Multi-layer: $Var[y] = \left( \prod_d n_d^{in} Var[w_d] \right) Var[x]$

where $n_{in}$, $n_{out}$ denote the number of input/output connections of a layer.

Initialization

LeCun et al 1998 "Efficient Backprop"; Glorot & Bengio 2010 "Understanding the difficulty of training deep feedforward neural networks"

[Plot: signal magnitude vs. depth (1 to 15), with "exploding", "ideal", and "vanishing" regimes]

Forward: $Var[y] = \left( \prod_d n_d^{in} Var[w_d] \right) Var[x]$

Backward: $Var\!\left[\frac{\partial E}{\partial x}\right] = \left( \prod_d n_d^{out} Var[w_d] \right) Var\!\left[\frac{\partial E}{\partial y}\right]$

Both the forward (response) and backward (gradient) signals can vanish or explode.
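As a quick illustration (not part of the tutorial), here is a NumPy sketch of this effect for deep linear nets; the width 512, depth 15, and the three std values are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 512, 15            # illustrative width and depth
x = rng.standard_normal(n)

for std in [0.01, (1.0 / n) ** 0.5, 0.1]:  # too small, "ideal" (n*Var[w]=1), too large
    h = x.copy()
    for _ in range(depth):
        W = rng.normal(0.0, std, size=(n, n))  # i.i.d. Gaussian weights, linear layers
        h = W @ h
    print(f"std={std:.4f}  Var[y] after {depth} layers: {h.var():.3e}")

# Var[y] scales like (n * Var[w])^depth: it vanishes for std=0.01,
# stays O(1) for std=sqrt(1/n), and explodes for std=0.1.
```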

Initialization

• Initialization under the linear assumption

LeCun et al 1998 "Efficient Backprop"; Glorot & Bengio 2010 "Understanding the difficulty of training deep feedforward neural networks"

$\prod_d n_d^{in} Var[w_d] = const_{fwd}$ (healthy forward), and
$\prod_d n_d^{out} Var[w_d] = const_{bwd}$ (healthy backward)

$n_d^{in} Var[w_d] = 1$ or* $n_d^{out} Var[w_d] = 1$

*: $n_d^{out} = n_{d+1}^{in}$, so $\frac{const_{fwd}}{const_{bwd}} = \frac{n_1^{in}}{n_D^{out}} < \infty$. It is sufficient to use either form.

"Xavier" init in Caffe

Initialization

• Initialization under ReLU activation

$\prod_d \frac{1}{2} n_d^{in} Var[w_d] = const_{fwd}$ (healthy forward), and
$\prod_d \frac{1}{2} n_d^{out} Var[w_d] = const_{bwd}$ (healthy backward)

$\frac{1}{2} n_d^{in} Var[w_d] = 1$ or $\frac{1}{2} n_d^{out} Var[w_d] = 1$

With $D$ layers, a factor of 2 per layer has an exponential impact of $2^D$.

"MSRA" init in Caffe

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". ICCV 2015.
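A minimal sketch of the two rules for fully-connected layers (NumPy; the helper names `xavier_init`/`msra_init` and the 30-layer, 512-wide net are our illustrative choices, not Caffe's API):

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    # Glorot & Bengio: n_in * Var[w] = 1 (derived under a linear-activation assumption)
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))

def msra_init(n_in, n_out, rng):
    # He et al.: (1/2) * n_in * Var[w] = 1, compensating for ReLU halving the variance
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

rng = np.random.default_rng(0)
h = rng.standard_normal(512)
for init in (xavier_init, msra_init):
    x = h.copy()
    for _ in range(30):                       # 30 ReLU layers
        x = np.maximum(0.0, init(512, 512, rng) @ x)
    print(init.__name__, f"Var after 30 ReLU layers: {x.var():.3e}")

# Under ReLU, Xavier decays roughly like 2^-30; MSRA keeps the variance O(1).
```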

Initialization

[Plots: training error at the beginning of training. A 22-layer ReLU net: the good init ($\frac{1}{2} n Var[w] = 1$, "ours") converges faster than Xavier ($n Var[w] = 1$). A 30-layer ReLU net: only the good init is able to converge.]

*Figures show the beginning of training

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification". ICCV 2015.

Batch Normalization (BN)

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.

• Normalizing the input (LeCun et al 1998 "Efficient Backprop")

• BN: normalizing each layer, for each mini-batch

• Greatly accelerates training

• Less sensitive to initialization

• Improves regularization

Batch Normalization (BN)

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.

[Diagram: layer output $x$ → $\hat{x} = \frac{x - \mu}{\sigma}$ → $y = \gamma \hat{x} + \beta$]

• $\mu$: mean of $x$ in the mini-batch
• $\sigma$: std of $x$ in the mini-batch
• $\gamma$: scale
• $\beta$: shift

• $\mu$, $\sigma$: functions of $x$, analogous to responses
• $\gamma$, $\beta$: parameters to be learned, analogous to weights

Batch Normalization (BN)

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.

[Diagram: layer output $x$ → $\hat{x} = \frac{x - \mu}{\sigma}$ → $y = \gamma \hat{x} + \beta$]

2 modes of BN:
• Train mode: $\mu$, $\sigma$ are functions of $x$; backprop gradients
• Test mode: $\mu$, $\sigma$ are pre-computed* on the training set

*: by running average, or by post-processing after training

Caution: make sure your BN is in the correct mode
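A minimal sketch of the two modes for one feature dimension (NumPy; the class name, the momentum 0.9, and the epsilon are illustrative assumptions, not the paper's notation):

```python
import numpy as np

class BatchNorm1D:
    def __init__(self, dim, momentum=0.9, eps=1e-5):
        self.gamma, self.beta = np.ones(dim), np.zeros(dim)   # learned scale/shift
        self.run_mu, self.run_var = np.zeros(dim), np.ones(dim)
        self.momentum, self.eps = momentum, eps

    def __call__(self, x, train):
        if train:   # train mode: mu, sigma computed from the current mini-batch
            mu, var = x.mean(axis=0), x.var(axis=0)
            self.run_mu = self.momentum * self.run_mu + (1 - self.momentum) * mu
            self.run_var = self.momentum * self.run_var + (1 - self.momentum) * var
        else:       # test mode: mu, sigma are pre-computed running averages
            mu, var = self.run_mu, self.run_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

bn = BatchNorm1D(4)
batch = np.random.default_rng(0).normal(3.0, 2.0, size=(32, 4))
y_train = bn(batch, train=True)    # normalized with batch statistics
y_test = bn(batch, train=False)    # normalized with running statistics
```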

Batch Normalization (BN)

S. Ioffe & C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. ICML 2015.

[Figure taken from S. Ioffe & C. Szegedy: accuracy vs. iteration; the BN variants reach the best accuracy of the net without BN in far fewer iterations]

Deep Residual Networks

From 10 layers to 100 layers

Going Deeper

• Initialization algorithms ✓
• Batch Normalization ✓

• Is learning better networks as simple as stacking more layers?

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Simply stacking layers?

[Plots: CIFAR-10 train error (%) and test error (%) vs. iter. (1e4); the 56-layer plain net sits above the 20-layer plain net on both curves]

• Plain nets: stacking 3x3 conv layers…
• The 56-layer net has higher training error and test error than the 20-layer net

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Simply stacking layers?

[Plots: CIFAR-10 error (%) vs. iter. (1e4) for plain-20/32/44/56, and ImageNet-1000 error (%) vs. iter. (1e4) for plain-18/34; solid: test/val, dashed: train; deeper plain nets have higher error throughout]

• "Overly deep" plain nets have higher training error
• A general phenomenon, observed in many datasets

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

[Diagram: a shallower model (18 layers) next to a deeper counterpart (34 layers); the deeper model contains the original layers plus "extra" layers]

• Richer solution space

• A deeper model should not have higher training error

• A solution by construction:
  • original layers: copied from a learned shallower model
  • extra layers: set as identity
  • at least the same training error

• Optimization difficulties: solvers cannot find the solution when going deeper…

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Deep Residual Learning

• Plain net

[Diagram: any two stacked layers; $x$ → weight layer → relu → weight layer → relu → $H(x)$]

$H(x)$ is any desired mapping;
hope the 2 weight layers fit $H(x)$

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Deep Residual Learning

• Residual net

[Diagram: $x$ → weight layer → relu → weight layer → $F(x)$, added to the identity shortcut $x$, then relu; output $H(x) = F(x) + x$]

$H(x)$ is any desired mapping;
instead of hoping the 2 weight layers fit $H(x)$,
hope the 2 weight layers fit $F(x)$,
and let $H(x) = F(x) + x$

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Deep Residual Learning

• $F(x)$ is a residual mapping w.r.t. identity

• If identity were optimal, it is easy to set the weights to 0

• If the optimal mapping is closer to identity, it is easier to find the small fluctuations

[Diagram: the residual block; $x$ → weight layer → relu → weight layer → $F(x)$; identity shortcut $x$; output $H(x) = F(x) + x$ → relu]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
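A minimal sketch of this block under these definitions (NumPy, with fully-connected weight layers standing in for 3x3 convolutions; the shapes and init are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
relu = lambda z: np.maximum(0.0, z)

def residual_block(x, W1, W2):
    # F(x): two weight layers with a ReLU in between
    F = W2 @ relu(W1 @ x)
    # H(x) = F(x) + x: add the identity shortcut, then the after-add ReLU
    return relu(F + x)

n = 64
W1 = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))  # MSRA init
W2 = rng.normal(0.0, np.sqrt(2.0 / n), size=(n, n))
x = rng.standard_normal(n)
y = residual_block(x, W1, W2)

# If the optimal mapping is the identity, the solver only needs to drive
# W1, W2 (hence F) toward zero, rather than fit an identity through two layers:
print(np.allclose(residual_block(x, 0 * W1, 0 * W2), relu(x)))  # True
```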

Related Works – Residual Representations

• VLAD & Fisher Vector [Jegou et al 2010], [Perronnin et al 2007]
  • Encoding residual vectors; powerful shallower representations.

• Product Quantization (IVF-ADC) [Jegou et al 2011]
  • Quantizing residual vectors; efficient nearest-neighbor search.

• MultiGrid & Hierarchical Precondition [Briggs et al 2000], [Szeliski 1990, 2006]
  • Solving residual sub-problems; efficient PDE solvers.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

[Diagram: a 34-layer plain net and its ResNet counterpart, side by side; 7x7 conv, 64, /2 → pool, /2 → stacked 3x3 conv stages (64, 128, 256, 512 filters, downsampling by /2 between stages) → avg pool → fc 1000, with shortcut connections added in the ResNet]

Network "Design"

• Keep it simple

• Our basic design (VGG-style):
  • all 3x3 conv (almost)
  • spatial size /2 => # filters x2 (~same complexity per layer)
  • simple design; just deep!

• Other remarks:
  • no hidden fc
  • no dropout

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Training

• All plain/residual nets are trained from scratch

• All plain/residual nets use Batch Normalization

• Standard hyper-parameters & augmentation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

CIFAR-10 experiments

[Plots: CIFAR-10 error (%) vs. iter. (1e4); plain nets (plain-20/32/44/56: deeper is worse) vs. ResNets (ResNet-20/32/44/56/110: deeper is better); solid: test, dashed: train]

• Deep ResNets can be trained without difficulties
• Deeper ResNets have lower training error, and also lower test error

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

ImageNet experiments

[Plots: ImageNet error (%) vs. iter. (1e4); plain nets (plain-18/34: the 34-layer is worse) vs. ResNets (ResNet-18/34: the 34-layer is better); solid: test, dashed: train]

• Deep ResNets can be trained without difficulties
• Deeper ResNets have lower training error, and also lower test error

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

ImageNet experiments

• A practical design for going deeper

[Diagram: two block designs of similar complexity; all-3x3 (64-d input: 3x3, 64 → relu → 3x3, 64) vs. the bottleneck used for ResNet-50/101/152 (256-d input: 1x1, 64 → relu → 3x3, 64 → relu → 1x1, 256)]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
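A quick back-of-the-envelope check of the "similar complexity" claim (plain Python, counting conv weights only and ignoring biases/BN):

```python
# all-3x3 block on a 64-d feature: two 3x3 convs, 64 -> 64 -> 64
basic = 3 * 3 * 64 * 64 + 3 * 3 * 64 * 64

# bottleneck block on a 256-d feature: 1x1 reduces 256 -> 64,
# the 3x3 conv operates at 64-d, and a 1x1 restores 64 -> 256
bottleneck = 1 * 1 * 256 * 64 + 3 * 3 * 64 * 64 + 1 * 1 * 64 * 256

print(basic, bottleneck)  # 73728 vs 69632: roughly the same cost per block,
# but the bottleneck block sits on a 4x wider (256-d) feature map.
```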

ImageNet experiments

[Chart: 10-crop testing, top-5 val error (%). ResNet-34: 7.4; ResNet-50: 6.7; ResNet-101: 6.1; ResNet-152: 5.7. The ResNet-152 model has lower time complexity than VGG-16/19.]

• Deeper ResNets have lower error

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

ImageNet experiments

[Chart: ImageNet Classification top-5 error (%) by year. ILSVRC'10 (shallow): 28.2; ILSVRC'11 (shallow): 25.8; ILSVRC'12 AlexNet (8 layers): 16.4; ILSVRC'13 (8 layers): 11.7; ILSVRC'14 VGG (19 layers): 7.3; ILSVRC'14 GoogleNet (22 layers): 6.7; ILSVRC'15 ResNet (152 layers): 3.57]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Discussions: Representation, Optimization, Generalization

Issues on learning deep models

• Representation ability
  • ability of the model to fit the training data, if the optimum could be found
  • if model A's solution space is a superset of B's, A should be better

• Optimization ability
  • feasibility of finding an optimum
  • not all models are equally easy to optimize

• Generalization ability
  • once the training data is fit, how good is the test performance

How do ResNets address these issues?

• Representation ability
  • no explicit advantage on representation (only re-parameterization), but
  • allows models to go deeper

• Optimization ability
  • enables very smooth forward/backward propagation
  • greatly eases optimizing deeper models

• Generalization ability
  • not explicitly addressed, but
  • deeper + thinner is good for generalization

On the Importance of Identity Mapping

From 100 layers to 1000 layers

On identity mappings for optimization

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

$x_{l+1} = f(h(x_l) + F(x_l))$

[Diagram: $x_l$ feeds both the shortcut $h(x_l)$ and the two-layer residual branch $F(x_l)$; their sum passes through $f$]

• shortcut mapping: $h$ = identity
• after-add mapping: $f$ = ReLU

On identity mappings for optimization

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

$x_{l+1} = f(h(x_l) + F(x_l))$

• shortcut mapping: $h$ = identity
• after-add mapping: $f$ = ReLU
• What if $f$ = identity?

Very smooth forward propagation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

With $h$ and $f$ both identity:

$x_{l+1} = x_l + F(x_l)$

$x_{l+2} = x_{l+1} + F(x_{l+1}) = x_l + F(x_l) + F(x_{l+1})$

$x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$

[Diagram: the 34-layer ResNet, with $x_l$ at a lower layer connected directly to $x_L$ at any higher layer through the chain of identity shortcuts]

• Any $x_l$ is directly forward-propagated to any $x_L$, plus residual: $x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$
• Any $x_L$ is an additive outcome
• in contrast to multiplicative: $x_L = \prod_{i=l}^{L-1} W_i \, x_l$
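A tiny numerical check of this unrolled form (NumPy; the linear residual branches and the widths are purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 8, 20
Ws = [rng.normal(0.0, 0.1, size=(n, n)) for _ in range(L)]
F = lambda i, x: Ws[i] @ x            # residual branch of unit i (linear here)

x = rng.standard_normal(n)
xs = [x]
for i in range(L):                    # x_{l+1} = x_l + F(x_l), with h = f = identity
    xs.append(xs[-1] + F(i, xs[-1]))

# x_L equals x_0 plus the sum of all residual-branch outputs:
unrolled = xs[0] + sum(F(i, xs[i]) for i in range(L))
print(np.allclose(xs[-1], unrolled))  # True
```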

Very smooth backward propagation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

$x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$

[Diagram: the 34-layer ResNet, with the gradient $\frac{\partial E}{\partial x_L}$ flowing directly back to $\frac{\partial E}{\partial x_l}$ through the identity shortcuts]

$\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \frac{\partial x_L}{\partial x_l} = \frac{\partial E}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i) \right)$

Very smooth backward propagation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

$\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i) \right)$

• Any $\frac{\partial E}{\partial x_L}$ is directly back-propagated to any $\frac{\partial E}{\partial x_l}$, plus residual
• Any $\frac{\partial E}{\partial x_l}$ is additive; unlikely to vanish
• in contrast to multiplicative: $\frac{\partial E}{\partial x_l} = \prod_{i=l}^{L-1} W_i \, \frac{\partial E}{\partial x_L}$

Residual for every layer

forward: $x_L = x_l + \sum_{i=l}^{L-1} F(x_i)$

backward: $\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \left( 1 + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} F(x_i) \right)$

Enabled by:
• shortcut mapping: $h$ = identity
• after-add mapping: $f$ = identity

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

Experiments

• Set 1: what if the shortcut mapping $h \neq$ identity?

• Set 2: what if the after-add mapping $f$ is identity?

• Experiments on ResNets with more than 100 layers
  • deeper models suffer more from optimization difficulty

Experiment Set 1: what if the shortcut mapping $h \neq$ identity?

[Diagrams: six variants of the shortcut in a 3x3-conv residual unit, with CIFAR-10 errors for ResNet-110:
(a) original: $h(x) = x$, error 6.6%
(b) constant scaling (0.5 on both paths): $h(x) = 0.5x$, error 12.4%
(c) exclusive gating* (1x1 conv + sigmoid gate on both paths): $h(x) =$ gate $\cdot x$, error 8.7%
(d) shortcut-only gating: $h(x) =$ gate $\cdot x$, error 12.9%
(e) conv shortcut: $h(x) =$ conv$(x)$, error 12.2%
(f) dropout shortcut: $h(x) =$ dropout$(x)$, error >20%]

In (b)–(f), the shortcuts are blocked by multiplications.

*similar to "Highway Network"
*ResNet-110 on CIFAR-10

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.


If $h$ is multiplicative, e.g. $h(x) = \lambda x$:

forward: $x_L = \lambda^{L-l} x_l + \sum_{i=l}^{L-1} \hat{F}(x_i)$

backward: $\frac{\partial E}{\partial x_l} = \frac{\partial E}{\partial x_L} \left( \lambda^{L-l} + \frac{\partial}{\partial x_l} \sum_{i=l}^{L-1} \hat{F}(x_i) \right)$

• if $h$ is multiplicative, shortcuts are blocked
• direct propagation is decayed

*assuming $f$ = identity

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.
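The decay of the direct path is easy to see numerically (plain Python; $\lambda = 0.9$ and the unit counts are illustrative):

```python
# Direct-path coefficient lambda^(L - l) for a scaled shortcut h(x) = 0.9 * x.
lam = 0.9
for depth in (10, 54, 100):     # number of units between layer l and layer L
    print(depth, lam ** depth)  # ~0.35, ~0.0034, ~2.7e-05: the shortcut signal decays

# With h = identity (lambda = 1), the coefficient stays exactly 1 at any depth.
```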

[Diagrams: a residual unit where $h$ is identity vs. one where $h$ is gating (1x1 conv + sigmoid); plot of error vs. iteration, solid: test, dashed: train; the gated version is worse]

• gating should have better representation ability (identity is a special case), but
• optimization difficulty dominates the results

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

Experiment Set 2: what if the after-add mapping $f$ is identity?

[Diagrams: three residual-unit designs;
$f$ is ReLU (original ResNet): weight → BN → ReLU → weight → BN → addition → ReLU;
$f$ is BN + ReLU: weight → BN → ReLU → weight → addition → BN → ReLU;
$f$ is identity (pre-activation ResNet): BN → ReLU → weight → BN → ReLU → weight → addition]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.
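A minimal sketch contrasting the original and pre-activation orderings (NumPy; `bn` here is a toy stand-in for batch normalization, simplified to per-feature standardization without learned scale/shift):

```python
import numpy as np

relu = lambda z: np.maximum(0.0, z)
bn = lambda z: (z - z.mean(axis=0)) / (z.std(axis=0) + 1e-5)  # toy BN

def original_unit(x, W1, W2):
    # f = ReLU: weight -> BN -> ReLU -> weight -> BN -> addition -> ReLU
    out = bn(x @ W1.T)
    out = bn(relu(out) @ W2.T)
    return relu(x + out)          # the after-add ReLU sits on the shortcut path

def preact_unit(x, W1, W2):
    # f = identity: BN -> ReLU -> weight -> BN -> ReLU -> weight -> addition
    out = relu(bn(x)) @ W1.T
    out = relu(bn(out)) @ W2.T
    return x + out                # the shortcut path stays a pure identity

rng = np.random.default_rng(0)
W1, W2 = rng.normal(0, 0.1, (64, 64)), rng.normal(0, 0.1, (64, 64))
x = rng.standard_normal((32, 64))
print(original_unit(x, W1, W2).shape, preact_unit(x, W1, W2).shape)
```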

[Diagrams: $f$ = ReLU vs. $f$ = BN + ReLU residual units; plot of error vs. iteration, solid: test, dashed: train; the BN-after-addition variant is worse]

• BN could block propagation
• Keep the shortest path as smooth as possible

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

[Diagrams: $f$ = ReLU vs. $f$ = identity (pre-activation) residual units; plot for 1001-layer ResNets on CIFAR-10, solid: test, dashed: train; the pre-activation version trains and generalizes better]

• ReLU could block propagation when there are 1000 layers
• the pre-activation design eases optimization (and improves generalization; see paper)

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

Comparisons on CIFAR-10/100

CIFAR-10:
method | error (%)
NIN | 8.81
DSN | 8.22
FitNet | 8.39
Highway | 7.72
ResNet-110 (1.7M) | 6.61
ResNet-1202 (19.4M) | 7.93
ResNet-164, pre-activation (1.7M) | 5.46
ResNet-1001, pre-activation (10.2M) | 4.92 (4.89±0.14)

CIFAR-100:
method | error (%)
NIN | 35.68
DSN | 34.57
FitNet | 35.04
Highway | 32.39
ResNet-164 (1.7M) | 25.16
ResNet-1001 (10.2M) | 27.82
ResNet-164, pre-activation (1.7M) | 24.33
ResNet-1001, pre-activation (10.2M) | 22.71 (22.68±0.22)

*all based on moderate augmentation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

ImageNet Experiments

ImageNet single-crop (320x320) val error:
method | data augmentation | top-1 error (%) | top-5 error (%)
ResNet-152, original | scale | 21.3 | 5.5
ResNet-152, pre-activation | scale | 21.1 | 5.5
ResNet-200, original | scale | 21.8 | 6.0
ResNet-200, pre-activation | scale | 20.7 | 5.3
ResNet-200, pre-activation | scale + aspect ratio | 20.1* | 4.8*

*independently reproduced by: https://github.com/facebook/fb.resnet.torch/tree/master/pretrained#notes

Training code and models available.

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

Summary of observations

• Keep the shortest path as smooth as possible
  • by making $h$ and $f$ identity
  • forward/backward signals directly flow through this path

• Features of any layers are additive outcomes

• 1000-layer ResNets can be easily trained and have better accuracy

[Diagram: the pre-activation residual unit; BN → ReLU → weight → BN → ReLU → weight → addition]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

Future Works

• Representation
  • skipping 1 layer vs. multiple layers?
  • flat vs. bottleneck?
  • Inception-ResNet [Szegedy et al 2016]
  • ResNet in ResNet [Targ et al 2016]
  • width vs. depth [Zagoruyko & Komodakis 2016]

• Generalization
  • DropOut, MaxOut, DropConnect, …
  • DropLayer (Stochastic Depth) [Huang et al 2016]

• Optimization
  • without residual/shortcut?

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

Applications

"Features matter."

"Features matter." (quote from [Girshick et al. 2014], the R-CNN paper)

task | 2nd-place winner | ResNets | margin (relative)
ImageNet Localization (top-5 error) | 12.0 | 9.0 | 27%
ImageNet Detection (mAP@.5) | 53.6 | 62.1 | 16%
COCO Detection (mAP@.5:.95) | 33.5 | 37.3 | 11%
COCO Segmentation (mAP@.5:.95) | 25.1 | 28.2 | 12%

absolute 8.5% better!

• Our results are all based on ResNet-101
• Deeper features are well transferrable

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Revolution of Depth

[Chart: PASCAL VOC 2007 Object Detection mAP (%). HOG, DPM (shallow): 34; AlexNet (RCNN, 8 layers): 58; VGG (RCNN, 16 layers): 66; ResNet (Faster RCNN*, 101 layers): 86]

*w/ other improvements & more data

Engines of visual recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.

Deep Learning for Computer Vision

[Diagram: a backbone structure is pre-trained as a classification network on ImageNet data; its features are then fine-tuned on target data as a detection network (e.g. R-CNN), a segmentation network (e.g. FCN), a human pose estimation network, a depth estimation network, …]

Example: Object Detection

[Images: Image Classification (what?) vs. Object Detection (what + where?), e.g. boat, person]

Object Detection: R-CNN

[Diagram: region-based CNN pipeline; input image → ~2,000 region proposals → 1 CNN for each warped region → classify regions (aeroplane? no. … person? yes. … tv monitor? no.); figure credit: R. Girshick et al.]

Girshick, Donahue, Darrell, Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014.

Object Detection: R-CNN

• R-CNN

[Diagram: pre-computed Regions-of-Interest (RoIs); each RoI of the image is fed through its own CNN to produce a feature]

Girshick, Donahue, Darrell, Malik. Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation. CVPR 2014.

Object Detection: Fast R-CNN

• Fast R-CNN

[Diagram: the image passes once through shared conv layers to a feature map; pre-computed RoIs are pooled from it (RoI pooling) into per-region features; end-to-end training]

Girshick. Fast R-CNN. ICCV 2015.

Object Detection: Faster R-CNN

• Faster R-CNN
  • solely based on CNNs
  • no external modules
  • each step is end-to-end

[Diagram: image → CNN → feature map → Region Proposal Net → proposals → RoI pooling → features; end-to-end training]

Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks". NIPS 2015.

Object Detection

[Diagram: a backbone structure (classification network pre-trained on ImageNet data) is fine-tuned on detection data as a detection network]

"plug-in" features:
• AlexNet
• VGG-16
• GoogleNet
• ResNet-101
• …

detectors (independently developed):
• R-CNN
• Fast R-CNN
• Faster R-CNN
• MultiBox
• SSD
• …

Object Detection

• Simply "Faster R-CNN + ResNet"

[Diagram: image → CNN → feature map → Region Proposal Net → proposals → RoI pooling → classifier]

COCO detection results:
Faster R-CNN baseline | mAP@.5 | mAP@.5:.95
VGG-16 | 41.5 | 21.5
ResNet-101 | 48.4 | 27.2

ResNet-101 has a 28% relative gain vs. VGG-16

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks". NIPS 2015.

Object Detection

• RPN learns proposals by extremely deep nets
  • we use only 300 proposals (no hand-designed proposals)

• Added components:
  • iterative localization
  • context modeling
  • multi-scale testing

• All are based on CNN features; all are end-to-end

• All benefit more from deeper features – cumulative gains!

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks". NIPS 2015.

[Images: ResNet's object detection results on COCO. *The original images are from the COCO dataset.]

[Video: results on real video; models trained on MS COCO (80 categories); frame-by-frame, no temporal processing. This video is available online: https://youtu.be/WZmSMkK9VuA]

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". arXiv 2015.
Shaoqing Ren, Kaiming He, Ross Girshick, & Jian Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks". NIPS 2015.

More Visual Recognition Tasks

ResNet-based methods lead on these benchmarks (incomplete list):
• ImageNet classification, detection, localization
• MS COCO detection, segmentation
• PASCAL VOC detection, segmentation

• Human pose estimation [Newell et al 2016]
• Depth estimation [Laina et al 2016]
• Segment proposal [Pinheiro et al 2016]
• …

[Screenshots: PASCAL detection and segmentation leaderboards, topped by ResNet-101 entries]

Potential Applications

ResNets have shown outstanding or promising results on:

• Visual Recognition
• Image Generation (Pixel RNN, Neural Art, etc.)
• Natural Language Processing (very deep CNN)
• Speech Recognition (preliminary results)
• Advertising, user prediction (preliminary results)

Conclusions of the Tutorial

• Deep Residual Learning:
  • ultra-deep networks can be easy to train
  • ultra-deep networks can gain accuracy from depth
  • ultra-deep representations are well transferrable
  • now 200 layers on ImageNet and 1000 layers on CIFAR!

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.

Resources

• Models and Code
  • Our ImageNet models in Caffe: https://github.com/KaimingHe/deep-residual-networks
  • Many available implementations (list in https://github.com/KaimingHe/deep-residual-networks)
  • Facebook AI Research's Torch ResNet: https://github.com/facebook/fb.resnet.torch
  • Torch, CIFAR-10, with ResNet-20 to ResNet-110, training code, and curves: code
  • Lasagne, CIFAR-10, with ResNet-32 and ResNet-56 and training code: code
  • Neon, CIFAR-10, with pre-trained ResNet-32 to ResNet-110 models, training code, and curves: code
  • Torch, MNIST, 100 layers: blog, code
  • A winning entry in Kaggle's right whale recognition challenge: blog, code
  • Neon, Place2 (mini), 40 layers: blog, code
  • …

Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Deep Residual Learning for Image Recognition". CVPR 2016.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, & Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv 2016.