
Regularization of Neural Networks using DropConnect

Li Wan (wanli@cs.nyu.edu), Matthew Zeiler (zeiler@cs.nyu.edu), Sixin Zhang (zsx@cs.nyu.edu), Yann LeCun (yann@cs.nyu.edu), Rob Fergus (fergus@cs.nyu.edu)

Dept. of Computer Science, Courant Institute of Mathematical Science, New York University

Abstract

We introduce DropConnect, a generalization of Dropout (Hinton et al., 2012), for regularizing large fully-connected layers within neural networks. When training with Dropout, a randomly selected subset of activations are set to zero within each layer. DropConnect instead sets a randomly selected subset of weights within the network to zero. Each unit thus receives input from a random subset of units in the previous layer. We derive a bound on the generalization performance of both Dropout and DropConnect. We then evaluate DropConnect on a range of datasets, comparing to Dropout, and show state-of-the-art results on several image recognition benchmarks by aggregating multiple DropConnect-trained models.

1. Introduction

Neural network (NN) models are well suited to domains where large labeled datasets are available, since their capacity can easily be increased by adding more layers or more units in each layer. However, big networks with millions or billions of parameters can easily overfit even the largest of datasets. Correspondingly, a wide range of techniques for regularizing NNs have been developed. Adding an ℓ2 penalty on the network weights is one simple but effective approach. Other forms of regularization include: Bayesian methods (Mackay, 1995), weight elimination (Weigend et al., 1991) and early stopping of training. In practice, using these techniques when training big networks gives superior test performance to smaller networks trained without regularization.

Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28. Copyright 2013 by the author(s).

Recently, Hinton et al. proposed a new form of regularization called Dropout (Hinton et al., 2012). For each training example, forward propagation involves randomly deleting half the activations in each layer. The error is then backpropagated only through the remaining activations. Extensive experiments show that this significantly reduces over-fitting and improves test performance. Although a full understanding of its mechanism is elusive, the intuition is that it prevents the network weights from collaborating with one another to memorize the training examples.

In this paper, we propose DropConnect, which generalizes Dropout by randomly dropping the weights rather than the activations. Like Dropout, the technique is suitable for fully connected layers only. We compare and contrast the two methods on four different image datasets.

2. Motivation

To demonstrate our method we consider a fully connected layer of a neural network with input v = [v_1, v_2, ..., v_n]^T and weight parameters W (of size d × n). The output of this layer, r = [r_1, r_2, ..., r_d]^T, is computed as a matrix multiply between the input vector and the weight matrix followed by a non-linear activation function, a (biases are included in W with a corresponding fixed input of 1 for simplicity):

r = a(u) = a(Wv)   (1)

2.1. Dropout

Dropout was proposed by (Hinton et al., 2012) as a form of regularization for fully connected neural network layers. Each element of a layer's output is kept with probability p, otherwise being set to 0 with probability (1 − p). Extensive experiments show that Dropout improves the network's generalization ability, giving improved test performance.

[Figure 1 diagram: Input x → Feature extractor g(x; W_g) → Features v (n × 1) → DropConnect weights W (d × n), masked by M → u (d × 1) → Activation function a(u) → Outputs r (d × 1) → Softmax layer s(r; W_s) → Predictions o (k × 1).]

Figure 1. (a) An example model layout for a single DropConnect layer. After running feature extractor g() on input x, a random instantiation of the mask M (e.g. (b)) masks out the weight matrix W. The masked weights are multiplied with this feature vector to produce u, which is the input to an activation function a and a softmax layer s. For comparison, (c) shows an effective weight mask for elements that Dropout uses when applied to the previous layer's output (red columns) and this layer's output (green rows). Note the lack of structure in (b) compared to (c).

When Dropout is applied to the outputs of a fully connected layer, we can write Eqn. 1 as

r = m ⊙ a(Wv)   (2)

where ⊙ denotes element-wise product and m is a binary mask vector of size d, with each element j drawn independently from m_j ∼ Bernoulli(p).

Many commonly used activation functions such as tanh, centered sigmoid and relu (Nair and Hinton, 2010) have the property that a(0) = 0. Thus Eqn. 2 could be re-written as r = a(m ⊙ Wv), where Dropout is applied at the inputs to the activation function.
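To make the masking in Eqn. 2 concrete, the sketch below implements the training-time Dropout forward pass for a single example in NumPy. It is illustrative only: the relu activation and the function name are our own choices, and inference-time averaging is handled separately (Section 3.2).

import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(W, v, p=0.5):
    # Training-time Dropout for one example (Eqn. 2): r = m * a(Wv).
    # W: (d, n) weights with biases folded in, v: (n,) input, p: keep probability.
    a = lambda u: np.maximum(u, 0.0)           # relu, so a(0) = 0
    m = rng.binomial(1, p, size=W.shape[0])    # one Bernoulli(p) bit per output unit
    return m * a(W @ v)                        # zero a random subset of activations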

2.2. DropConnect

DropConnect is the generalization of Dropout in which each connection, rather than each output unit, can be dropped with probability 1 − p. DropConnect is similar to Dropout as it introduces dynamic sparsity within the model, but differs in that the sparsity is on the weights W, rather than the output vectors of a layer. In other words, the fully connected layer with DropConnect becomes a sparsely connected layer in which the connections are chosen at random during the training stage. Note that this is not equivalent to setting W to be a fixed sparse matrix during training.

For a DropConnect layer, the output is given as

r = a((M ⊙ W) v)   (3)

where M is a binary matrix encoding the connection information and M_ij ∼ Bernoulli(p). Each element of the mask M is drawn independently for each example during training, essentially instantiating a different connectivity for each example seen. Additionally, the biases are also masked out during training. From Eqn. 2 and Eqn. 3, it is evident that DropConnect is the generalization of Dropout to the full connection structure of a layer.¹
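The corresponding DropConnect forward pass differs only in where the mask is applied. A minimal NumPy sketch, again with relu assumed and a fresh mask drawn per example, is:

import numpy as np

rng = np.random.default_rng(0)

def dropconnect_forward(W, v, p=0.5):
    # Training-time DropConnect for one example (Eqn. 3): r = a((M * W) v).
    # A Bernoulli(p) bit is drawn for every weight (and bias), so each unit sees
    # a random subset of its inputs rather than having its whole output zeroed.
    a = lambda u: np.maximum(u, 0.0)           # relu
    M = rng.binomial(1, p, size=W.shape)       # one mask bit per connection
    return a((M * W) @ v)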

The paper structure is as follows: we outline details on training and running inference in a model using DropConnect in Section 3, followed by theoretical justification for DropConnect in Section 4, GPU implementation specifics in Section 5, and experimental results in Section 6.

3. Model Description

We consider a standard model architecture composed of four basic components (see Fig. 1a):

1. Feature Extractor: v = g(x; W_g), where v are the output features, x is input data to the overall model, and W_g are parameters for the feature extractor. We choose g() to be a multi-layered convolutional neural network (CNN) (LeCun et al., 1998), with W_g being the convolutional filters (and biases) of the CNN.

2. DropConnect Layer: r = a(u) = a((M ⊙ W)v), where v is the output of the feature extractor, W is a fully connected weight matrix, a is a non-linear activation function and M is the binary mask matrix.

3. Softmax Classification Layer: o = s(r; W_s) takes as input r and uses parameters W_s to map this to a k-dimensional output (k being the number of classes).

4. Cross Entropy Loss: A(y, o) = −Σ_{i=1}^{k} y_i log(o_i) takes probabilities o and the ground truth labels y as input.

¹ This holds when a(0) = 0, as is the case for the tanh and relu functions.


The overall model f(x; θ, M) therefore maps input data x to an output o through a sequence of operations given the parameters θ = {W_g, W, W_s} and randomly-drawn mask M. The correct value of o is obtained by summing out over all possible masks M:

o = E_M[f(x; θ, M)] = Σ_M p(M) f(x; θ, M)   (4)

This reveals the mixture model interpretation of DropConnect (and Dropout), where the output is a mixture of 2^|M| different networks, each with weight p(M). If p = 0.5, then these weights are equal and o = (1/|M|) Σ_M f(x; θ, M) = (1/|M|) Σ_M s(a((M ⊙ W)v); W_s).
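As a toy illustration of Eqn. 4 (our own example, not from the paper), one can enumerate every mask of a very small DropConnect layer and verify that, for p = 0.5, the mixture output is the plain average over all masked sub-networks; the softmax layer is omitted for brevity:

import itertools
import numpy as np

W = np.array([[0.5, -1.0],
              [2.0,  0.3]])       # tiny d x n = 2 x 2 layer
v = np.array([1.0, -0.5])
a = lambda u: np.maximum(u, 0.0)  # relu

outputs = []
for bits in itertools.product([0, 1], repeat=W.size):   # all 2^|M| = 16 masks
    M = np.array(bits, dtype=float).reshape(W.shape)
    outputs.append(a((M * W) @ v))
o = np.mean(outputs, axis=0)      # equal weights p(M) = 1/16 when p = 0.5
print(o)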

3.1. Training

Training the model described in Section 3 begins by selecting an example x from the training set and extracting features for that example, v. These features are input to the DropConnect layer, where a mask matrix M is first drawn from a Bernoulli(p) distribution to mask out elements of both the weight matrix and the biases in the DropConnect layer. A key component to successfully training with DropConnect is the selection of a different mask for each training example. Selecting a single mask for a subset of training examples, such as a mini-batch of 128 examples, does not regularize the model enough in practice. Since the memory requirement for the M's now grows with the size of each mini-batch, the implementation needs to be carefully designed as described in Section 5.

Once a mask is chosen, it is applied to the weights and biases in order to compute the input to the activation function. This results in r, the input to the softmax layer, which outputs class predictions from which cross entropy between the ground truth labels is computed. The parameters throughout the model, θ, then can be updated via stochastic gradient descent (SGD) by backpropagating gradients of the loss function with respect to the parameters, A′_θ. To update the weight matrix W in a DropConnect layer, the mask is applied to the gradient to update only those elements that were active in the forward pass. Additionally, when passing gradients down to the feature extractor, the masked weight matrix M ⊙ W is used. A summary of these steps is provided in Algorithm 1.

3.2. Inference

At inference time, we need to compute r = (1/|M|) Σ_M a((M ⊙ W)v), which naively requires the evaluation of 2^|M| different masks: plainly infeasible.

The Dropout work (Hinton et al., 2012) made the approximation Σ_M a((M ⊙ W)v) ≈ a(Σ_M (M ⊙ W)v), i.e. averaging before the activation rather than after.

Algorithm 1 SGD Training with DropConnect

Input: example x, parameters θ_{t−1} from step t−1, learning rate η
Output: updated parameters θ_t
Forward Pass:
  Extract features: v ← g(x; W_g)
  Random sample M mask: M_ij ∼ Bernoulli(p)
  Compute activations: r = a((M ⊙ W)v)
  Compute output: o = s(r; W_s)
Backpropagate Gradients:
  Differentiate loss A′_θ with respect to parameters θ:
    Update softmax layer: W_s = W_s − η A′_{W_s}
    Update DropConnect layer: W = W − η (M ⊙ A′_W)
    Update feature extractor: W_g = W_g − η A′_{W_g}
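A minimal NumPy sketch of one such update is given below, assuming a relu DropConnect layer feeding a softmax with cross-entropy loss; the feature-extractor update and mini-batching are omitted, and all names are ours rather than the authors' implementation:

import numpy as np

rng = np.random.default_rng(0)

def dropconnect_sgd_step(W, Ws, v, y, p=0.5, lr=0.01):
    # One step in the spirit of Algorithm 1 for a single example.
    # W: (d, n) DropConnect weights, Ws: (k, d) softmax weights,
    # v: (n,) features, y: integer class label.
    M = rng.binomial(1, p, size=W.shape).astype(W.dtype)
    u = (M * W) @ v
    r = np.maximum(u, 0.0)                             # relu
    logits = Ws @ r
    o = np.exp(logits - logits.max()); o /= o.sum()    # softmax probabilities

    d_logits = o.copy(); d_logits[y] -= 1.0            # gradient of cross entropy
    dWs = np.outer(d_logits, r)
    d_u = (Ws.T @ d_logits) * (u > 0)                  # back through relu
    dW = np.outer(d_u, v)
    d_v = (M * W).T @ d_u                              # masked weights used going down

    Ws -= lr * dWs
    W  -= lr * (M * dW)                                # only masked-in weights change
    return W, Ws, d_v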

Algorithm 2 Inference with DropConnect

Input: example x, parameters θ, # of samples Z
Output: prediction u
Extract features: v ← g(x; W_g)
Moment matching of u: μ ← E_M[u], σ² ← V_M[u]
for z = 1 ... Z do  {Draw Z samples}
  for i = 1 ... d do  {Loop over units in r}
    Sample from 1D Gaussian: u_{i,z} ∼ N(μ_i, σ_i²)
    r_{i,z} ← a(u_{i,z})
  end for
end for
Pass result r = Σ_{z=1}^{Z} r_z / Z to next layer

Although this seems to work in practice, it is not justified mathematically, particularly for the relu activation function.²

We take a different approach. Consider a single unit u_i before the activation function a(): u_i = Σ_j (W_ij v_j) M_ij. This is a weighted sum of Bernoulli variables M_ij, which can be approximated by a Gaussian via moment matching. The mean and variance of the units u are E_M[u] = pWv and V_M[u] = p(1 − p)(W ⊙ W)(v ⊙ v). We can then draw samples from this Gaussian and pass them through the activation function a() before averaging them and presenting them to the next layer. Algorithm 2 summarizes the method. Note that the sampling can be done efficiently, since the samples for each unit and example can be drawn in parallel. This scheme is only an approximation in the case of a multi-layer network; it works well in practice, as shown in Experiments.

² Consider u ∼ N(0, 1), with a(u) = max(u, 0): a(E_M(u)) = 0, but E_M(a(u)) = 1/√(2π) ≈ 0.4.
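A NumPy sketch of this moment-matching inference (Algorithm 2), with relu assumed for a(), is:

import numpy as np

rng = np.random.default_rng(0)

def dropconnect_inference(W, v, p=0.5, Z=1000):
    # Approximate r = E_M[a((M * W) v)] by sampling u from a Gaussian whose
    # mean and variance match the Bernoulli mixture, then averaging a(u).
    mu = p * (W @ v)                                  # E_M[u]
    var = p * (1 - p) * ((W * W) @ (v * v))           # V_M[u]
    u = rng.normal(mu, np.sqrt(var), size=(Z, W.shape[0]))
    return np.maximum(u, 0.0).mean(axis=0)            # average after the activation

# Footnote 2, numerically: for u ~ N(0, 1) and relu, a(E[u]) = 0 while
# np.maximum(rng.normal(0, 1, 10**6), 0).mean() is about 1/sqrt(2*pi) ≈ 0.40.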


Implementation | Mask Weight | fprop (ms) | bprop acts (ms) | bprop weights (ms) | total (ms) | Speedup
CPU | float | 480.2 | 1228.6 | 1692.8 | 3401.6 | 1.0×
CPU | bit | 392.3 | 679.1 | 759.7 | 1831.1 | 1.9×
GPU | float (global memory) | 21.6 | 6.2 | 7.2 | 35.0 | 97.2×
GPU | float (tex1D memory) | 15.1 | 6.1 | 6.0 | 27.2 | 126.0×
GPU | bit (tex2D aligned memory) | 2.4 | 2.7 | 3.1 | 8.2 | 414.8×
GPU (Lower Bound) | cuBlas + read mask weight | 0.3 | 0.3 | 0.2 | 0.8 | n/a

Table 1. Performance comparison between different implementations of our DropConnect layer on an NVidia GTX580 GPU relative to a 2.67Ghz Intel Xeon (compiled with -O3 flag). Input dimension and output dimension are 1024 and mini-batch size is 128. As a reference we provide traditional matrix multiplication using the cuBlas library.

4. Model Generalization Bound

We now show a novel bound for the Rademacher complexity of the model, R_ℓ(F), on the training set (see appendix for derivation):

R_ℓ(F) ≤ p (2√(kd) B_s n √d B_h) R_ℓ(G)   (5)

where max|W_s| ≤ B_s, max|W| ≤ B_h, k is the number of classes, R_ℓ(G) is the Rademacher complexity of the feature extractor, and n and d are the dimensionality of the input and output of the DropConnect layer respectively. The important result from Eqn. 5 is that the complexity is a linear function of the probability p of an element being kept in DropConnect or Dropout. When p = 0, the model complexity is zero, since the input has no influence on the output. When p = 1, it returns to the complexity of a standard model.

5. Implementation Details

Our system involves three components implemented on a GPU: 1) a feature extractor, 2) our DropConnect layer, and 3) a softmax classification layer. For 1 and 3 we utilize the Cuda-convnet package (Krizhevsky, 2012), a fast GPU based convolutional network library. We implement a custom GPU kernel for performing the operations within the DropConnect layer. Our code is available at http://cs.nyu.edu/~wanli/dropc.

A typical fully connected layer is implemented as a matrix-matrix multiplication between the input vectors for a mini-batch of training examples and the weight matrix. The difficulty in our case is that each training example requires its own random mask matrix applied to the weights and biases of the DropConnect layer. This leads to several complications:

1. For a weight matrix of size d × n, the corresponding mask matrix is of size d × n × b, where b is the size of the mini-batch. For a 4096 × 4096 fully connected layer with mini-batch size of 128, the matrix would be too large to fit into GPU memory if each element is stored as a floating point number, requiring 8G of memory.

2. Once a random instantiation of the mask is created, it is non-trivial to access all the elements required during the matrix multiplications so as to maximize performance.

The first problem is not hard to address. Each element of the mask matrix is stored as a single bit to encode the connectivity information, rather than as a float. The memory cost is thus reduced by 32 times, which becomes 256M for the example above. This not only reduces the memory footprint, but also reduces the bandwidth required, as 32 elements can be accessed with each 4-byte read. We overcome the second problem using an efficient memory access pattern using 2D texture aligned memory. These two improvements are crucial for an efficient GPU implementation of DropConnect, as shown in Table 1. Here we compare to a naive CPU implementation with floating point masks and get a 415× speedup with our efficient GPU design.
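The memory saving from bit-packed masks can be sketched in NumPy (this illustrates only the storage arithmetic, not the custom GPU kernel; dimensions are scaled down so the demo runs quickly):

import numpy as np

rng = np.random.default_rng(0)

# For the 4096 x 4096 layer with mini-batch 128 cited above, float32 masks need
# 4096 * 4096 * 128 * 4 bytes, roughly 8 GB, while one bit per element is ~256 MB.
d, n, b = 1024, 1024, 8
masks = rng.binomial(1, 0.5, size=(b, d, n)).astype(np.uint8)

packed = np.packbits(masks, axis=-1)               # 8 mask bits per byte
print(masks.astype(np.float32).nbytes / 2**20, "MB as float32")
print(packed.nbytes / 2**20, "MB packed")          # 32x smaller

mask0 = np.unpackbits(packed[0], axis=-1, count=n) # recover one example's mask
assert np.array_equal(mask0, masks[0])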

6. Experiments

We evaluate our DropConnect model for regularizing deep neural networks trained for image classification. All experiments use mini-batch SGD with momentum on batches of 128 images, with the momentum parameter fixed at 0.9.

We use the following protocol for all experiments, unless otherwise stated:

• Augment the dataset by: 1) randomly selecting cropped regions from the images, 2) flipping images horizontally, 3) introducing 15% scaling and rotation variations.
• Train 5 independent networks with random permutations of the training sequence.
• Manually decrease the learning rate if the network stops improving as in (Krizhevsky, 2012), according to a schedule determined on a validation set.
• Train the fully connected layer using Dropout, DropConnect, or neither (No-Drop).
• At inference time for DropConnect, we draw Z = 1000 samples at the inputs to the activation function of the fully connected layer and average their activations.

To anneal the initial learning rate, we choose a fixed multiplier for different stages of training. We report three numbers of epochs, such as 600-400-200, to define our schedule. We multiply the initial rate by 1 for the first such number of epochs. Then we use a multiplier of 0.5 for the second number of epochs, followed by 0.1 again for this second number of epochs. The third number of epochs is used for multipliers of 0.05, 0.01, 0.005 and 0.001 in that order, after which point we report our results. We determine the epochs to use for our schedule using a validation set to look for plateaus in the loss function, at which point we move to the next multiplier.³
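Read literally, the schedule can be expressed as the following multiplier function (our own reading of the text; epochs are 0-indexed and training normally stops before the last stage ends):

def lr_multiplier(epoch, schedule=(600, 400, 200)):
    # Multiplier 1 for the first a epochs, then 0.5 and 0.1 for b epochs each,
    # then 0.05, 0.01, 0.005, 0.001 for c epochs each.
    a, b, c = schedule
    stages = [(a, 1.0), (b, 0.5), (b, 0.1),
              (c, 0.05), (c, 0.01), (c, 0.005), (c, 0.001)]
    for length, mult in stages:
        if epoch < length:
            return mult
        epoch -= length
    return stages[-1][1]

# Example: with a 600-400-200 schedule the rate drops to 0.5x at epoch 600 and
# to 0.1x at epoch 1000; the applied rate is initial_rate * lr_multiplier(epoch).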

Once the 5 networks are trained, we report two numbers: 1) the mean and standard deviation of the classification errors produced by each of the 5 independent networks, and 2) the classification error that results when averaging the output probabilities from the 5 networks before making a prediction. We find in practice this voting scheme, inspired by (Ciresan et al., 2012), provides significant performance gains, achieving state-of-the-art results in many standard benchmarks when combined with our DropConnect layer.

6.1. MNIST

The MNIST handwritten digit classification task (LeCun et al., 1998) consists of 28 × 28 black and white images, each containing a digit 0 to 9 (10 classes). Each digit in the 60,000 training images and 10,000 test images is normalized to fit in a 20 × 20 pixel box while preserving their aspect ratio. We scale the pixel values to the [0, 1] range before inputting to our models.

For our first experiment on this dataset, we train models with two fully connected layers, each with 800 output units, using either tanh, sigmoid or relu activation functions to compare to Dropout in (Hinton et al., 2012). The first layer takes the image pixels as input, while the second layer's output is fed into a 10-class softmax classification layer. In Table 2 we show the performance of various activation functions, comparing No-Drop, Dropout and DropConnect in the fully connected layers. No data augmentation is utilized in this experiment. We use an initial learning rate of 0.1 and train for 600-400-20 epochs using our schedule.

neuron | model | error (%) | 5 network voting error (%)
relu | No-Drop | 1.62 ± 0.037 | 1.40
relu | Dropout | 1.28 ± 0.040 | 1.20
relu | DropConnect | 1.20 ± 0.034 | 1.12
sigmoid | No-Drop | 1.78 ± 0.037 | 1.74
sigmoid | Dropout | 1.38 ± 0.039 | 1.36
sigmoid | DropConnect | 1.55 ± 0.046 | 1.48
tanh | No-Drop | 1.65 ± 0.026 | 1.49
tanh | Dropout | 1.58 ± 0.053 | 1.55
tanh | DropConnect | 1.36 ± 0.054 | 1.35

Table 2. MNIST classification error rate for models with two fully connected layers of 800 neurons each. No data augmentation is used in this experiment.

³ In all experiments the bias learning rate is 2× the learning rate for the weights. Additionally, weights are initialized with N(0, 0.1) random values for fully connected layers and N(0, 0.01) for convolutional layers.

From Table 2 we can see that both Dropout and DropConnect perform better than not using either method. DropConnect mostly performs better than Dropout in this task, with the gap widening when utilizing the voting over the 5 models.

To further analyze the effects of DropConnect, we show three explanatory experiments in Fig. 2, using a 2-layer fully connected model on MNIST digits. Fig. 2a shows test performance as the number of hidden units in each layer varies. As the model size increases, No-Drop overfits while both Dropout and DropConnect improve performance. DropConnect consistently gives a lower error rate than Dropout. Fig. 2b shows the effect of varying the drop rate p for Dropout and DropConnect for a 400-400 unit network. Both methods give optimal performance in the vicinity of 0.5, the value used in all other experiments in the paper. Our sampling approach gives a performance gain over mean inference (as used by Hinton (Hinton et al., 2012)), but only for the DropConnect case. In Fig. 2c we plot the convergence properties of the three methods throughout training on a 400-400 network. We can see that No-Drop overfits quickly, while Dropout and DropConnect converge slowly to ultimately give superior test performance. DropConnect is even slower to converge than Dropout, but yields a lower test error in the end.

In order to improve our classification result, we choose a more powerful feature extractor network described in (Ciresan et al., 2012) (relu is used rather than tanh). This feature extractor consists of a 2 layer CNN with 32-64 feature maps in each layer respectively. The last layer's output is treated as input to the fully connected layer which has 150 relu units, on which No-Drop, Dropout or DropConnect are applied. We report results in Table 3 from training the network on a) the original MNIST digits, b) cropped 24 × 24 images from random locations, and c) rotated and scaled versions of these cropped images.

[Figure 2: three panels on MNIST. (a) Test error (%) vs. number of hidden units (200 to 1600) for No-Drop, Dropout and DropConnect. (b) Test error (%) vs. % of elements kept (0 to 0.9) for Dropout and DropConnect, with mean and sampling inference. (c) Train and test cross entropy vs. epoch for No-Drop, Dropout and DropConnect.]

Figure 2. Using the MNIST dataset, in a) we analyze the ability of Dropout and DropConnect to prevent overfitting as the size of the 2 fully connected layers increase. b) Varying the drop-rate in a 400-400 network shows near optimal performance around the p = 0.5 proposed by (Hinton et al., 2012). c) we show the convergence properties of the train/test sets. See text for discussion.

We use an initial learning rate of 0.01 with a 700-200-100 epoch schedule, no momentum, and preprocess by subtracting the image mean.

crop | rotation/scaling | model | error (%) | 5 network voting error (%)
no | no | No-Drop | 0.77 ± 0.051 | 0.67
no | no | Dropout | 0.59 ± 0.039 | 0.52
no | no | DropConnect | 0.63 ± 0.035 | 0.57
yes | no | No-Drop | 0.50 ± 0.098 | 0.38
yes | no | Dropout | 0.39 ± 0.039 | 0.35
yes | no | DropConnect | 0.39 ± 0.047 | 0.32
yes | yes | No-Drop | 0.30 ± 0.035 | 0.21
yes | yes | Dropout | 0.28 ± 0.016 | 0.27
yes | yes | DropConnect | 0.28 ± 0.032 | 0.21

Table 3. MNIST classification error. Previous state of the art is 0.47% (Zeiler and Fergus, 2013) for a single model without elastic distortions and 0.23% with elastic distortions and voting (Ciresan et al., 2012).

We note that our approach surpasses the state-of-the-art result of 0.23% (Ciresan et al., 2012), achieving a 0.21% error rate without the use of elastic distortions (as used by (Ciresan et al., 2012)).

6.2. CIFAR-10

CIFAR-10 is a data set of natural 32x32 RGB images (Krizhevsky, 2009) in 10 classes, with 50,000 images for training and 10,000 for testing. Before inputting these images to our network, we subtract the per-pixel mean computed over the training set from each image.

The first experiment on CIFAR-10 (summarized in Table 4) uses the simple convolutional network feature extractor described in (Krizhevsky, 2012) (layers-80sec.cfg) that is designed for rapid training rather than optimal performance. On top of the 3-layer feature extractor we have a 64 unit fully connected layer, which uses No-Drop, Dropout or DropConnect. No data augmentation is utilized for this experiment. Since this experiment is not aimed at optimal performance, we report a single model's performance without voting. We train for 150-0-0 epochs with an initial learning rate of 0.001 and their default weight decay. DropConnect prevents overfitting of the fully connected layer better than Dropout in this experiment.

model | error (%)
No-Drop | 23.5
Dropout | 19.7
DropConnect | 18.7

Table 4. CIFAR-10 classification error using the simple feature extractor described in (Krizhevsky, 2012) (layers-80sec.cfg) and with no data augmentation.

Table 5 shows classification results of the network using a larger feature extractor with 2 convolutional layers and 2 locally connected layers, as described in (Krizhevsky, 2012) (layers-conv-local-11pct.cfg). A 128 neuron fully connected layer with relu activations is added between the softmax layer and feature extractor. Following (Krizhevsky, 2012), images are cropped to 24x24 with horizontal flips, and no rotation or scaling is performed. We use an initial learning rate of 0.001 and train for 700-300-50 epochs with their default weight decay. Model voting significantly improves performance when using Dropout or DropConnect, the latter reaching an error rate of 9.41%. Additionally, we trained a model with 12 networks with DropConnect and achieved a state-of-the-art result of 9.32%, indicating the power of our approach.

6.3. SVHN

The Street View House Numbers (SVHN) dataset includes 604,388 images (both training set and extra set) and 26,032 testing images (Netzer et al., 2011). Similar to MNIST, the goal is to classify the digit centered in each 32x32 RGB image.

model | error (%) | 5 network voting error (%)
No-Drop | 11.18 ± 0.13 | 10.22
Dropout | 11.52 ± 0.18 | 9.83
DropConnect | 11.10 ± 0.13 | 9.41

Table 5. CIFAR-10 classification error using a larger feature extractor. Previous state-of-the-art is 9.5% (Snoek et al., 2012). Voting with 12 DropConnect networks produces an error rate of 9.32%, significantly beating the state-of-the-art.

Due to the large variety of colors and brightness variations in the images, we preprocess the images using local contrast normalization, as in (Zeiler and Fergus, 2013). The feature extractor is the same as the larger CIFAR-10 experiment, but we instead use a larger 512 unit fully connected layer with relu activations between the softmax layer and the feature extractor. After contrast normalizing, the training data is randomly cropped to 28 × 28 pixels and is rotated and scaled. We do not do horizontal flips. Table 6 shows the classification performance for 5 models trained with an initial learning rate of 0.001 for a 100-50-10 epoch schedule.
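The exact normalization of (Zeiler and Fergus, 2013) is not reproduced here; a common variant, sketched below, subtracts a Gaussian-weighted local mean and divides by the local standard deviation (the kernel width and epsilon are our own choices):

import numpy as np
from scipy.ndimage import gaussian_filter

def local_contrast_normalize(img, sigma=2.0, eps=1e-4):
    # Per-channel local contrast normalization for an H x W x C image.
    out = np.empty_like(img, dtype=np.float64)
    for c in range(img.shape[2]):
        x = img[:, :, c].astype(np.float64)
        centered = x - gaussian_filter(x, sigma)              # remove local mean
        local_std = np.sqrt(gaussian_filter(centered ** 2, sigma))
        out[:, :, c] = centered / np.maximum(local_std, eps)  # divide by local std
    return out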

Due to the large training set size, both Dropout and DropConnect achieve nearly the same performance as No-Drop. However, using our data augmentation techniques and careful annealing, the per model scores easily surpass the previous 2.80% state-of-the-art result of (Zeiler and Fergus, 2013). Furthermore, our voting scheme reduces the relative error of the previous state-of-the-art by 30% to achieve 1.94% error.

model | error (%) | 5 network voting error (%)
No-Drop | 2.26 ± 0.072 | 1.94
Dropout | 2.25 ± 0.034 | 1.96
DropConnect | 2.23 ± 0.039 | 1.94

Table 6. SVHN classification error. The previous state-of-the-art is 2.8% (Zeiler and Fergus, 2013).

6.4. NORB

In the final experiment, we evaluate our models on the 2-fold NORB (jittered-cluttered) dataset (LeCun et al., 2004), a collection of stereo images of 3D models. For each image, one of 6 classes appears on a random background. We train on 2 folds of 29,160 images each and test on a total of 58,320 images. The images are downsampled from 108 × 108 to 48 × 48, as in (Ciresan et al., 2012).

We use the same feature extractor as the larger CIFAR-10 experiment. There is a 512 unit fully connected layer with relu activations placed between the softmax layer and feature extractor. Rotation and scaling of the training data is applied, but we do not crop or flip the images, as we found that to hurt performance on this dataset.

model | error (%) | 5 network voting error (%)
No-Drop | 4.48 ± 0.78 | 3.36
Dropout | 3.96 ± 0.16 | 3.03
DropConnect | 4.14 ± 0.06 | 3.23

Table 7. NORB classification error for the jittered-cluttered dataset, using 2 training folds. The previous state-of-the-art is 3.57% (Ciresan et al., 2012).

We trained with an initial learning rate of 0.001 and anneal for 100-40-10 epochs.

In this experiment we beat the previous state-of-the-art result of 3.57% using No-Drop, Dropout and DropConnect with our voting scheme. While Dropout surpasses DropConnect slightly, both methods improve over No-Drop in this benchmark, as shown in Table 7.

7. Discussion

We have presented DropConnect, which generalizes Hinton et al.'s Dropout (Hinton et al., 2012) to the entire connectivity structure of a fully connected neural network layer. We provide both theoretical justification and empirical results to show that DropConnect helps regularize large neural network models. Results on a range of datasets show that DropConnect often outperforms Dropout. While our current implementation of DropConnect is slightly slower than No-Drop or Dropout, in large models the feature extractor is the bottleneck, thus there is little difference in overall training time. DropConnect allows us to train large models while avoiding overfitting. This yields state-of-the-art results on a variety of standard benchmarks using our efficient GPU implementation of DropConnect.

8. Appendix

8.1. Preliminaries

Definition 1 (DropConnect Network). Given a dataset S with ℓ entries x_1, x_2, ..., x_ℓ and labels y_1, y_2, ..., y_ℓ, we define the DropConnect network as a mixture model:

o = Σ_M p(M) f(x; θ, M) = E_M[f(x; θ, M)]

Each network f(x; θ, M) has weight p(M), and the network parameters are θ = {W_s, W, W_g}: W_s are the softmax layer parameters, W are the DropConnect layer parameters, and W_g are the feature extractor parameters. Furthermore, M is the DropConnect layer mask.

Now we reformulate the cross-entropy loss on top of the softmax into a single parameter function that combines the softmax output and labels, as a logistic loss.


Definition 2 (Logistic Loss). The following loss function, defined on k-class classification, is called the logistic loss function:

A_y(o) = −Σ_i y_i ln( exp(o_i) / Σ_j exp(o_j) ) = −o_i + ln Σ_j exp(o_j)

where y is a binary vector with the i-th bit set on.

Lemma 1. The logistic loss function A has the following properties: 1) A_y(0) = ln k, 2) −1 ≤ A′_y(o) ≤ 1, and 3) A″_y(o) ≥ 0.

Definition 3 (Rademacher complexity). For a sample S = {x_1, ..., x_ℓ} generated by a distribution D on a set X, and a real-valued function class F with domain X, the empirical Rademacher complexity of F is the random variable

R̂_ℓ(F) = E_σ[ sup_{f∈F} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i f(x_i) |  |  x_1, ..., x_ℓ ]

where σ = {σ_1, ..., σ_ℓ} are independent uniform ±1-valued (Rademacher) random variables. The Rademacher complexity of F is R_ℓ(F) = E_S[R̂_ℓ(F)].

8.2. Bound Derivation

Lemma 2 ((Ledoux and Talagrand, 1991)). Let F be a class of real functions and H = [F_j]_{j=1}^{k} be a k-dimensional function class. If A: R^k → R is a Lipschitz function with constant L and satisfies A(0) = 0, then R_ℓ(A ∘ H) ≤ 2kL R_ℓ(F).

Lemma 3 (Classifier Generalization Bound). The generalization bound of a k-class classifier with logistic loss function is directly related to the Rademacher complexity of that classifier:

E[A_y(o)] ≤ (1/ℓ) Σ_{i=1}^{ℓ} A_{y_i}(o_i) + 2k R_ℓ(F) + 3 √(ln(2/δ) / (2ℓ))

Lemma 4. For all neuron activations sigmoid, tanh and relu, we have R_ℓ(a ∘ F) ≤ 2 R_ℓ(F).

Lemma 5 (Network Layer Bound). Let G be the class of real functions R^d → R with input dimension F, i.e. G = [F_j]_{j=1}^{d}, and let H_B be a linear transform function parametrized by W with ‖W‖_2 ≤ B. Then R_ℓ(H ∘ G) ≤ √d B R_ℓ(F).

Proof.
R_ℓ(H ∘ G) = E_σ[ sup_{h∈H, g∈G} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i h ∘ g(x_i) | ]
= E_σ[ sup_{g∈G, ‖W‖≤B} | ⟨ W, (2/ℓ) Σ_{i=1}^{ℓ} σ_i g(x_i) ⟩ | ]
≤ B E_σ[ sup_{f_j∈F} ‖ [ (2/ℓ) Σ_{i=1}^{ℓ} σ_i^j f_j(x_i) ]_{j=1}^{d} ‖ ]
= B √d E_σ[ sup_{f∈F} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i f(x_i) | ] = √d B R_ℓ(F)

Remark 1. Given a layer in our network, we denote the function of all layers before it as G = [F_j]_{j=1}^{d}. This layer has the linear transformation function H and activation function a. By Lemma 4 and Lemma 5, we know the network complexity is bounded by

R_ℓ(H ∘ G) ≤ c √d B R_ℓ(F)

where c = 1 for the identity neuron and c = 2 for others.

Lemma 6. Let F_M be the class of real functions that depend on M. Then R_ℓ(E_M[F_M]) ≤ E_M[R_ℓ(F_M)].

Proof. R_ℓ(E_M[F_M]) = R_ℓ(Σ_M p(M) F_M) ≤ Σ_M R_ℓ(p(M) F_M) ≤ Σ_M |p(M)| R_ℓ(F_M) = E_M[R_ℓ(F_M)].

Theorem 1 (DropConnect Network Complexity). Consider the DropConnect neural network defined in Definition 1. Let R_ℓ(G) be the empirical Rademacher complexity of the feature extractor and R_ℓ(F) be the empirical Rademacher complexity of the whole network. In addition, we assume:

1. the weight parameters of the DropConnect layer satisfy |W| ≤ B_h;
2. the weight parameters of s satisfy |W_s| ≤ B_s (so the L2-norm of W_s is bounded by √(dk) B_s).

Then we have R_ℓ(F) ≤ p (2√(kd) B_s n √d B_h) R_ℓ(G).

Proof.

R_ℓ(F) = R_ℓ(E_M[f(x; θ, M)]) ≤ E_M[ R_ℓ(f(x; θ, M)) ]   (6)
≤ (√(dk) B_s) √d E_M[ R_ℓ(a ∘ h_M ∘ g) ]   (7)
= 2√(kd) B_s E_M[ R_ℓ(h_M ∘ g) ]   (8)

where h_M = (M ⊙ W)v. Equation (6) is based on Lemma 6, Equation (7) is based on Lemma 5, and Equation (8) follows from Lemma 4.

E_M[ R_ℓ(h_M ∘ g) ] = E_{M,σ}[ sup_{h∈H, g∈G} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i W^T D_M g(x_i) | ]   (9)
= E_{M,σ}[ sup_{h∈H, g∈G} | ⟨ D_M W, (2/ℓ) Σ_{i=1}^{ℓ} σ_i g(x_i) ⟩ | ]
≤ E_M[ max_W ‖D_M W‖ ] E_σ[ sup_{g_j∈G} ‖ [ (2/ℓ) Σ_{i=1}^{ℓ} σ_i g_j(x_i) ]_{j=1}^{n} ‖ ]   (10)
≤ B_h p √(nd) (√n R_ℓ(G)) = p n √d B_h R_ℓ(G)

where D_M in Equation (9) is a diagonal matrix with diagonal elements equal to m, and inner product properties lead to Equation (10). Thus we have R_ℓ(F) ≤ p (2√(kd) B_s n √d B_h) R_ℓ(G).


References

D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), CVPR '12, pages 3642–3649, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-1-4673-1226-4.

G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.

A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master's thesis, University of Toronto, 2009.

A. Krizhevsky. cuda-convnet. http://code.google.com/p/cuda-convnet/, 2012.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov 1998. ISSN 0018-9219. doi: 10.1109/5.726791.

Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR '04, pages 97–104, Washington, DC, USA, 2004. IEEE Computer Society.

M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, New York, 1991.

D. J. C. Mackay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. In Bayesian Methods for Backpropagation Networks. Springer, 1995.

V. Nair and G. E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML, 2010.

Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.

J. Snoek, H. Larochelle, and R. A. Adams. Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.

A. S. Weigend, D. E. Rumelhart, and B. A. Huberman. Generalization by weight-elimination with application to forecasting. In NIPS, 1991.

M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In ICLR, 2013.


Regularization of Neural Networks using DropConnect: Supplementary Material

1. Preliminaries

Definition 1 (DropConnect Network). Given a dataset S with ℓ entries x_1, x_2, ..., x_ℓ and labels y_1, y_2, ..., y_ℓ, we define the DropConnect network as a mixture model:

o = Σ_M p(M) f(x; θ, M) = E_M[f(x; θ, M)]   (1)

Each network f(x; θ, M) has weight p(M), and the network parameters are θ = {W_s, W, W_g}: W_s are the softmax layer parameters, W are the DropConnect layer parameters, and W_g are the feature extractor parameters. Furthermore, M is the DropConnect layer mask.

Remark 1. When each element of M has equal probability of being on and off (p = 0.5), the mixture model has equal weights for all sub-models f(x; θ, M); otherwise the mixture model has larger weights in some sub-models than others.

We reformulate the cross-entropy loss on top of the softmax into a single parameter function that combines the softmax output and labels, the same as a logistic loss.

Definition 2 (Logistic Loss). The following loss function, defined on k-class classification, is called the logistic loss function:

A_y(o) = −Σ_i y_i ln( exp(o_i) / Σ_j exp(o_j) ) = −o_i + ln Σ_j exp(o_j)

where y is a binary vector with the i-th bit set on.

Lemma 1. The logistic loss function A has the following properties:

1. A_y(0) = ln k
2. −1 ≤ A′_y(o) ≤ 1
3. A″_y(o) ≥ 0

The first property says that A(0) is a constant that depends only on the number of labels. The second says that A is a Lipschitz function with L = 1. The third says that A is a convex function with respect to o.

Definition 3 (Rademacher complexity). For a sample S = {x_1, ..., x_ℓ} generated by a distribution D on a set X, and a real-valued function class F with domain X, the empirical Rademacher complexity of F is the random variable

R̂_ℓ(F) = E_σ[ sup_{f∈F} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i f(x_i) |  |  x_1, ..., x_ℓ ]

where σ = {σ_1, ..., σ_ℓ} are independent uniform ±1-valued (Rademacher) random variables. The Rademacher complexity of F is R_ℓ(F) = E_S[R̂_ℓ(F)].

Theorem 1 ((Koltchinskii and Panchenko, 2000)). Fix δ ∈ (0, 1) and let F be a class of functions mapping from M to [0, 1]. Let (M_i)_{i=1}^{ℓ} be drawn independently according to a probability distribution D. Then with probability at least 1 − δ over random draws of samples of size ℓ, every f ∈ F satisfies

E[f(M)] ≤ Ê[f(M)] + R_ℓ(F) + √(ln(2/δ) / (2ℓ))
        ≤ Ê[f(M)] + R̂_ℓ(F) + 3 √(ln(2/δ) / (2ℓ))

2. Bound Derivation

Theorem 2 ((Ledoux and Talagrand, 1991)). Let F be a class of real functions. If A: R → R is Lipschitz with constant L and satisfies A(0) = 0, then R_ℓ(A ∘ F) ≤ 2L R_ℓ(F).

Lemma 2. Let F be a class of real functions and H = [F_j]_{j=1}^{k} be a k-dimensional function class. If A: R^k → R is a Lipschitz function with constant L and satisfies A(0) = 0, then R_ℓ(A ∘ H) ≤ 2kL R_ℓ(F).

Lemma 3 (Classifier Generalization Bound). The generalization bound of a k-class classifier with logistic loss function is directly related to the Rademacher complexity of that classifier:

E[A_y(o)] ≤ (1/ℓ) Σ_{i=1}^{ℓ} A_{y_i}(o_i) + 2k R_ℓ(F) + 3 √(ln(2/δ) / (2ℓ))

Proof. From Lemma 1, the shifted logistic loss (A − c)(x) satisfies (A − c)′(x) ≤ 1 and (A − c)(0) = 0 for some constant c. By Lemma 2, R_ℓ((A − c) ∘ F) ≤ 2k R_ℓ(F).


Lemma 4. For all neuron activations sigmoid, tanh and relu, we have R_ℓ(a ∘ F) ≤ 2 R_ℓ(F).

Lemma 5 (Network Layer Bound). Let G be the class of real functions R^d → R with input dimension F, i.e. G = [F_j]_{j=1}^{d}, and let H_B be a linear transform function parameterized by W with ‖W‖_2 ≤ B. Then R_ℓ(H ∘ G) ≤ √d B R_ℓ(F).

Proof.

R_ℓ(H ∘ G) = E_σ[ sup_{h∈H, g∈G} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i h ∘ g(x_i) | ]
= E_σ[ sup_{g∈G, ‖W‖≤B} | ⟨ W, (2/ℓ) Σ_{i=1}^{ℓ} σ_i g(x_i) ⟩ | ]
≤ B E_σ[ sup_{f_j∈F} ‖ [ (2/ℓ) Σ_{i=1}^{ℓ} σ_i^j f_j(x_i) ]_{j=1}^{d} ‖ ]
= B √d E_σ[ sup_{f∈F} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i f(x_i) | ] = √d B R_ℓ(F)

Remark 2. Given a layer in our network, we denote the function of all layers before it as G = [F_j]_{j=1}^{d}. This layer has the linear transformation function H and activation function a. By Lemma 4 and Lemma 5, we know the network complexity is bounded by

R_ℓ(H ∘ G) ≤ c √d B R_ℓ(F)

where c = 1 for the identity neuron and c = 2 for others.

Lemma 6. Let F_M be the class of real functions that depend on M. Then R_ℓ(E_M[F_M]) ≤ E_M[R_ℓ(F_M)].

Proof.

R_ℓ(E_M[F_M]) = R_ℓ(Σ_M p(M) F_M) ≤ Σ_M R_ℓ(p(M) F_M) ≤ Σ_M |p(M)| R_ℓ(F_M) = E_M[R_ℓ(F_M)]

because of the common facts that 1) R_ℓ(cF) = |c| R_ℓ(F) and 2) R_ℓ(Σ_i F_i) ≤ Σ_i R_ℓ(F_i).

Theorem 3 (DropConnect Network Complexity). Consider the DropConnect neural network defined in Definition 1. Let R_ℓ(G) be the empirical Rademacher complexity of the feature extractor and R_ℓ(F) be the empirical Rademacher complexity of the whole network. In addition, we assume:

1. the weight parameters of the DropConnect layer satisfy |W| ≤ B_h;
2. the weight parameters of s satisfy |W_s| ≤ B_s (so the L2-norm of W_s is bounded by √(dk) B_s).

Then we have

R_ℓ(F) ≤ p (2√(kd) B_s n √d B_h) R_ℓ(G)

Proof.

R_ℓ(F) = R_ℓ(E_M[f(x; θ, M)]) ≤ E_M[ R_ℓ(f(x; θ, M)) ]   (2)
= E_M[ R_ℓ(s ∘ a ∘ h_M ∘ g) ] ≤ (√(dk) B_s) √d E_M[ R_ℓ(a ∘ h_M ∘ g) ]   (3)
= 2√(kd) B_s E_M[ R_ℓ(h_M ∘ g) ]   (4)

where h_M = (M ⊙ W)v. Equation (2) is based on Lemma 6, Equation (3) is based on Lemma 5, and Equation (4) follows from Lemma 4.

E_M[ R_ℓ(h_M ∘ g) ] = E_{M,σ}[ sup_{h∈H, g∈G} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i W^T D_M g(x_i) | ]   (5)
= E_{M,σ}[ sup_{h∈H, g∈G} | ⟨ D_M W, (2/ℓ) Σ_{i=1}^{ℓ} σ_i g(x_i) ⟩ | ]
≤ E_M[ max_W ‖D_M W‖ ] E_σ[ sup_{g_j∈G} ‖ [ (2/ℓ) Σ_{i=1}^{ℓ} σ_i g_j(x_i) ]_{j=1}^{n} ‖ ]   (6)
≤ B_h p √(nd) (√n R_ℓ(G)) = p n √d B_h R_ℓ(G)

where D_M in Equation (5) is a diagonal matrix with diagonal elements equal to m, and inner product properties lead to Equation (6). Thus we have

R_ℓ(F) ≤ p (2√(kd) B_s n √d B_h) R_ℓ(G)

Remark 3. Theorem 3 implies that p is an additional regularizer that we add to the network when we convert a normal neural network into a network with DropConnect layers. Consider the following extreme cases:

1. p = 0: the network generalization bound equals 0, which is true because the classifier no longer depends on the input.
2. p = 1: the bound reduces to that of the normal network.


Symbol | Description | Related Formula
y | Data label; can be either an integer label or a bit vector (depends on context) |
x | Network input data |
g() | Feature extractor function with parameter W_g |
v | Feature extractor network output | v = g(x; W_g)
M | DropConnect connection information parameter (weight mask) |
h() | DropConnect transformation function with parameters W, M |
u | DropConnect output | u = h(v; W, M)
a() | DropConnect activation function |
r | DropConnect output after activation | r = a(u)
s() | Dimension reduction layer function with parameter W_s |
o | Dimension reduction layer output (network output) | o = s(r; W_s)
θ | All parameters of the network except the weight mask | θ = {W_s, W, W_g}
f() | Overall classifier (network) output | o = f(x; θ, M)
λ | Weight penalty |
A() | Data loss function | A(o − y)
L() | Overall objective function | L(x, y) = Σ_i A(o_i − y_i) + (1/2) λ ‖W‖_2^2
n | Dimension of feature extractor output |
d | Dimension of DropConnect layer output |
k | Number of classes | dim(y) = k

Table 1. Symbol Table

References

V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, New York, 1991.

Page 2: Regularization of Neural Networks using …yann.lecun.com/exdb/publis/pdf/wan-icml-13.pdfRegularization of Neural Networks using DropConnect DropConnect weights W (d x n) b) DropConnect

Regularization of Neural Networks using DropConnect

DropConnect weights

W (d x n)

b) DropConnect mask M

Features v (n x 1)

u (d x 1)

a) Model Layout

Activation function

a(u)

Outputs r (d x 1)

Feature extractor g(xWg)

Input x

Softmax layer

s(rWs)

Predictions o (k x 1)

c) Effective Dropout mask Mrsquo

Previous layer mask

Cur

rent

laye

r out

put m

ask

Figure 1 (a) An example model layout for a single DropConnect layer After running feature extractor g() on input x arandom instantiation of the mask M (eg (b)) masks out the weight matrix W The masked weights are multiplied withthis feature vector to produce u which is the input to an activation function a and a softmax layer s For comparison (c)shows an effective weight mask for elements that Dropout uses when applied to the previous layerrsquos output (red columns)and this layerrsquos output (green rows) Note the lack of structure in (b) compared to (c)

nected layer we can write Eqn 1 as

r = m a(Wv) (2)

where denotes element wise product and m is a bi-nary mask vector of size d with each element j drawnindependently from mj sim Bernoulli(p)

Many commonly used activation functions such astanh centered sigmoid and relu (Nair and Hinton2010) have the property that a(0) = 0 Thus Eqn 2could be re-written as r = a(mWv) where Dropoutis applied at the inputs to the activation function

22 DropConnect

DropConnect is the generalization of Dropout in whicheach connection rather than each output unit canbe dropped with probability 1 minus p DropConnect issimilar to Dropout as it introduces dynamic sparsitywithin the model but differs in that the sparsity ison the weights W rather than the output vectors of alayer In other words the fully connected layer withDropConnect becomes a sparsely connected layer inwhich the connections are chosen at random duringthe training stage Note that this is not equivalent tosetting W to be a fixed sparse matrix during training

For a DropConnect layer the output is given as

r = a ((M W ) v) (3)

where M is a binary matrix encoding the connectioninformation and Mij sim Bernoulli(p) Each elementof the mask M is drawn independently for each exam-ple during training essentially instantiating a differ-ent connectivity for each example seen Additionally

the biases are also masked out during training FromEqn 2 and Eqn 3 it is evident that DropConnect isthe generalization of Dropout to the full connectionstructure of a layer1

The paper structure is as follows we outline details ontraining and running inference in a model using Drop-Connect in section 3 followed by theoretical justifica-tion for DropConnect in section 4 GPU implementa-tion specifics in section 5 and experimental results insection 6

3 Model Description

We consider a standard model architecture composedof four basic components (see Fig 1a)

1 Feature Extractor v = g(xWg) where v are the out-put features x is input data to the overall modeland Wg are parameters for the feature extractor Wechoose g() to be a multi-layered convolutional neuralnetwork (CNN) (LeCun et al 1998) with Wg beingthe convolutional filters (and biases) of the CNN

2 DropConnect Layer r = a(u) = a((M W )v) wherev is the output of the feature extractor W is a fullyconnected weight matrix a is a non-linear activationfunction and M is the binary mask matrix

3 Softmax Classification Layer o = s(rWs) takes asinput r and uses parameters Ws to map this to a kdimensional output (k being the number of classes)

4 Cross Entropy Loss A(y o) = minussumki=1 yilog(oi) takes

probabilities o and the ground truth labels y as input

1This holds when a(0) = 0 as is the case for tanh andrelu functions

Regularization of Neural Networks using DropConnect

The overall model f(x θM) therefore maps inputdata x to an output o through a sequence of operationsgiven the parameters θ = WgWWs and randomly-drawn mask M The correct value of o is obtained bysumming out over all possible masks M

o = EM [f(x θM)] =sumM

p(M)f(x θM) (4)

This reveals the mixture model interpretation of Drop-Connect (and Dropout) where the output is a mixtureof 2|M | different networks each with weight p(M)If p = 05 then these weights are equal and o =1|M |

sumM f(x θM) = 1

|M |sumM s(a((M W )v)Ws)

31 Training

Training the model described in Section 3 begins byselecting an example x from the training set and ex-tracting features for that example v These featuresare input to the DropConnect layer where a mask ma-trix M is first drawn from a Bernoulli(p) distributionto mask out elements of both the weight matrix andthe biases in the DropConnect layer A key compo-nent to successfully training with DropConnect is theselection of a different mask for each training exam-ple Selecting a single mask for a subset of trainingexamples such as a mini-batch of 128 examples doesnot regularize the model enough in practice Since thememory requirement for the M rsquos now grows with thesize of each mini-batch the implementation needs tobe carefully designed as described in Section 5

Once a mask is chosen it is applied to the weights andbiases in order to compute the input to the activa-tion function This results in r the input to the soft-max layer which outputs class predictions from whichcross entropy between the ground truth labels is com-puted The parameters throughout the model θ thencan be updated via stochastic gradient descent (SGD)by backpropagating gradients of the loss function withrespect to the parameters Aprimeθ To update the weightmatrix W in a DropConnect layer the mask is ap-plied to the gradient to update only those elementsthat were active in the forward pass Additionallywhen passing gradients down to the feature extractorthe masked weight matrix M W is used A summaryof these steps is provided in Algorithm 1

32 Inference

At inference time we need to compute r =1|M |

sumM a((M W )v) which naively requires the

evaluation of 2|M | different masks ndash plainly infeasible

The Dropout work (Hinton et al 2012) made the ap-proximation

sumM a((M W )v) asymp a(

sumM (M W )v)

Algorithm 1 SGD Training with DropConnect

Input example x parameters θtminus1 from step tminus1learning rate ηOutput updated parameters θtForward PassExtract features v larr g(xWg)Random sample M mask Mij sim Bernoulli(p)Compute activations r = a((M W )v)Compute output o = s(rWs)Backpropagate GradientsDifferentiate loss Aprimeθ with respect to parameters θUpdate softmax layer Ws = Ws minus ηAprimeWs

Update DropConnect layer W = W minus η(M AprimeW )Update feature extractor Wg = Wg minus ηAprimeWg

Algorithm 2 Inference with DropConnect

Input example x parameters θ of samples ZOutput prediction uExtract features v larr g(xWg)Moment matching of umicrolarr EM [u] σ2 larr VM [u]

for z = 1 Z do Draw Z samplesfor i = 1 d do Loop over units in r

Sample from 1D Gaussian uiz sim N (microi σ2i )

riz larr a(uiz)end for

end forPass result r =

sumZz=1 rzZ to next layer

ie averaging before the activation rather than afterAlthough this seems to work in practice it is not jus-tified mathematically particularly for the relu activa-tion function2

We take a different approach Consider a singleunit ui before the activation function a() ui =sumj(Wijvj)Mij This is a weighted sum of Bernoulli

variables Mij which can be approximated by a Gaus-sian via moment matching The mean and varianceof the units u are EM [u] = pWv and VM [u] =p(1 minus p)(W W )(v v) We can then draw samplesfrom this Gaussian and pass them through the activa-tion function a() before averaging them and present-ing them to the next layer Algorithm 2 summarizesthe method Note that the sampling can be done ef-ficiently since the samples for each unit and exam-ple can be drawn in parallel This scheme is only anapproximation in the case of multi-layer network itworks well in practise as shown in Experiments

2Consider u sim N(0 1) with a(u) = max(u 0)

a(EM (u)) = 0 but EM (a(u)) = 1radic

2π asymp 04

Regularization of Neural Networks using DropConnect

Implementation Mask Weight Time(ms) Speedupfprop bprop acts bprop weights total

CPU float 4802 12286 16928 34016 10 timesCPU bit 3923 6791 7597 18311 19 timesGPU float(global memory) 216 62 72 350 972 timesGPU float(tex1D memory) 151 61 60 272 1260 timesGPU bit(tex2D aligned memory) 24 27 31 82 4148 timesGPU(Lower Bound) cuBlas + read mask weight 03 03 02 08

Table 1 Performance comparison between different implementations of our DropConnect layer on NVidia GTX580 GPUrelative to a 267Ghz Intel Xeon (compiled with -O3 flag) Input dimension and Output dimension are 1024 and mini-batchsize is 128 As reference we provide traditional matrix multiplication using the cuBlas library

4 Model Generalization BoundWe now show a novel bound for the Rademacher com-plexity of the model R`(F) on the training set (seeappendix for derivation)

R`(F) le p(

2radickdBsn

radicdBh

)R`(G) (5)

where max|Ws| le Bs max|W | le B k is the num-ber of classes R`(G) is the Rademacher complexity ofthe feature extractor n and d are the dimensionalityof the input and output of the DropConnect layer re-spectively The important result from Eqn 5 is thatthe complexity is a linear function of the probability pof an element being kept in DropConnect or DropoutWhen p = 0 the model complexity is zero since theinput has no influence on the output When p = 1 itreturns to the complexity of a standard model

5. Implementation Details

Our system involves three components implemented on a GPU: 1) a feature extractor, 2) our DropConnect layer, and 3) a softmax classification layer. For 1 and 3 we utilize the Cuda-convnet package (Krizhevsky, 2012), a fast GPU based convolutional network library. We implement a custom GPU kernel for performing the operations within the DropConnect layer. Our code is available at http://cs.nyu.edu/~wanli/dropc.

A typical fully connected layer is implemented as a matrix-matrix multiplication between the input vectors for a mini-batch of training examples and the weight matrix. The difficulty in our case is that each training example requires its own random mask matrix applied to the weights and biases of the DropConnect layer. This leads to several complications:

1. For a weight matrix of size d × n, the corresponding mask matrix is of size d × n × b, where b is the size of the mini-batch. For a 4096 × 4096 fully connected layer with a mini-batch size of 128, the matrix would be too large to fit into GPU memory if each element were stored as a floating point number, requiring 8G of memory.

2. Once a random instantiation of the mask is created, it is non-trivial to access all the elements required during the matrix multiplications so as to maximize performance.

The first problem is not hard to address. Each element of the mask matrix is stored as a single bit to encode the connectivity information, rather than as a float. The memory cost is thus reduced by 32 times, which becomes 256M for the example above. This not only reduces the memory footprint, but also reduces the bandwidth required, as 32 elements can be accessed with each 4-byte read. We overcome the second problem using an efficient memory access pattern based on 2D texture aligned memory. These two improvements are crucial for an efficient GPU implementation of DropConnect, as shown in Table 1. Here we compare to a naive CPU implementation with floating point masks and get a 415× speedup with our efficient GPU design.
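The bit-mask encoding can be mimicked on the CPU with numpy's bit packing; this is only a sketch of the storage idea (one bit per mask element, 32 elements per 4-byte word), not the GPU kernel or its memory access pattern, and the helper names are ours.

```python
# Illustrative sketch of the bit-packed mask storage. A (b, d, n) Bernoulli
# mask stored as bits is ~32x smaller than a float32 mask: for a 4096x4096
# layer with mini-batch 128 this is roughly 256 MB instead of 8 GB.
import numpy as np

def pack_mask(p, b, d, n, rng=np.random.default_rng(0)):
    mask = rng.random((b, d, n)) < p            # one Bernoulli mask per example
    return np.packbits(mask, axis=-1)           # uint8 array, 8 mask bits/byte

def unpack_mask(packed, n):
    return np.unpackbits(packed, axis=-1, count=n).astype(bool)

packed = pack_mask(0.5, b=4, d=64, n=64)
M = unpack_mask(packed, n=64)                   # boolean mask for (M * W) products
```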

6. Experiments

We evaluate our DropConnect model for regularizing deep neural networks trained for image classification. All experiments use mini-batch SGD with momentum on batches of 128 images, with the momentum parameter fixed at 0.9.

We use the following protocol for all experiments, unless otherwise stated:

• Augment the dataset by: 1) randomly selecting cropped regions from the images, 2) flipping images horizontally, 3) introducing 15% scaling and rotation variations (a rough sketch of such an augmentation step follows this list).
• Train 5 independent networks with random permutations of the training sequence.
• Manually decrease the learning rate if the network stops improving, as in (Krizhevsky, 2012), according to a schedule determined on a validation set.
• Train the fully connected layer using Dropout, DropConnect, or neither (No-Drop).
• At inference time for DropConnect, we draw Z = 1000 samples at the inputs to the activation function of the fully connected layer and average their activations.
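The sketch below illustrates one augmentation step; the crop/flip/rotation/scaling ranges and the scipy-based interpolation are assumptions chosen for illustration, not the exact pipeline used in our experiments.

```python
# Sketch of the augmentation protocol: horizontal flip, small rotation and
# scaling (on the order of 15%), then a random crop. Parameters are illustrative.
import numpy as np
from scipy.ndimage import rotate, zoom

def augment(img, crop, rng=np.random.default_rng(0)):
    """img: (H, W) array; returns a (crop, crop) randomly transformed patch."""
    if rng.random() < 0.5:                          # horizontal flip
        img = img[:, ::-1]
    img = rotate(img, rng.uniform(-15, 15), reshape=False, mode="nearest")
    img = zoom(img, rng.uniform(0.85, 1.15), mode="nearest")  # ~15% scaling
    pad_h, pad_w = max(0, crop - img.shape[0]), max(0, crop - img.shape[1])
    if pad_h or pad_w:                              # guard against shrinkage
        img = np.pad(img, ((0, pad_h), (0, pad_w)), mode="edge")
    y = rng.integers(0, img.shape[0] - crop + 1)    # random crop location
    x = rng.integers(0, img.shape[1] - crop + 1)
    return img[y:y + crop, x:x + crop]
```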

To anneal the initial learning rate we choose a fixed multiplier for different stages of training. We report three numbers of epochs, such as 600-400-200, to define our schedule. We multiply the initial rate by 1 for the first such number of epochs. Then we use a multiplier of 0.5 for the second number of epochs, followed by 0.1 again for this second number of epochs. The third number of epochs is used for multipliers of 0.05, 0.01, 0.005 and 0.001, in that order, after which point we report our results. We determine the epochs to use for our schedule using a validation set to look for plateaus in the loss function, at which point we move to the next multiplier.³
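The schedule can be read as a small lookup from epoch to multiplier; the helper below is a hypothetical transcription of the rule just described, for a schedule given as three epoch counts such as (600, 400, 200).

```python
# Hypothetical helper reproducing the annealing rule described above.
def lr_multiplier(epoch, schedule):
    e1, e2, e3 = schedule
    # e1 epochs at 1, e2 at 0.5, another e2 at 0.1, then e3 epochs at each of
    # 0.05, 0.01, 0.005 and 0.001.
    stages = [(e1, 1.0), (e2, 0.5), (e2, 0.1),
              (e3, 0.05), (e3, 0.01), (e3, 0.005), (e3, 0.001)]
    for length, mult in stages:
        if epoch < length:
            return mult
        epoch -= length
    return stages[-1][1]

# e.g. lr = 0.1 * lr_multiplier(epoch, (600, 400, 20)) for the MNIST runs below
```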

Once the 5 networks are trained, we report two numbers: 1) the mean and standard deviation of the classification errors produced by each of the 5 independent networks, and 2) the classification error that results when averaging the output probabilities from the 5 networks before making a prediction. We find in practice that this voting scheme, inspired by (Ciresan et al., 2012), provides significant performance gains, achieving state-of-the-art results in many standard benchmarks when combined with our DropConnect layer.
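The voting step itself is just an average of the networks' softmax outputs before the argmax; a minimal sketch:

```python
# Average the output probabilities of independently trained networks, then
# predict the most likely class (the "voting" scheme reported in the tables).
import numpy as np

def vote(prob_list):
    """prob_list: list of (num_examples, k) softmax outputs, one per network."""
    mean_probs = np.mean(np.stack(prob_list, axis=0), axis=0)
    return np.argmax(mean_probs, axis=1)
```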

6.1. MNIST

The MNIST handwritten digit classification task (LeCun et al., 1998) consists of 28×28 black and white images, each containing a digit 0 to 9 (10 classes). Each digit in the 60,000 training images and 10,000 test images is normalized to fit in a 20×20 pixel box while preserving its aspect ratio. We scale the pixel values to the [0, 1] range before inputting them to our models.

For our first experiment on this dataset, we train models with two fully connected layers, each with 800 output units, using either tanh, sigmoid or relu activation functions to compare to Dropout in (Hinton et al., 2012). The first layer takes the image pixels as input, while the second layer's output is fed into a 10-class softmax classification layer. In Table 2 we show the performance of the various activation functions, comparing No-Drop, Dropout and DropConnect in the fully connected layers. No data augmentation is utilized in this experiment. We use an initial learning rate of 0.1 and train for 600-400-20 epochs using our schedule.

³In all experiments the bias learning rate is 2× the learning rate for the weights. Additionally, weights are initialized with N(0, 0.1) random values for fully connected layers and N(0, 0.01) for convolutional layers.

neuron  | model       | error (%)    | 5 network voting error (%)
relu    | No-Drop     | 1.62 ± 0.037 | 1.40
        | Dropout     | 1.28 ± 0.040 | 1.20
        | DropConnect | 1.20 ± 0.034 | 1.12
sigmoid | No-Drop     | 1.78 ± 0.037 | 1.74
        | Dropout     | 1.38 ± 0.039 | 1.36
        | DropConnect | 1.55 ± 0.046 | 1.48
tanh    | No-Drop     | 1.65 ± 0.026 | 1.49
        | Dropout     | 1.58 ± 0.053 | 1.55
        | DropConnect | 1.36 ± 0.054 | 1.35

Table 2. MNIST classification error rate for models with two fully connected layers of 800 neurons each. No data augmentation is used in this experiment.

From Table 2 we can see that both Dropout and DropConnect perform better than not using either method. DropConnect mostly performs better than Dropout in this task, with the gap widening when utilizing the voting over the 5 models.

To further analyze the effects of DropConnect, we show three explanatory experiments in Fig. 2, using a 2-layer fully connected model on MNIST digits. Fig. 2a shows test performance as the number of hidden units in each layer varies. As the model size increases, No-Drop overfits while both Dropout and DropConnect improve performance. DropConnect consistently gives a lower error rate than Dropout. Fig. 2b shows the effect of varying the drop rate p for Dropout and DropConnect for a 400-400 unit network. Both methods give optimal performance in the vicinity of 0.5, the value used in all other experiments in the paper. Our sampling approach gives a performance gain over mean inference (as used by Hinton (Hinton et al., 2012)), but only for the DropConnect case. In Fig. 2c we plot the convergence properties of the three methods throughout training on a 400-400 network. We can see that No-Drop overfits quickly, while Dropout and DropConnect converge slowly to ultimately give superior test performance. DropConnect is even slower to converge than Dropout, but yields a lower test error in the end.

In order to improve our classification result, we choose a more powerful feature extractor network described in (Ciresan et al., 2012) (relu is used rather than tanh). This feature extractor consists of a 2 layer CNN with 32-64 feature maps in each layer, respectively. The last layer's output is treated as input to the fully connected layer, which has 150 relu units on which No-Drop, Dropout or DropConnect are applied. We report results in Table 3 from training the network on: a) the original MNIST digits, b) cropped 24×24 images from random locations, and c) rotated and scaled versions of these cropped images. We use an initial learning rate of 0.01 with a 700-200-100 epoch schedule, no momentum, and preprocess by subtracting the image mean.


[Figure 2: three panels showing a) Test Error vs. number of Hidden Units for No-Drop, Dropout and DropConnect; b) Test Error vs. % of Elements Kept for Dropout and DropConnect under mean and sampling inference; c) train/test Cross Entropy vs. Epoch for the three methods.]

Figure 2. Using the MNIST dataset, in a) we analyze the ability of Dropout and DropConnect to prevent overfitting as the size of the 2 fully connected layers increases. b) Varying the drop-rate in a 400-400 network shows near optimal performance around the p = 0.5 proposed by (Hinton et al., 2012). c) we show the convergence properties of the train/test sets. See text for discussion.

crop | rotation scaling | model       | error (%)    | 5 network voting error (%)
no   | no               | No-Drop     | 0.77 ± 0.051 | 0.67
     |                  | Dropout     | 0.59 ± 0.039 | 0.52
     |                  | DropConnect | 0.63 ± 0.035 | 0.57
yes  | no               | No-Drop     | 0.50 ± 0.098 | 0.38
     |                  | Dropout     | 0.39 ± 0.039 | 0.35
     |                  | DropConnect | 0.39 ± 0.047 | 0.32
yes  | yes              | No-Drop     | 0.30 ± 0.035 | 0.21
     |                  | Dropout     | 0.28 ± 0.016 | 0.27
     |                  | DropConnect | 0.28 ± 0.032 | 0.21

Table 3. MNIST classification error. Previous state of the art is 0.47% (Zeiler and Fergus, 2013) for a single model without elastic distortions and 0.23% with elastic distortions and voting (Ciresan et al., 2012).

We note that our approach surpasses the state-of-the-art result of 0.23% (Ciresan et al., 2012), achieving a 0.21% error rate without the use of elastic distortions (as used by (Ciresan et al., 2012)).

6.2. CIFAR-10

CIFAR-10 is a data set of natural 32×32 RGB images (Krizhevsky, 2009) in 10 classes, with 50,000 images for training and 10,000 for testing. Before inputting these images to our network, we subtract the per-pixel mean computed over the training set from each image.

The first experiment on CIFAR-10 (summarized in Table 4) uses the simple convolutional network feature extractor described in (Krizhevsky, 2012) (layers-80sec.cfg) that is designed for rapid training rather than optimal performance. On top of the 3-layer feature extractor we have a 64 unit fully connected layer which uses No-Drop, Dropout or DropConnect. No data augmentation is utilized for this experiment. Since this experiment is not aimed at optimal performance, we report a single model's performance without voting. We train for 150-0-0 epochs with an initial learning rate of 0.001 and their default weight decay. DropConnect prevents overfitting of the fully connected layer better than Dropout in this experiment.

model       | error (%)
No-Drop     | 23.5
Dropout     | 19.7
DropConnect | 18.7

Table 4. CIFAR-10 classification error using the simple feature extractor described in (Krizhevsky, 2012) (layers-80sec.cfg) and with no data augmentation.

Table 5 shows classification results of the network using a larger feature extractor with 2 convolutional layers and 2 locally connected layers, as described in (Krizhevsky, 2012) (layers-conv-local-11pct.cfg). A 128 neuron fully connected layer with relu activations is added between the softmax layer and feature extractor. Following (Krizhevsky, 2012), images are cropped to 24×24 with horizontal flips, and no rotation or scaling is performed. We use an initial learning rate of 0.001 and train for 700-300-50 epochs with their default weight decay. Model voting significantly improves performance when using Dropout or DropConnect, the latter reaching an error rate of 9.41%. Additionally, we trained a model with 12 networks with DropConnect and achieved a state-of-the-art result of 9.32%, indicating the power of our approach.

6.3. SVHN

The Street View House Numbers (SVHN) dataset includes 604,388 images (both training set and extra set) and 26,032 testing images (Netzer et al., 2011). Similar to MNIST, the goal is to classify the digit centered in each 32×32 RGB image. Due to the large variety of colors and brightness variations in the images, we preprocess the images using local contrast normalization as in (Zeiler and Fergus, 2013).


model       | error (%)    | 5 network voting error (%)
No-Drop     | 11.18 ± 0.13 | 10.22
Dropout     | 11.52 ± 0.18 | 9.83
DropConnect | 11.10 ± 0.13 | 9.41

Table 5. CIFAR-10 classification error using a larger feature extractor. Previous state-of-the-art is 9.5% (Snoek et al., 2012). Voting with 12 DropConnect networks produces an error rate of 9.32%, significantly beating the state-of-the-art.

The feature extractor is the same as in the larger CIFAR-10 experiment, but we instead use a larger 512 unit fully connected layer with relu activations between the softmax layer and the feature extractor. After contrast normalizing, the training data is randomly cropped to 28×28 pixels and is rotated and scaled. We do not do horizontal flips. Table 6 shows the classification performance for 5 models trained with an initial learning rate of 0.001 for a 100-50-10 epoch schedule.

Due to the large training set size, both Dropout and DropConnect achieve nearly the same performance as No-Drop. However, using our data augmentation techniques and careful annealing, the per-model scores easily surpass the previous 2.80% state-of-the-art result of (Zeiler and Fergus, 2013). Furthermore, our voting scheme reduces the relative error of the previous state-of-the-art by 30% to achieve 1.94% error.

model       | error (%)    | 5 network voting error (%)
No-Drop     | 2.26 ± 0.072 | 1.94
Dropout     | 2.25 ± 0.034 | 1.96
DropConnect | 2.23 ± 0.039 | 1.94

Table 6. SVHN classification error. The previous state-of-the-art is 2.8% (Zeiler and Fergus, 2013).

6.4. NORB

In the final experiment we evaluate our models on the 2-fold NORB (jittered-cluttered) dataset (LeCun et al., 2004), a collection of stereo images of 3D models. For each image, one of 6 classes appears on a random background. We train on 2 folds of 29,160 images each and test on a total of 58,320 images. The images are downsampled from 108×108 to 48×48 as in (Ciresan et al., 2012).

We use the same feature extractor as the larger CIFAR-10 experiment. There is a 512 unit fully connected layer with relu activations placed between the softmax layer and feature extractor. Rotation and scaling of the training data is applied, but we do not crop or flip the images, as we found that to hurt performance on this dataset.

model       | error (%)   | 5 network voting error (%)
No-Drop     | 4.48 ± 0.78 | 3.36
Dropout     | 3.96 ± 0.16 | 3.03
DropConnect | 4.14 ± 0.06 | 3.23

Table 7. NORB classification error for the jittered-cluttered dataset, using 2 training folds. The previous state-of-the-art is 3.57% (Ciresan et al., 2012).

We trained with an initial learning rate of 0.001 and anneal for 100-40-10 epochs.

In this experiment we beat the previous state-of-the-art result of 3.57% using No-Drop, Dropout and DropConnect with our voting scheme. While Dropout surpasses DropConnect slightly, both methods improve over No-Drop in this benchmark, as shown in Table 7.

7. Discussion

We have presented DropConnect, which generalizes Hinton et al.'s Dropout (Hinton et al., 2012) to the entire connectivity structure of a fully connected neural network layer. We provide both theoretical justification and empirical results to show that DropConnect helps regularize large neural network models. Results on a range of datasets show that DropConnect often outperforms Dropout. While our current implementation of DropConnect is slightly slower than No-Drop or Dropout, in large models the feature extractor is the bottleneck, thus there is little difference in overall training time. DropConnect allows us to train large models while avoiding overfitting. This yields state-of-the-art results on a variety of standard benchmarks using our efficient GPU implementation of DropConnect.

8. Appendix

8.1. Preliminaries

Definition 1 (DropConnect Network). Given data set S with ℓ entries x_1, x_2, ..., x_ℓ with labels y_1, y_2, ..., y_ℓ, we define the DropConnect network as a mixture model:

    o = Σ_M p(M) f(x; θ, M) = E_M[ f(x; θ, M) ]

Each network f(x; θ, M) has weight p(M), and the network parameters are θ = {W_s, W, W_g}. W_s are the softmax layer parameters, W are the DropConnect layer parameters, and W_g are the feature extractor parameters. Furthermore, M is the DropConnect layer mask.

Now we reformulate the cross-entropy loss on top of the softmax into a single parameter function that combines the softmax output and labels, as a logistic loss.


Definition 2 (Logistic Loss). The following loss function, defined on k-class classification, is called the logistic loss function:

    A_y(o) = − Σ_i y_i ln( exp(o_i) / Σ_j exp(o_j) ) = −o_i + ln Σ_j exp(o_j)

where y is a binary vector with the i-th bit set on.

Lemma 1. The logistic loss function A has the following properties: 1) A_y(0) = ln k, 2) −1 ≤ A′_y(o) ≤ 1, and 3) A″_y(o) ≥ 0.

Definition 3 (Rademacher complexity). For a sample S = {x_1, ..., x_ℓ} generated by a distribution D on set X, and a real-valued function class F in domain X, the empirical Rademacher complexity of F is the random variable:

    R̂_ℓ(F) = E_σ[ sup_{f∈F} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i f(x_i) |  |  x_1, ..., x_ℓ ]

where σ = {σ_1, ..., σ_ℓ} are independent uniform ±1-valued (Rademacher) random variables. The Rademacher complexity of F is R_ℓ(F) = E_S[ R̂_ℓ(F) ].
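To make Definition 3 concrete, the following Monte Carlo sketch (not part of the paper) estimates the empirical Rademacher complexity of a small finite function class from the matrix of its values on the fixed sample; all names are illustrative.

```python
# Monte Carlo estimate of the empirical Rademacher complexity in Definition 3
# for a finite function class F evaluated on a fixed sample of size l.
import numpy as np

def empirical_rademacher(F_values, num_draws=2000, rng=np.random.default_rng(0)):
    """F_values: (|F|, l) array with F_values[f, i] = f(x_i)."""
    num_f, l = F_values.shape
    total = 0.0
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=l)         # Rademacher variables
        corr = np.abs((2.0 / l) * (F_values @ sigma))   # |(2/l) sum_i sigma_i f(x_i)|
        total += corr.max()                             # sup over f in F
    return total / num_draws
```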

8.2. Bound Derivation

Lemma 2 ((Ledoux and Talagrand, 1991)). Let F be a class of real functions and H = [F_j]_{j=1}^k be a k-dimensional function class. If A: R^k → R is a Lipschitz function with constant L and satisfies A(0) = 0, then R̂_ℓ(A ∘ H) ≤ 2kL R̂_ℓ(F).

Lemma 3 (Classifier Generalization Bound). The generalization bound of a k-class classifier with logistic loss function is directly related to the Rademacher complexity of that classifier:

    E[A_y(o)] ≤ (1/ℓ) Σ_{i=1}^{ℓ} A_{y_i}(o_i) + 2k R̂_ℓ(F) + 3 √( ln(2/δ) / (2ℓ) )

Lemma 4. For all neuron activations, sigmoid, tanh and relu, we have R̂_ℓ(a ∘ F) ≤ 2 R̂_ℓ(F).

Lemma 5 (Network Layer Bound). Let G be the class of real functions R^d → R with input dimension F, i.e. G = [F_j]_{j=1}^d, and H_B a linear transform function parametrized by W with ‖W‖_2 ≤ B. Then R̂_ℓ(H ∘ G) ≤ √d B R̂_ℓ(F).

Proof.

    R̂_ℓ(H ∘ G) = E_σ[ sup_{h∈H, g∈G} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i h∘g(x_i) | ]
                = E_σ[ sup_{g∈G, ‖W‖≤B} | ⟨ W, (2/ℓ) Σ_{i=1}^{ℓ} σ_i g(x_i) ⟩ | ]
                ≤ B E_σ[ sup_{f_j∈F} ‖ [ (2/ℓ) Σ_{i=1}^{ℓ} σ_i^j f_j(x_i) ]_{j=1}^d ‖ ]
                = B √d E_σ[ sup_{f∈F} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i f(x_i) | ] = √d B R̂_ℓ(F)

Remark 1. Given a layer in our network, we denote the function of all layers before it as G = [F_j]_{j=1}^d. This layer has the linear transformation function H and activation function a. By Lemma 4 and Lemma 5, we know the network complexity is bounded by

    R̂_ℓ(H ∘ G) ≤ c √d B R̂_ℓ(F)

where c = 1 for the identity neuron and c = 2 for others.

Lemma 6. Let F_M be the class of real functions that depend on M. Then R̂_ℓ(E_M[F_M]) ≤ E_M[ R̂_ℓ(F_M) ].

Proof. R̂_ℓ(E_M[F_M]) = R̂_ℓ( Σ_M p(M) F_M ) ≤ Σ_M R̂_ℓ( p(M) F_M ) ≤ Σ_M |p(M)| R̂_ℓ(F_M) = E_M[ R̂_ℓ(F_M) ].

Theorem 1 (DropConnect Network Complexity). Consider the DropConnect neural network defined in Definition 1. Let R̂_ℓ(G) be the empirical Rademacher complexity of the feature extractor and R̂_ℓ(F) be the empirical Rademacher complexity of the whole network. In addition, we assume:

1. the weight parameter of the DropConnect layer satisfies |W| ≤ B_h,
2. the weight parameter of s satisfies |W_s| ≤ B_s (its L2-norm is bounded by √(dk) B_s).

Then we have R̂_ℓ(F) ≤ p ( 2√k d B_s n √d B_h ) R̂_ℓ(G).

Proof.

    R̂_ℓ(F) = R̂_ℓ( E_M[ f(x; θ, M) ] ) ≤ E_M[ R̂_ℓ( f(x; θ, M) ) ]                    (6)
            ≤ ( √(dk) B_s ) √d E_M[ R̂_ℓ( a ∘ h_M ∘ g ) ]                              (7)
            = 2√k d B_s E_M[ R̂_ℓ( h_M ∘ g ) ]                                         (8)

where h_M = (M ⊙ W)v. Equation (6) is based on Lemma 6, Equation (7) is based on Lemma 5, and Equation (8) follows from Lemma 4.

    E_M[ R̂_ℓ( h_M ∘ g ) ] = E_{M,σ}[ sup_{h∈H, g∈G} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i W^T D_M g(x_i) | ]                        (9)
                          = E_{M,σ}[ sup_{h∈H, g∈G} | ⟨ D_M W, (2/ℓ) Σ_{i=1}^{ℓ} σ_i g(x_i) ⟩ | ]
                          ≤ E_M[ max_W ‖ D_M W ‖ ] E_σ[ sup_{g_j∈G} ‖ [ (2/ℓ) Σ_{i=1}^{ℓ} σ_i g_j(x_i) ]_{j=1}^n ‖ ]   (10)
                          ≤ B_h p √(nd) ( √n R̂_ℓ(G) ) = p n √d B_h R̂_ℓ(G)

where D_M in Equation (9) is a diagonal matrix with diagonal elements equal to the mask m, and inner product properties lead to Equation (10). Thus we have

    R̂_ℓ(F) ≤ p ( 2√k d B_s n √d B_h ) R̂_ℓ(G)


References

D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), CVPR '12, pages 3642–3649, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-1-4673-1226-4.

G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.

A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master's thesis, University of Toronto, 2009.

A. Krizhevsky. cuda-convnet. http://code.google.com/p/cuda-convnet, 2012.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov. 1998. ISSN 0018-9219. doi: 10.1109/5.726791.

Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR '04, pages 97–104, Washington, DC, USA, 2004. IEEE Computer Society.

M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, New York, 1991.

D. J. C. Mackay. Probable networks and plausible predictions - a review of practical bayesian methods for supervised neural networks. In Bayesian methods for backpropagation networks. Springer, 1995.

V. Nair and G. E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML, 2010.

Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning 2011, 2011.

J. Snoek, H. Larochelle, and R. P. Adams. Practical bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.

A. S. Weigend, D. E. Rumelhart, and B. A. Huberman. Generalization by weight-elimination with application to forecasting. In NIPS, 1991.

M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In ICLR, 2013.


Regularization of Neural Networks using DropConnect: Supplementary Material

1. Preliminaries

Definition 1 (DropConnect Network). Given data set S with ℓ entries x_1, x_2, ..., x_ℓ with labels y_1, y_2, ..., y_ℓ, we define the DropConnect network as a mixture model:

    o = Σ_M p(M) f(x; θ, M) = E_M[ f(x; θ, M) ]        (1)

Each network f(x; θ, M) has weight p(M), and the network parameters are θ = {W_s, W, W_g}. W_s are the softmax layer parameters, W are the DropConnect layer parameters, and W_g are the feature extractor parameters. Furthermore, M is the DropConnect layer mask.

Remark 1. When each element of M has equal probability of being on and off (p = 0.5), the mixture model has equal weights for all sub-models f(x; θ, M); otherwise the mixture model has larger weights on some sub-models than others.

We reformulate the cross-entropy loss on top of the softmax into a single parameter function that combines the softmax output and labels, the same as the logistic loss.

Definition 2 (Logistic Loss). The following loss function, defined on k-class classification, is called the logistic loss function:

    A_y(o) = − Σ_i y_i ln( exp(o_i) / Σ_j exp(o_j) ) = −o_i + ln Σ_j exp(o_j)

where y is a binary vector with the i-th bit set on.

Lemma 1. The logistic loss function A has the following properties:

1. A_y(0) = ln k
2. −1 ≤ A′_y(o) ≤ 1
3. A″_y(o) ≥ 0

The first says that A(0) is a constant that depends only on the number of labels. The second says that A is a Lipschitz function with L = 1. The third says that A is a convex function of o.

Definition 3 (Rademacher complexity). For a sample S = {x_1, ..., x_ℓ} generated by a distribution D on set X, and a real-valued function class F in domain X, the empirical Rademacher complexity of F is the random variable:

    R̂_ℓ(F) = E_σ[ sup_{f∈F} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i f(x_i) |  |  x_1, ..., x_ℓ ]

where σ = {σ_1, ..., σ_ℓ} are independent uniform ±1-valued (Rademacher) random variables. The Rademacher complexity of F is R_ℓ(F) = E_S[ R̂_ℓ(F) ].

Theorem 1 ((Koltchinskii and Panchenko, 2000)). Fix δ ∈ (0, 1) and let F be a class of functions mapping from M to [0, 1]. Let (M_i)_{i=1}^{ℓ} be drawn independently according to a probability distribution D. Then, with probability at least 1 − δ over random draws of samples of size ℓ, every f ∈ F satisfies

    E[f(M)] ≤ Ê[f(M)] + R_ℓ(F) + √( ln(2/δ) / (2ℓ) )
            ≤ Ê[f(M)] + R̂_ℓ(F) + 3 √( ln(2/δ) / (2ℓ) )

where Ê denotes the empirical expectation over the sample.

2. Bound Derivation

Theorem 2 ((Ledoux and Talagrand, 1991)). Let F be a class of real functions. If A: R → R is Lipschitz with constant L and satisfies A(0) = 0, then R̂_ℓ(A ∘ F) ≤ 2L R̂_ℓ(F).

Lemma 2. Let F be a class of real functions and H = [F_j]_{j=1}^k be a k-dimensional function class. If A: R^k → R is a Lipschitz function with constant L and satisfies A(0) = 0, then R̂_ℓ(A ∘ H) ≤ 2kL R̂_ℓ(F).

Lemma 3 (Classifier Generalization Bound). The generalization bound of a k-class classifier with logistic loss function is directly related to the Rademacher complexity of that classifier:

    E[A_y(o)] ≤ (1/ℓ) Σ_{i=1}^{ℓ} A_{y_i}(o_i) + 2k R̂_ℓ(F) + 3 √( ln(2/δ) / (2ℓ) )

Proof. From Lemma 1, the shifted loss (A − c), with the constant c chosen so that (A − c)(0) = 0, is Lipschitz with constant 1 since (A − c)′ ≤ 1. By Lemma 2, R̂_ℓ((A − c) ∘ F) ≤ 2k R̂_ℓ(F).


Lemma 4. For all neuron activations, sigmoid, tanh and relu, we have R̂_ℓ(a ∘ F) ≤ 2 R̂_ℓ(F).

Lemma 5 (Network Layer Bound). Let G be the class of real functions R^d → R with input dimension F, i.e. G = [F_j]_{j=1}^d, and H_B a linear transform function parameterized by W with ‖W‖_2 ≤ B. Then R̂_ℓ(H ∘ G) ≤ √d B R̂_ℓ(F).

Proof.

    R̂_ℓ(H ∘ G) = E_σ[ sup_{h∈H, g∈G} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i h∘g(x_i) | ]
                = E_σ[ sup_{g∈G, ‖W‖≤B} | ⟨ W, (2/ℓ) Σ_{i=1}^{ℓ} σ_i g(x_i) ⟩ | ]
                ≤ B E_σ[ sup_{f_j∈F} ‖ [ (2/ℓ) Σ_{i=1}^{ℓ} σ_i^j f_j(x_i) ]_{j=1}^d ‖ ]
                = B √d E_σ[ sup_{f∈F} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i f(x_i) | ] = √d B R̂_ℓ(F)

Remark 2. Given a layer in our network, we denote the function of all layers before it as G = [F_j]_{j=1}^d. This layer has the linear transformation function H and activation function a. By Lemma 4 and Lemma 5, we know the network complexity is bounded by

    R̂_ℓ(H ∘ G) ≤ c √d B R̂_ℓ(F)

where c = 1 for the identity neuron and c = 2 for others.

Lemma 6. Let F_M be the class of real functions that depend on M. Then R̂_ℓ(E_M[F_M]) ≤ E_M[ R̂_ℓ(F_M) ].

Proof.

    R̂_ℓ(E_M[F_M]) = R̂_ℓ( Σ_M p(M) F_M ) ≤ Σ_M R̂_ℓ( p(M) F_M ) ≤ Σ_M |p(M)| R̂_ℓ(F_M) = E_M[ R̂_ℓ(F_M) ]

because of the common facts that 1) R̂_ℓ(cF) = |c| R̂_ℓ(F) and 2) R̂_ℓ(Σ_i F_i) ≤ Σ_i R̂_ℓ(F_i).

Theorem 3 (DropConnect Network Complexity). Consider the DropConnect neural network defined in Definition 1. Let R̂_ℓ(G) be the empirical Rademacher complexity of the feature extractor and R̂_ℓ(F) be the empirical Rademacher complexity of the whole network. In addition, we assume:

1. the weight parameter of the DropConnect layer satisfies |W| ≤ B_h,
2. the weight parameter of s satisfies |W_s| ≤ B_s (its L2-norm is bounded by √(dk) B_s).

Then we have

    R̂_ℓ(F) ≤ p ( 2√k d B_s n √d B_h ) R̂_ℓ(G)

Proof.

    R̂_ℓ(F) = R̂_ℓ( E_M[ f(x; θ, M) ] )
            ≤ E_M[ R̂_ℓ( f(x; θ, M) ) ]                                              (2)
            = E_M[ R̂_ℓ( s ∘ a ∘ h_M ∘ g ) ]
            ≤ ( √(dk) B_s ) √d E_M[ R̂_ℓ( a ∘ h_M ∘ g ) ]                            (3)
            = 2√k d B_s E_M[ R̂_ℓ( h_M ∘ g ) ]                                        (4)

where h_M = (M ⊙ W)v. Equation (2) is based on Lemma 6, Equation (3) is based on Lemma 5, and Equation (4) follows from Lemma 4.

    E_M[ R̂_ℓ( h_M ∘ g ) ] = E_{M,σ}[ sup_{h∈H, g∈G} | (2/ℓ) Σ_{i=1}^{ℓ} σ_i w^T D_M g(x_i) | ]                        (5)
                          = E_{M,σ}[ sup_{h∈H, g∈G} | ⟨ D_M w, (2/ℓ) Σ_{i=1}^{ℓ} σ_i g(x_i) ⟩ | ]
                          ≤ E_M[ max_w ‖ D_M w ‖ ] E_σ[ sup_{g_j∈G} ‖ [ (2/ℓ) Σ_{i=1}^{ℓ} σ_i g_j(x_i) ]_{j=1}^n ‖ ]   (6)
                          ≤ B_h p √(nd) ( √n R̂_ℓ(G) ) = p n √d B_h R̂_ℓ(G)

where D_M in Equation (5) is a diagonal matrix with diagonal elements equal to the mask m, and inner product properties lead to Equation (6). Thus we have

    R̂_ℓ(F) ≤ p ( 2√k d B_s n √d B_h ) R̂_ℓ(G)

Remark 3. Theorem 3 implies that p is an additional regularizer we have added to the network when we convert a normal neural network into a network with DropConnect layers. Consider the following extreme cases:

1. p = 0: the network generalization bound equals 0, which is true because the classifier no longer depends on the input.
2. p = 1: it reduces to a normal network.


Symbol | Description                                                                | Related Formula
y      | Data label; can be either an integer label or a bit vector (context-dependent) |
x      | Network input data                                                          |
g()    | Feature extractor function with parameter W_g                               |
v      | Feature extractor network output                                            | v = g(x; W_g)
M      | DropConnect connection information parameter (weight mask)                  |
h()    | DropConnect transformation function with parameters W, M                    |
u      | DropConnect output                                                          | u = h(v; W, M)
a()    | DropConnect activation function                                             |
r      | DropConnect output after activation                                         | r = a(u)
s()    | Dimension reduction layer function with parameter W_s                       |
o      | Dimension reduction layer output (network output)                           | o = s(r; W_s)
θ      | All parameters of the network except the weight mask                        | θ = {W_s, W, W_g}
f()    | Overall classifier (network) output                                         | o = f(x; θ, M)
λ      | Weight penalty                                                              |
A()    | Data loss function                                                          | A(o − y)
L()    | Overall objective function                                                  | L(x, y) = Σ_i A(o_i − y_i) + (1/2) λ ‖W‖_2²
n      | Dimension of feature extractor output                                       |
d      | Dimension of DropConnect layer output                                       |
k      | Number of classes                                                           | dim(y) = k

Table 1. Symbol Table.

References

V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, New York, 1991.

Page 3: Regularization of Neural Networks using …yann.lecun.com/exdb/publis/pdf/wan-icml-13.pdfRegularization of Neural Networks using DropConnect DropConnect weights W (d x n) b) DropConnect

Regularization of Neural Networks using DropConnect

The overall model f(x θM) therefore maps inputdata x to an output o through a sequence of operationsgiven the parameters θ = WgWWs and randomly-drawn mask M The correct value of o is obtained bysumming out over all possible masks M

o = EM [f(x θM)] =sumM

p(M)f(x θM) (4)

This reveals the mixture model interpretation of Drop-Connect (and Dropout) where the output is a mixtureof 2|M | different networks each with weight p(M)If p = 05 then these weights are equal and o =1|M |

sumM f(x θM) = 1

|M |sumM s(a((M W )v)Ws)

31 Training

Training the model described in Section 3 begins byselecting an example x from the training set and ex-tracting features for that example v These featuresare input to the DropConnect layer where a mask ma-trix M is first drawn from a Bernoulli(p) distributionto mask out elements of both the weight matrix andthe biases in the DropConnect layer A key compo-nent to successfully training with DropConnect is theselection of a different mask for each training exam-ple Selecting a single mask for a subset of trainingexamples such as a mini-batch of 128 examples doesnot regularize the model enough in practice Since thememory requirement for the M rsquos now grows with thesize of each mini-batch the implementation needs tobe carefully designed as described in Section 5

Once a mask is chosen it is applied to the weights andbiases in order to compute the input to the activa-tion function This results in r the input to the soft-max layer which outputs class predictions from whichcross entropy between the ground truth labels is com-puted The parameters throughout the model θ thencan be updated via stochastic gradient descent (SGD)by backpropagating gradients of the loss function withrespect to the parameters Aprimeθ To update the weightmatrix W in a DropConnect layer the mask is ap-plied to the gradient to update only those elementsthat were active in the forward pass Additionallywhen passing gradients down to the feature extractorthe masked weight matrix M W is used A summaryof these steps is provided in Algorithm 1

32 Inference

At inference time we need to compute r =1|M |

sumM a((M W )v) which naively requires the

evaluation of 2|M | different masks ndash plainly infeasible

The Dropout work (Hinton et al 2012) made the ap-proximation

sumM a((M W )v) asymp a(

sumM (M W )v)

Algorithm 1 SGD Training with DropConnect

Input example x parameters θtminus1 from step tminus1learning rate ηOutput updated parameters θtForward PassExtract features v larr g(xWg)Random sample M mask Mij sim Bernoulli(p)Compute activations r = a((M W )v)Compute output o = s(rWs)Backpropagate GradientsDifferentiate loss Aprimeθ with respect to parameters θUpdate softmax layer Ws = Ws minus ηAprimeWs

Update DropConnect layer W = W minus η(M AprimeW )Update feature extractor Wg = Wg minus ηAprimeWg

Algorithm 2 Inference with DropConnect

Input example x parameters θ of samples ZOutput prediction uExtract features v larr g(xWg)Moment matching of umicrolarr EM [u] σ2 larr VM [u]

for z = 1 Z do Draw Z samplesfor i = 1 d do Loop over units in r

Sample from 1D Gaussian uiz sim N (microi σ2i )

riz larr a(uiz)end for

end forPass result r =

sumZz=1 rzZ to next layer

ie averaging before the activation rather than afterAlthough this seems to work in practice it is not jus-tified mathematically particularly for the relu activa-tion function2

We take a different approach Consider a singleunit ui before the activation function a() ui =sumj(Wijvj)Mij This is a weighted sum of Bernoulli

variables Mij which can be approximated by a Gaus-sian via moment matching The mean and varianceof the units u are EM [u] = pWv and VM [u] =p(1 minus p)(W W )(v v) We can then draw samplesfrom this Gaussian and pass them through the activa-tion function a() before averaging them and present-ing them to the next layer Algorithm 2 summarizesthe method Note that the sampling can be done ef-ficiently since the samples for each unit and exam-ple can be drawn in parallel This scheme is only anapproximation in the case of multi-layer network itworks well in practise as shown in Experiments

2Consider u sim N(0 1) with a(u) = max(u 0)

a(EM (u)) = 0 but EM (a(u)) = 1radic

2π asymp 04

Regularization of Neural Networks using DropConnect

Implementation Mask Weight Time(ms) Speedupfprop bprop acts bprop weights total

CPU float 4802 12286 16928 34016 10 timesCPU bit 3923 6791 7597 18311 19 timesGPU float(global memory) 216 62 72 350 972 timesGPU float(tex1D memory) 151 61 60 272 1260 timesGPU bit(tex2D aligned memory) 24 27 31 82 4148 timesGPU(Lower Bound) cuBlas + read mask weight 03 03 02 08

Table 1 Performance comparison between different implementations of our DropConnect layer on NVidia GTX580 GPUrelative to a 267Ghz Intel Xeon (compiled with -O3 flag) Input dimension and Output dimension are 1024 and mini-batchsize is 128 As reference we provide traditional matrix multiplication using the cuBlas library

4 Model Generalization BoundWe now show a novel bound for the Rademacher com-plexity of the model R`(F) on the training set (seeappendix for derivation)

R`(F) le p(

2radickdBsn

radicdBh

)R`(G) (5)

where max|Ws| le Bs max|W | le B k is the num-ber of classes R`(G) is the Rademacher complexity ofthe feature extractor n and d are the dimensionalityof the input and output of the DropConnect layer re-spectively The important result from Eqn 5 is thatthe complexity is a linear function of the probability pof an element being kept in DropConnect or DropoutWhen p = 0 the model complexity is zero since theinput has no influence on the output When p = 1 itreturns to the complexity of a standard model

5 Implementation Details

Our system involves three components implementedon a GPU 1) a feature extractor 2) our DropConnectlayer and 3) a softmax classification layer For 1 and3 we utilize the Cuda-convnet package (Krizhevsky2012) a fast GPU based convolutional network libraryWe implement a custom GPU kernel for performingthe operations within the DropConnect layer Ourcode is available at httpcsnyuedu~wanli

dropc

A typical fully connected layer is implemented as amatrix-matrix multiplication between the input vec-tors for a mini-batch of training examples and theweight matrix The difficulty in our case is that eachtraining example requires itrsquos own random mask ma-trix applied to the weights and biases of the DropCon-nect layer This leads to several complications

1 For a weight matrix of size d times n the correspondingmask matrix is of size dtimes ntimes b where b is the size ofthe mini-batch For a 4096times4096 fully connected layerwith mini-batch size of 128 the matrix would be toolarge to fit into GPU memory if each element is storedas a floating point number requiring 8G of memory

2 Once a random instantiation of the mask is created itis non-trivial to access all the elements required duringthe matrix multiplications so as to maximize perfor-mance

The first problem is not hard to address Each ele-ment of the mask matrix is stored as a single bit toencode the connectivity information rather than as afloat The memory cost is thus reduced by 32 timeswhich becomes 256M for the example above This notonly reduces the memory footprint but also reducesthe bandwidth required as 32 elements can be accessedwith each 4-byte read We overcome the second prob-lem using an efficient memory access pattern using 2Dtexture aligned memory These two improvements arecrucial for an efficient GPU implementation of Drop-Connect as shown in Table 1 Here we compare to anaive CPU implementation with floating point masksand get a 415times speedup with our efficient GPU design

6 Experiments

We evaluate our DropConnect model for regularizingdeep neural networks trained for image classificationAll experiments use mini-batch SGD with momentumon batches of 128 images with the momentum param-eter fixed at 09

We use the following protocol for all experiments un-less otherwise stated

bull Augment the dataset by 1) randomly selectingcropped regions from the images 2) flipping imageshorizontally 3) introducing 15 scaling and rotationvariationsbull Train 5 independent networks with random permuta-

tions of the training sequencebull Manually decrease the learning rate if the network

stops improving as in (Krizhevsky 2012) according toa schedule determined on a validation setbull Train the fully connected layer using Dropout Drop-

Connect or neither (No-Drop)bull At inference time for DropConnect we draw Z = 1000

Regularization of Neural Networks using DropConnect

samples at the inputs to the activation function of thefully connected layer and average their activations

To anneal the initial learning rate we choose a fixedmultiplier for different stages of training We reportthree numbers of epochs such as 600-400-200 to defineour schedule We multiply the initial rate by 1 for thefirst such number of epochs Then we use a multiplierof 05 for the second number of epochs followed by01 again for this second number of epochs The thirdnumber of epochs is used for multipliers of 005 0010005 and 0001 in that order after which point wereport our results We determine the epochs to use forour schedule using a validation set to look for plateausin the loss function at which point we move to thenext multiplier 3

Once the 5 networks are trained we report two num-bers 1) the mean and standard deviation of the classi-fication errors produced by each of the 5 independentnetworks and 2) the classification error that resultswhen averaging the output probabilities from the 5networks before making a prediction We find in prac-tice this voting scheme inspired by (Ciresan et al2012) provides significant performance gains achiev-ing state-of-the-art results in many standard bench-marks when combined with our DropConnect layer

61 MNIST

The MNIST handwritten digit classification task (Le-Cun et al 1998) consists of 28times28 black and white im-ages each containing a digit 0 to 9 (10-classes) Eachdigit in the 60 000 training images and 10 000 testimages is normalized to fit in a 20times20 pixel box whilepreserving their aspect ratio We scale the pixel valuesto the [0 1] range before inputting to our models

For our first experiment on this dataset we train mod-els with two fully connected layers each with 800 out-put units using either tanh sigmoid or relu activationfunctions to compare to Dropout in (Hinton et al2012) The first layer takes the image pixels as inputwhile the second layerrsquos output is fed into a 10-classsoftmax classification layer In Table 2 we show theperformance of various activations functions compar-ing No-Drop Dropout and DropConnect in the fullyconnected layers No data augmentation is utilized inthis experiment We use an initial learning rate of 01and train for 600-400-20 epochs using our schedule

From Table 2 we can see that both Dropout and Drop-

3In all experiments the bias learning rate is 2times thelearning rate for the weights Additionally weights are ini-tialized with N(0 01) random values for fully connectedlayers and N(0 001) for convolutional layers

neuron model error()5 network

votingerror()

relu No-Drop 162plusmn0037 140Dropout 128plusmn0040 120DropConnect 120plusmn0034 112

sigmoid No-Drop 178plusmn0037 174Dropout 138plusmn0039 136DropConnect 155plusmn0046 148

tanh No-Drop 165plusmn0026 149Dropout 158plusmn0053 155DropConnect 136plusmn0054 135

Table 2 MNIST classification error rate for models withtwo fully connected layers of 800 neurons each No dataaugmentation is used in this experiment

Connect perform better than not using either methodDropConnect mostly performs better than Dropout inthis task with the gap widening when utilizing thevoting over the 5 models

To further analyze the effects of DropConnect weshow three explanatory experiments in Fig 2 using a 2-layer fully connected model on MNIST digits Fig 2ashows test performance as the number of hidden unitsin each layer varies As the model size increases No-Drop overfits while both Dropout and DropConnectimprove performance DropConnect consistently givesa lower error rate than Dropout Fig 2b shows the ef-fect of varying the drop rate p for Dropout and Drop-Connect for a 400-400 unit network Both methodsgive optimal performance in the vicinity of 05 thevalue used in all other experiments in the paper Oursampling approach gives a performance gain over meaninference (as used by Hinton (Hinton et al 2012))but only for the DropConnect case In Fig 2c weplot the convergence properties of the three methodsthroughout training on a 400-400 network We cansee that No-Drop overfits quickly while Dropout andDropConnect converge slowly to ultimately give supe-rior test performance DropConnect is even slower toconverge than Dropout but yields a lower test errorin the end

In order to improve our classification result we choosea more powerful feature extractor network described in(Ciresan et al 2012) (relu is used rather than tanh)This feature extractor consists of a 2 layer CNN with32-64 feature maps in each layer respectively Thelast layerrsquos output is treated as input to the fully con-nected layer which has 150 relu units on which No-Drop Dropout or DropConnect are applied We re-port results in Table 3 from training the network ona) the original MNIST digits b) cropped 24 times 24 im-ages from random locations and c) rotated and scaledversions of these cropped images We use an initial

Regularization of Neural Networks using DropConnect

200 400 800 160011

12

13

14

15

16

17

18

19

2

21

Hidden Units

T

est E

rror

NominusDropDropoutDropConnect

0 01 02 03 04 05 06 07 08 0912

14

16

18

2

22

24

of Elements Kept

T

est E

rror

Dropout (mean)DropConnect (mean)Dropout (sampling)DropConnect (sampling)

100 200 300 400 500 600 700 800 90010minus3

10minus2

Epoch

Cro

ss E

ntro

py

NominusDrop TrainNominusDrop TestDropout TrainDropout TestDropConnect TrainDropConnect Test

Figure 2 Using the MNIST dataset in a) we analyze the ability of Dropout and DropConnect to prevent overfittingas the size of the 2 fully connected layers increase b) Varying the drop-rate in a 400-400 network shows near optimalperformance around the p = 05 proposed by (Hinton et al 2012) c) we show the convergence properties of the traintestsets See text for discussion

learning rate of 001 with a 700-200-100 epoch sched-ule no momentum and preprocess by subtracting theimage mean

crop rotationscaling

model error()5 network

votingerror()

no no No-Drop 077plusmn0051 067Dropout 059plusmn0039 052DropConnect 063plusmn0035 057

yes no No-Drop 050plusmn0098 038Dropout 039plusmn0039 035DropConnect 039plusmn0047 032

yes yes No-Drop 030plusmn0035 021Dropout 028plusmn0016 027DropConnect 028plusmn0032 021

Table 3 MNIST classification error Previous state of theart is 0 47 (Zeiler and Fergus 2013) for a single modelwithout elastic distortions and 023 with elastic distor-tions and voting (Ciresan et al 2012)

We note that our approach surpasses the state-of-the-art result of 023 (Ciresan et al 2012) achieving a021 error rate without the use of elastic distortions(as used by (Ciresan et al 2012))

62 CIFAR-10

CIFAR-10 is a data set of natural 32x32 RGB images(Krizhevsky 2009) in 10-classes with 50 000 imagesfor training and 10 000 for testing Before inputtingthese images to our network we subtract the per-pixelmean computed over the training set from each image

The first experiment on CIFAR-10 (summarized inTable 4) uses the simple convolutional network fea-ture extractor described in (Krizhevsky 2012)(layers-80seccfg) that is designed for rapid training ratherthan optimal performance On top of the 3-layerfeature extractor we have a 64 unit fully connectedlayer which uses No-Drop Dropout or DropConnectNo data augmentation is utilized for this experiment

Since this experiment is not aimed at optimal perfor-mance we report a single modelrsquos performance with-out voting We train for 150-0-0 epochs with an ini-tial learning rate of 0001 and their default weight de-cay DropConnect prevents overfitting of the fully con-nected layer better than Dropout in this experiment

model error()No-Drop 235Dropout 197DropConnect 187

Table 4 CIFAR-10 classification error using the simplefeature extractor described in (Krizhevsky 2012)(layers-80seccfg) and with no data augmentation

Table 5 shows classification results of the network us-ing a larger feature extractor with 2 convolutionallayers and 2 locally connected layers as describedin (Krizhevsky 2012)(layers-conv-local-11pctcfg) A128 neuron fully connected layer with relu activationsis added between the softmax layer and feature extrac-tor Following (Krizhevsky 2012) images are croppedto 24x24 with horizontal flips and no rotation or scal-ing is performed We use an initial learning rate of0001 and train for 700-300-50 epochs with their de-fault weight decay Model voting significantly im-proves performance when using Dropout or DropCon-nect the latter reaching an error rate of 941 Ad-ditionally we trained a model with 12 networks withDropConnect and achieved a state-of-the-art result of932 indicating the power of our approach

63 SVHN

The Street View House Numbers (SVHN) dataset in-cludes 604 388 images (both training set and extra set)and 26 032 testing images (Netzer et al 2011) Simi-lar to MNIST the goal is to classify the digit centeredin each 32x32 RGB image Due to the large variety ofcolors and brightness variations in the images we pre-

Regularization of Neural Networks using DropConnect

model error() 5 network votingerror()

No-Drop 1118plusmn 013 1022Dropout 1152plusmn 018 983DropConnect 1110plusmn 013 941

Table 5 CIFAR-10 classification error using a larger fea-ture extractor Previous state-of-the-art is 95 (Snoeket al 2012) Voting with 12 DropConnect networks pro-duces an error rate of 932 significantly beating thestate-of-the-art

process the images using local contrast normalizationas in (Zeiler and Fergus 2013) The feature extractoris the same as the larger CIFAR-10 experiment butwe instead use a larger 512 unit fully connected layerwith relu activations between the softmax layer andthe feature extractor After contrast normalizing thetraining data is randomly cropped to 28 times 28 pixelsand is rotated and scaled We do not do horizontalflips Table 6 shows the classification performance for5 models trained with an initial learning rate of 0001for a 100-50-10 epoch schedule

Due to the large training set size both Dropout andDropConnect achieve nearly the same performance asNo-Drop However using our data augmentation tech-niques and careful annealing the per model scores eas-ily surpass the previous 280 state-of-the-art resultof (Zeiler and Fergus 2013) Furthermore our vot-ing scheme reduces the relative error of the previousstate-of-to-art by 30 to achieve 194 error

model error() 5 network votingerror()

No-Drop 226plusmn 0072 194Dropout 225plusmn 0034 196DropConnect 223plusmn 0039 194

Table 6 SVHN classification error The previous state-of-the-art is 28 (Zeiler and Fergus 2013)

64 NORB

In the final experiment we evaluate our models onthe 2-fold NORB (jittered-cluttered) dataset (LeCunet al 2004) a collection of stereo images of 3D mod-els For each image one of 6 classes appears on arandom background We train on 2-folds of 29 160images each and the test on a total of 58 320 imagesThe images are downsampled from 108times108 to 48times48as in (Ciresan et al 2012)

We use the same feature extractor as the largerCIFAR-10 experiment There is a 512 unit fully con-nected layer with relu activations placed between thesoftmax layer and feature extractor Rotation andscaling of the training data is applied but we do notcrop or flip the images as we found that to hurt per-

model error()5 network

votingerror()

No-Drop 448plusmn 078 336Dropout 396plusmn 016 303DropConnect 414plusmn 006 323

Table 7 NORM classification error for the jittered-cluttered dataset using 2 training folds The previousstate-of-art is 357 (Ciresan et al 2012)

formance on this dataset We trained with an initiallearning rate of 0001 and anneal for 100-40-10 epochs

In this experiment we beat the previous state-of-the-art result of 357 using No-Drop Dropout and Drop-Connect with our voting scheme While Dropout sur-passes DropConnect slightly both methods improveover No-Drop in this benchmark as shown in Table 7

7 Discussion

We have presented DropConnect which generalizesHinton et al rsquos Dropout (Hinton et al 2012) to the en-tire connectivity structure of a fully connected neuralnetwork layer We provide both theoretical justifica-tion and empirical results to show that DropConnecthelps regularize large neural network models Resultson a range of datasets show that DropConnect oftenoutperforms Dropout While our current implementa-tion of DropConnect is slightly slower than No-Drop orDropout in large models models the feature extractoris the bottleneck thus there is little difference in over-all training time DropConnect allows us to train largemodels while avoiding overfitting This yields state-of-the-art results on a variety of standard benchmarksusing our efficient GPU implementation of DropCon-nect

8 Appendix

81 Preliminaries

Definition 1 (DropConnect Network) Given dataset S with ` entries x1x2 x` with labelsy1 y2 y` we define the DropConnect networkas a mixture model o =

sumM p(M)f(x θM) =

EM [f(x θM)]

Each network f(x θM) has weights p(M) and net-work parameters are θ = WsWWg Ws are thesoftmax layer parameters W are the DropConnectlayer parameters and Wg are the feature extractor pa-rameters Further more M is the DropConnect layermask

Now we reformulate the cross-entropy loss on top ofthe softmax into a single parameter function that com-bines the softmax output and labels as a logistic

Regularization of Neural Networks using DropConnect

Definition 2 (Logistic Loss) The following loss func-tion defined on k-class classification is call the logisticloss function

Ay(o) = minussumi

yi lnexp oisumj exp(oj)

= minusoi + lnsumj

exp(oj)

where y is binary vector with ith bit set on

Lemma 1 Logistic loss function A has the followingproperties 1) Ay(0) = ln k 2) minus1 le Aprimey(o) le 1 and3)Aprimeprimey(o) ge 0

Definition 3 (Rademacher complexity) For a sampleS = x1 x` generated by a distribution D on setX and a real-valued function class F in domain X theempirical Rademacher complexity of F is the randomvariable

R` (F) = Eσ

[supfisinF|2`

sumi=1

σif(xi)| | x1 x`

]

where sigma = σ1 σ` are independent uniformplusmn1-valued (Rademacher) random variables The

Rademacher complexity of F is R`(F) = ES

[R` (F)

]

82 Bound Derivation

Lemma 2 ((Ledoux and Talagrand 1991)) Let Fbe class of real functions and H = [Fj ]kj=1 be a k-

dimensional function class If A Rk rarr R is a Lips-chitz function with constant L and satisfies A(0) = 0then R`(A H) le 2kLR`(F)

Lemma 3 (Classifier Generalization Bound) Gener-alization bound of a k-class classifier with logistic lossfunction is directly related Rademacher complexity ofthat classifier

E[Ay(o)] le 1`

sum`i=1Ayi(oi) + 2kR`(F) + 3

radicln(2δ)

2`

Lemma 4 For all neuron activations sigmoid tanhand relu we have R`(a F) le 2R`(F)

Lemma 5 (Network Layer Bound) Let G be the classof real functions Rd rarr R with input dimension F ieG = [Fj ]dj=1 and HB is a linear transform function

parametrized by W with W2 le B then R`(HG) leradicdBR`(F)

Proof R`(HG) = Eσ[suphisinHgisinG

∣∣∣ 2` sum`i=1 σih g(xi)

∣∣∣]= Eσ

[supgisinGWleB

∣∣∣langW 2`

sum`i=1 σig(xi)

rang∣∣∣]le BEσ

[supfjisinF

∥∥∥∥[ 2` sum`i=1 σ

ji fj(xi)

]dj=1

∥∥∥∥]= BradicdEσ

[supfisinF

∣∣∣ 2` sum`i=1 σif(xi)

∣∣∣] =radicdBR`(F)

Remark 1 Given a layer in our network we denotethe function of all layers before as G = [Fj ]dj=1 This

layer has the linear transformation function H and ac-tivation function a By Lemma 4 and Lemma 5 weknow the network complexity is bounded by

R`(H G) le cradicdBR`(F)

where c = 1 for identity neuron and c = 2 for others

Lemma 6 Let FM be the class of real functions that

depend on M then R`(EM [FM ]) le EM

[R`(FM )

]Proof R`(EM [FM ]) = R`

(sumM p (m)FM

)lesum

M R`(p(m)FM ) lesumM |p(m)|R`(FM ) = EM

[R`(FM )

]

Theorem 1 (DropConnect Network Complexity)Consider the DropConnect neural network defined inDefinition 1 Let R`(G) be the empirical Rademachercomplexity of the feature extractor and R`(F) be theempirical Rademacher complexity of the whole net-work In addition we assume

1 weight parameter of DropConnect layer |W | le Bh2 weight parameter of s ie |Ws| le Bs (L2-norm ofit is bounded by

radicdkBs)

Then we have R`(F) le p(

2radickdBsn

radicdBh

)R`(G)

Proof

R`(F) = R`(EM [f(x θM ]) le EM[R`(f(x θM)

](6)

le (radicdkBs)

radicdEM

[R`(a hM g)

](7)

= 2radickdBsEM

[R`(hM g)

](8)

where hM = (M W )v Equation (6) is based onLemma 6 Equation (7) is based on Lemma 5 andEquation (8) follows from Lemma 4

EM[R`(hM g)

]= EMσ

[sup

hisinHgisinG

∣∣∣∣∣2` sumi=1

σiWTDMg(xi)

∣∣∣∣∣]

(9)

= EMσ

[sup

hisinHgisinG

∣∣∣∣∣langDMW

2

`

sumi=1

σig(xi)

rang∣∣∣∣∣]

le EM[maxWDMW

]Eσ

supgjisinG

∥∥∥∥∥∥[

2

`

sumi=1

σigj(xi)

]nj=1

∥∥∥∥∥∥(10)

le Bhpradicnd(radic

nR`(G))

= pnradicdBhR`(G)

where DM in Equation (9) is an diagonal matrixwith diagonal elements equal to m and inner prod-uct properties lead to Equation (10) Thus we have

R`(F) le p(

2radickdBsn

radicdBh

)R`(G)

Regularization of Neural Networks using DropConnect

References

D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3642-3649, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-1-4673-1226-4.

G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.

A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master's thesis, University of Toronto, 2009.

A. Krizhevsky. cuda-convnet. http://code.google.com/p/cuda-convnet, 2012.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278-2324, Nov. 1998. ISSN 0018-9219. doi: 10.1109/5.726791.

Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'04), pages 97-104, Washington, DC, USA, 2004. IEEE Computer Society.

M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, New York, 1991.

D. J. C. Mackay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. In Bayesian Methods for Backpropagation Networks. Springer, 1995.

V. Nair and G. E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML, 2010.

Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.

A. S. Weigend, D. E. Rumelhart, and B. A. Huberman. Generalization by weight-elimination with application to forecasting. In NIPS, 1991.

M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In ICLR, 2013.


Regularization of Neural Networks using DropConnect: Supplementary Material

1 Preliminaries

Definition 1 (DropConnect Network). Given a dataset $S$ with $\ell$ entries $x_1, x_2, \dots, x_\ell$ and labels $y_1, y_2, \dots, y_\ell$, we define the DropConnect network as a mixture model:

$$o = \sum_M p(M)\, f(x;\theta,M) = \mathbb{E}_M[f(x;\theta,M)] \qquad (1)$$

Each network $f(x;\theta,M)$ has weight $p(M)$, and the network parameters are $\theta = \{W_s, W, W_g\}$: $W_s$ are the softmax layer parameters, $W$ are the DropConnect layer parameters, and $W_g$ are the feature extractor parameters. Furthermore, $M$ is the DropConnect layer mask.

Remark 1. When each element of $M$ has equal probability of being on and off ($p = 0.5$), the mixture model has equal weights for all sub-models $f(x;\theta,M)$; otherwise the mixture model puts larger weights on some sub-models than on others.
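To make the mixture-model view in Eqn. (1) concrete, here is a minimal Monte Carlo sketch for a single DropConnect layer: it draws random masks $M$, applies the masked weights, and averages the post-activation outputs. The function name, sizes, relu choice, and use of NumPy are illustrative assumptions, not the paper's cuda-convnet implementation (which stores masks as bits on the GPU and uses a different sampling scheme at inference time).

```python
import numpy as np

def dropconnect_layer(v, W, b, p=0.5, n_samples=1000, rng=None):
    """Monte Carlo approximation of the mixture over masks for one layer:
    average a((M * W) v + b) over masks M whose entries are kept with prob. p."""
    rng = np.random.default_rng() if rng is None else rng
    r_sum = np.zeros(W.shape[0])
    for _ in range(n_samples):
        M = rng.random(W.shape) < p      # each connection kept independently with prob. p
        u = (M * W) @ v + b              # masked linear transform h_M(v)
        r_sum += np.maximum(u, 0.0)      # relu activation a(u)
    return r_sum / n_samples             # approximates E_M[a(h_M(v))]

# Hypothetical sizes: n = 8 feature-extractor outputs, d = 4 layer outputs.
rng = np.random.default_rng(0)
v = rng.standard_normal(8)
W = rng.standard_normal((4, 8))
b = np.zeros(4)
r = dropconnect_layer(v, W, b, p=0.5, n_samples=200, rng=rng)
print(r)
```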

We reformulate the cross-entropy loss on top of the softmax into a single-parameter function that combines the softmax output and the labels; this is the standard logistic loss.

Definition 2 (Logistic Loss). The following loss function, defined on $k$-class classification, is called the logistic loss function:

$$A_y(o) = -\sum_i y_i \ln\frac{\exp o_i}{\sum_j \exp(o_j)} = -o_i + \ln\sum_j \exp(o_j),$$

where $y$ is a binary vector with the $i$-th bit set on.

Lemma 1. The logistic loss function $A$ has the following properties:

1. $A_y(0) = \ln k$
2. $-1 \le A'_y(o) \le 1$
3. $A''_y(o) \ge 0$

The first property says $A_y(0)$ is a constant that depends only on the number of labels. The second says $A$ is a Lipschitz function with constant $L = 1$. The third says $A$ is a convex function with respect to $o$.
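As a quick sanity check of the first two properties (added here for clarity), evaluating the loss at $o = 0$ gives

$$A_y(0) = -0 + \ln\sum_{j=1}^{k} e^{0} = \ln k,$$

and differentiating with respect to $o_j$ gives $\partial A_y(o)/\partial o_j = \frac{\exp o_j}{\sum_{j'} \exp o_{j'}} - y_j$, i.e. a softmax probability minus a 0/1 label, so every component of the gradient lies in $[-1, 1]$.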

Definition 3 (Rademacher complexity). For a sample $S = \{x_1, \dots, x_\ell\}$ generated by a distribution $D$ on a set $X$, and a real-valued function class $\mathcal{F}$ with domain $X$, the empirical Rademacher complexity of $\mathcal{F}$ is the random variable

$$\hat{R}_\ell(\mathcal{F}) = \mathbb{E}_\sigma\Big[\sup_{f\in\mathcal{F}}\Big|\frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i f(x_i)\Big| \;\Big|\; x_1,\dots,x_\ell\Big],$$

where $\sigma = \{\sigma_1, \dots, \sigma_\ell\}$ are independent uniform $\pm1$-valued (Rademacher) random variables. The Rademacher complexity of $\mathcal{F}$ is $R_\ell(\mathcal{F}) = \mathbb{E}_S\big[\hat{R}_\ell(\mathcal{F})\big]$.
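As an added illustration (not part of the original text), the empirical Rademacher complexity of a small finite function class can be estimated by Monte Carlo over the Rademacher variables $\sigma$. The function class and sample below are hypothetical, chosen only to show the definition in action.

```python
import numpy as np

def empirical_rademacher(function_values, n_draws=10000, rng=None):
    """Monte Carlo estimate of E_sigma[ sup_f | (2/l) sum_i sigma_i f(x_i) | ].

    function_values : (num_functions, l) array; row f holds (f(x_1), ..., f(x_l)).
    """
    rng = np.random.default_rng() if rng is None else rng
    l = function_values.shape[1]
    total = 0.0
    for _ in range(n_draws):
        sigma = rng.choice([-1.0, 1.0], size=l)        # Rademacher variables
        corr = (2.0 / l) * function_values @ sigma      # correlation of each f with sigma
        total += np.abs(corr).max()                     # sup over the (finite) function class
    return total / n_draws

# Hypothetical example: 3 fixed functions evaluated on a sample of size l = 50.
rng = np.random.default_rng(0)
x = rng.standard_normal(50)
F = np.stack([np.sign(x), np.tanh(x), 0.5 * np.ones_like(x)])
print(empirical_rademacher(F, n_draws=5000, rng=rng))
```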

Theorem 1 ((Koltchinskii and Panchenko, 2000)). Fix $\delta \in (0, 1)$ and let $\mathcal{F}$ be a class of functions mapping from $\mathcal{M}$ to $[0, 1]$. Let $(M_i)_{i=1}^{\ell}$ be drawn independently according to a probability distribution $D$. Then with probability at least $1-\delta$ over random draws of samples of size $\ell$, every $f \in \mathcal{F}$ satisfies

$$\mathbb{E}[f(M)] \le \hat{\mathbb{E}}[f(M)] + R_\ell(\mathcal{F}) + \sqrt{\frac{\ln(2/\delta)}{2\ell}} \le \hat{\mathbb{E}}[f(M)] + \hat{R}_\ell(\mathcal{F}) + 3\sqrt{\frac{\ln(2/\delta)}{2\ell}},$$

where $\hat{\mathbb{E}}[f(M)]$ denotes the empirical expectation over the sample.

2 Bound Derivation

Theorem 2 ((Ledoux and Talagrand, 1991)). Let $\mathcal{F}$ be a class of real functions. If $A : \mathbb{R} \rightarrow \mathbb{R}$ is Lipschitz with constant $L$ and satisfies $A(0) = 0$, then $\hat{R}_\ell(A \circ \mathcal{F}) \le 2L\,\hat{R}_\ell(\mathcal{F})$.

Lemma 2. Let $\mathcal{F}$ be a class of real functions and $\mathcal{H} = [\mathcal{F}_j]_{j=1}^{k}$ be a $k$-dimensional function class. If $A : \mathbb{R}^k \rightarrow \mathbb{R}$ is a Lipschitz function with constant $L$ and satisfies $A(0) = 0$, then $\hat{R}_\ell(A \circ \mathcal{H}) \le 2kL\,\hat{R}_\ell(\mathcal{F})$.

Lemma 3 (Classifier Generalization Bound). The generalization bound of a $k$-class classifier with the logistic loss function is directly related to the Rademacher complexity of that classifier:

$$\mathbb{E}[A_y(o)] \le \frac{1}{\ell}\sum_{i=1}^{\ell} A_{y_i}(o_i) + 2k\,\hat{R}_\ell(\mathcal{F}) + 3\sqrt{\frac{\ln(2/\delta)}{2\ell}}.$$

Proof. From Lemma 1, the shifted logistic loss $(A - c)$, with constant $c = \ln k$, satisfies $|(A - c)'(o)| \le 1$ and $(A - c)(0) = 0$. By Lemma 2, $\hat{R}_\ell((A - c) \circ \mathcal{F}) \le 2k\,\hat{R}_\ell(\mathcal{F})$.



Lemma 4. For all neuron activations (sigmoid, tanh, and relu) we have $\hat{R}_\ell(a \circ \mathcal{F}) \le 2\hat{R}_\ell(\mathcal{F})$.

Lemma 5 (Network Layer Bound). Let $\mathcal{G} = [\mathcal{F}_j]_{j=1}^{d}$ be a $d$-dimensional function class built from $\mathcal{F}$, and let $\mathcal{H}_B$ be the class of linear transform functions $\mathbb{R}^d \rightarrow \mathbb{R}$ parameterized by $W$ with $\|W\|_2 \le B$. Then $\hat{R}_\ell(\mathcal{H} \circ \mathcal{G}) \le \sqrt{d}\,B\,\hat{R}_\ell(\mathcal{F})$.

Proof.

$$\hat{R}_\ell(\mathcal{H} \circ \mathcal{G}) = \mathbb{E}_\sigma\Big[\sup_{h\in\mathcal{H},\,g\in\mathcal{G}}\Big|\frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i\, h \circ g(x_i)\Big|\Big] = \mathbb{E}_\sigma\Big[\sup_{g\in\mathcal{G},\,\|W\|\le B}\Big|\Big\langle W,\; \frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i\, g(x_i)\Big\rangle\Big|\Big]$$

$$\le B\,\mathbb{E}_\sigma\Big[\sup_{f_j\in\mathcal{F}}\Big\|\Big[\frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i^{j} f_j(x_i)\Big]_{j=1}^{d}\Big\|\Big] = B\sqrt{d}\,\mathbb{E}_\sigma\Big[\sup_{f\in\mathcal{F}}\Big|\frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i f(x_i)\Big|\Big] = \sqrt{d}\,B\,\hat{R}_\ell(\mathcal{F}).$$

Remark 2. Given a layer in our network, we denote the composition of all layers before it as $\mathcal{G} = [\mathcal{F}_j]_{j=1}^{d}$. This layer has the linear transformation function $H$ and activation function $a$. By Lemma 4 and Lemma 5, the network complexity is bounded by

$$\hat{R}_\ell(H \circ \mathcal{G}) \le c\sqrt{d}\,B\,\hat{R}_\ell(\mathcal{F}),$$

where $c = 1$ for the identity neuron and $c = 2$ for the others.
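As an added illustration of how this remark composes across depth (an assumption-laden sketch, not a statement from the original text): applying it twice to two stacked layers with weight bounds $B_1, B_2$, input dimensions $d_1, d_2$ (where $d_2$ is the first layer's output width), and non-identity activations gives

$$\hat{R}_\ell\big(a \circ H_2 \circ a \circ H_1 \circ \mathcal{G}_0\big) \le \big(2\sqrt{d_2}\,B_2\big)\big(2\sqrt{d_1}\,B_1\big)\,\hat{R}_\ell(\mathcal{F}_0),$$

so each additional layer multiplies the bound by its own factor $c\sqrt{d}\,B$.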

Lemma 6. Let $\mathcal{F}_M$ be the class of real functions that depend on $M$; then $\hat{R}_\ell(\mathbb{E}_M[\mathcal{F}_M]) \le \mathbb{E}_M[\hat{R}_\ell(\mathcal{F}_M)]$.

Proof.

$$\hat{R}_\ell(\mathbb{E}_M[\mathcal{F}_M]) = \hat{R}_\ell\Big(\sum_M p(M)\mathcal{F}_M\Big) \le \sum_M \hat{R}_\ell(p(M)\mathcal{F}_M) \le \sum_M |p(M)|\,\hat{R}_\ell(\mathcal{F}_M) = \mathbb{E}_M\big[\hat{R}_\ell(\mathcal{F}_M)\big],$$

because of the common facts 1) $\hat{R}_\ell(c\mathcal{F}) = |c|\,\hat{R}_\ell(\mathcal{F})$ and 2) $\hat{R}_\ell(\sum_i \mathcal{F}_i) \le \sum_i \hat{R}_\ell(\mathcal{F}_i)$.

Theorem 3 (DropConnect Network Complexity). Consider the DropConnect neural network defined in Definition 1. Let $\hat{R}_\ell(\mathcal{G})$ be the empirical Rademacher complexity of the feature extractor and $\hat{R}_\ell(\mathcal{F})$ be the empirical Rademacher complexity of the whole network. In addition, we assume:

1. the weight parameters of the DropConnect layer satisfy $|W| \le B_h$;
2. the weight parameters of $s$ satisfy $|W_s| \le B_s$ (so its L2-norm is bounded by $\sqrt{dk}\,B_s$).

Then we have

$$\hat{R}_\ell(\mathcal{F}) \le p\big(2\sqrt{k}\,d\,B_s\,n\sqrt{d}\,B_h\big)\hat{R}_\ell(\mathcal{G}).$$

Proof.

$$\hat{R}_\ell(\mathcal{F}) = \hat{R}_\ell\big(\mathbb{E}_M[f(x;\theta,M)]\big) \le \mathbb{E}_M\big[\hat{R}_\ell(f(x;\theta,M))\big] \qquad (2)$$

$$= \mathbb{E}_M\big[\hat{R}_\ell(s \circ a \circ h_M \circ g)\big] \le (\sqrt{dk}\,B_s)\sqrt{d}\;\mathbb{E}_M\big[\hat{R}_\ell(a \circ h_M \circ g)\big] \qquad (3)$$

$$= 2\sqrt{k}\,d\,B_s\;\mathbb{E}_M\big[\hat{R}_\ell(h_M \circ g)\big] \qquad (4)$$

where $h_M = (M \star W)v$, the element-wise product of mask and weights applied to $v$. Equation (2) is based on Lemma 6, Equation (3) is based on Lemma 5, and Equation (4) follows from Lemma 4.

$$\mathbb{E}_M\big[\hat{R}_\ell(h_M \circ g)\big] = \mathbb{E}_{M,\sigma}\Big[\sup_{h\in\mathcal{H},\,g\in\mathcal{G}}\Big|\frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i\, W^T D_M\, g(x_i)\Big|\Big] \qquad (5)$$

$$= \mathbb{E}_{M,\sigma}\Big[\sup_{h\in\mathcal{H},\,g\in\mathcal{G}}\Big|\Big\langle D_M W,\; \frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i\, g(x_i)\Big\rangle\Big|\Big]$$

$$\le \mathbb{E}_M\Big[\max_W \|D_M W\|\Big]\;\mathbb{E}_\sigma\Big[\sup_{g_j\in\mathcal{G}}\Big\|\Big[\frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i\, g_j(x_i)\Big]_{j=1}^{n}\Big\|\Big] \qquad (6)$$

$$\le B_h\, p\sqrt{nd}\,\big(\sqrt{n}\,\hat{R}_\ell(\mathcal{G})\big) = p\, n\sqrt{d}\, B_h\, \hat{R}_\ell(\mathcal{G}),$$

where $D_M$ in Equation (5) is a diagonal matrix with diagonal elements equal to $M$, and inner product properties lead to Equation (6). Thus we have $\hat{R}_\ell(\mathcal{F}) \le p\big(2\sqrt{k}\,d\,B_s\,n\sqrt{d}\,B_h\big)\hat{R}_\ell(\mathcal{G})$.

Remark 3. Theorem 3 implies that $p$ acts as an additional regularizer added to the network when we convert a normal neural network into a network with DropConnect layers. Consider the following extreme cases:

1. $p = 0$: the network generalization bound equals 0, which is expected because the classifier no longer depends on the input.

2. $p = 1$: the bound reduces to that of a normal network.
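For an intermediate value, say the $p = 0.5$ used in the experiments (this evaluation is added here as an illustration), the bound simply halves the complexity factor relative to a standard network:

$$\hat{R}_\ell(\mathcal{F})\Big|_{p=0.5} \le \tfrac{1}{2}\big(2\sqrt{k}\,d\,B_s\,n\sqrt{d}\,B_h\big)\hat{R}_\ell(\mathcal{G}) = \sqrt{k}\,d\,B_s\,n\sqrt{d}\,B_h\,\hat{R}_\ell(\mathcal{G}).$$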



| Symbol | Description | Related Formula |
|---|---|---|
| $y$ | Data label; can be either an integer label or a bit vector (depends on context) | |
| $x$ | Network input data | |
| $g(\cdot)$ | Feature extractor function with parameter $W_g$ | |
| $v$ | Feature extractor network output | $v = g(x; W_g)$ |
| $M$ | DropConnect connection information parameter (weight mask) | |
| $h(\cdot)$ | DropConnect transformation function with parameters $W$, $M$ | |
| $u$ | DropConnect output | $u = h(v; W, M)$ |
| $a(\cdot)$ | DropConnect activation function | |
| $r$ | DropConnect output after activation | $r = a(u)$ |
| $s(\cdot)$ | Dimension reduction layer function with parameter $W_s$ | |
| $o$ | Dimension reduction layer output (network output) | $o = s(r; W_s)$ |
| $\theta$ | All parameters of the network except the weight mask | $\theta = \{W_s, W, W_g\}$ |
| $f(\cdot)$ | Overall classifier (network) output | $o = f(x; \theta, M)$ |
| $\lambda$ | Weight penalty | |
| $A(\cdot)$ | Data loss function | $A(o - y)$ |
| $L(\cdot)$ | Overall objective function | $L(x, y) = \sum_i A(o_i - y_i) + \frac{1}{2}\lambda\|W\|_2^2$ |
| $n$ | Dimension of feature extractor output | |
| $d$ | Dimension of DropConnect layer output | |
| $k$ | Number of classes | $\dim(y) = k$ |

Table 1. Symbol Table

References

V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, New York, 1991.





Regularization of Neural Networks using DropConnect

samples at the inputs to the activation function of thefully connected layer and average their activations

To anneal the initial learning rate we choose a fixedmultiplier for different stages of training We reportthree numbers of epochs such as 600-400-200 to defineour schedule We multiply the initial rate by 1 for thefirst such number of epochs Then we use a multiplierof 05 for the second number of epochs followed by01 again for this second number of epochs The thirdnumber of epochs is used for multipliers of 005 0010005 and 0001 in that order after which point wereport our results We determine the epochs to use forour schedule using a validation set to look for plateausin the loss function at which point we move to thenext multiplier 3

Once the 5 networks are trained we report two num-bers 1) the mean and standard deviation of the classi-fication errors produced by each of the 5 independentnetworks and 2) the classification error that resultswhen averaging the output probabilities from the 5networks before making a prediction We find in prac-tice this voting scheme inspired by (Ciresan et al2012) provides significant performance gains achiev-ing state-of-the-art results in many standard bench-marks when combined with our DropConnect layer

61 MNIST

The MNIST handwritten digit classification task (Le-Cun et al 1998) consists of 28times28 black and white im-ages each containing a digit 0 to 9 (10-classes) Eachdigit in the 60 000 training images and 10 000 testimages is normalized to fit in a 20times20 pixel box whilepreserving their aspect ratio We scale the pixel valuesto the [0 1] range before inputting to our models

For our first experiment on this dataset we train mod-els with two fully connected layers each with 800 out-put units using either tanh sigmoid or relu activationfunctions to compare to Dropout in (Hinton et al2012) The first layer takes the image pixels as inputwhile the second layerrsquos output is fed into a 10-classsoftmax classification layer In Table 2 we show theperformance of various activations functions compar-ing No-Drop Dropout and DropConnect in the fullyconnected layers No data augmentation is utilized inthis experiment We use an initial learning rate of 01and train for 600-400-20 epochs using our schedule

From Table 2 we can see that both Dropout and Drop-

3In all experiments the bias learning rate is 2times thelearning rate for the weights Additionally weights are ini-tialized with N(0 01) random values for fully connectedlayers and N(0 001) for convolutional layers

neuron model error()5 network

votingerror()

relu No-Drop 162plusmn0037 140Dropout 128plusmn0040 120DropConnect 120plusmn0034 112

sigmoid No-Drop 178plusmn0037 174Dropout 138plusmn0039 136DropConnect 155plusmn0046 148

tanh No-Drop 165plusmn0026 149Dropout 158plusmn0053 155DropConnect 136plusmn0054 135

Table 2 MNIST classification error rate for models withtwo fully connected layers of 800 neurons each No dataaugmentation is used in this experiment

Connect perform better than not using either methodDropConnect mostly performs better than Dropout inthis task with the gap widening when utilizing thevoting over the 5 models

To further analyze the effects of DropConnect weshow three explanatory experiments in Fig 2 using a 2-layer fully connected model on MNIST digits Fig 2ashows test performance as the number of hidden unitsin each layer varies As the model size increases No-Drop overfits while both Dropout and DropConnectimprove performance DropConnect consistently givesa lower error rate than Dropout Fig 2b shows the ef-fect of varying the drop rate p for Dropout and Drop-Connect for a 400-400 unit network Both methodsgive optimal performance in the vicinity of 05 thevalue used in all other experiments in the paper Oursampling approach gives a performance gain over meaninference (as used by Hinton (Hinton et al 2012))but only for the DropConnect case In Fig 2c weplot the convergence properties of the three methodsthroughout training on a 400-400 network We cansee that No-Drop overfits quickly while Dropout andDropConnect converge slowly to ultimately give supe-rior test performance DropConnect is even slower toconverge than Dropout but yields a lower test errorin the end

In order to improve our classification result we choosea more powerful feature extractor network described in(Ciresan et al 2012) (relu is used rather than tanh)This feature extractor consists of a 2 layer CNN with32-64 feature maps in each layer respectively Thelast layerrsquos output is treated as input to the fully con-nected layer which has 150 relu units on which No-Drop Dropout or DropConnect are applied We re-port results in Table 3 from training the network ona) the original MNIST digits b) cropped 24 times 24 im-ages from random locations and c) rotated and scaledversions of these cropped images We use an initial

Regularization of Neural Networks using DropConnect

200 400 800 160011

12

13

14

15

16

17

18

19

2

21

Hidden Units

T

est E

rror

NominusDropDropoutDropConnect

0 01 02 03 04 05 06 07 08 0912

14

16

18

2

22

24

of Elements Kept

T

est E

rror

Dropout (mean)DropConnect (mean)Dropout (sampling)DropConnect (sampling)

100 200 300 400 500 600 700 800 90010minus3

10minus2

Epoch

Cro

ss E

ntro

py

NominusDrop TrainNominusDrop TestDropout TrainDropout TestDropConnect TrainDropConnect Test

Figure 2 Using the MNIST dataset in a) we analyze the ability of Dropout and DropConnect to prevent overfittingas the size of the 2 fully connected layers increase b) Varying the drop-rate in a 400-400 network shows near optimalperformance around the p = 05 proposed by (Hinton et al 2012) c) we show the convergence properties of the traintestsets See text for discussion

learning rate of 001 with a 700-200-100 epoch sched-ule no momentum and preprocess by subtracting theimage mean

crop rotationscaling

model error()5 network

votingerror()

no no No-Drop 077plusmn0051 067Dropout 059plusmn0039 052DropConnect 063plusmn0035 057

yes no No-Drop 050plusmn0098 038Dropout 039plusmn0039 035DropConnect 039plusmn0047 032

yes yes No-Drop 030plusmn0035 021Dropout 028plusmn0016 027DropConnect 028plusmn0032 021

Table 3 MNIST classification error Previous state of theart is 0 47 (Zeiler and Fergus 2013) for a single modelwithout elastic distortions and 023 with elastic distor-tions and voting (Ciresan et al 2012)

We note that our approach surpasses the state-of-the-art result of 023 (Ciresan et al 2012) achieving a021 error rate without the use of elastic distortions(as used by (Ciresan et al 2012))

62 CIFAR-10

CIFAR-10 is a data set of natural 32x32 RGB images(Krizhevsky 2009) in 10-classes with 50 000 imagesfor training and 10 000 for testing Before inputtingthese images to our network we subtract the per-pixelmean computed over the training set from each image

The first experiment on CIFAR-10 (summarized inTable 4) uses the simple convolutional network fea-ture extractor described in (Krizhevsky 2012)(layers-80seccfg) that is designed for rapid training ratherthan optimal performance On top of the 3-layerfeature extractor we have a 64 unit fully connectedlayer which uses No-Drop Dropout or DropConnectNo data augmentation is utilized for this experiment

Since this experiment is not aimed at optimal perfor-mance we report a single modelrsquos performance with-out voting We train for 150-0-0 epochs with an ini-tial learning rate of 0001 and their default weight de-cay DropConnect prevents overfitting of the fully con-nected layer better than Dropout in this experiment

model error()No-Drop 235Dropout 197DropConnect 187

Table 4 CIFAR-10 classification error using the simplefeature extractor described in (Krizhevsky 2012)(layers-80seccfg) and with no data augmentation

Table 5 shows classification results of the network us-ing a larger feature extractor with 2 convolutionallayers and 2 locally connected layers as describedin (Krizhevsky 2012)(layers-conv-local-11pctcfg) A128 neuron fully connected layer with relu activationsis added between the softmax layer and feature extrac-tor Following (Krizhevsky 2012) images are croppedto 24x24 with horizontal flips and no rotation or scal-ing is performed We use an initial learning rate of0001 and train for 700-300-50 epochs with their de-fault weight decay Model voting significantly im-proves performance when using Dropout or DropCon-nect the latter reaching an error rate of 941 Ad-ditionally we trained a model with 12 networks withDropConnect and achieved a state-of-the-art result of932 indicating the power of our approach

63 SVHN

The Street View House Numbers (SVHN) dataset in-cludes 604 388 images (both training set and extra set)and 26 032 testing images (Netzer et al 2011) Simi-lar to MNIST the goal is to classify the digit centeredin each 32x32 RGB image Due to the large variety ofcolors and brightness variations in the images we pre-

Regularization of Neural Networks using DropConnect

model error() 5 network votingerror()

No-Drop 1118plusmn 013 1022Dropout 1152plusmn 018 983DropConnect 1110plusmn 013 941

Table 5 CIFAR-10 classification error using a larger fea-ture extractor Previous state-of-the-art is 95 (Snoeket al 2012) Voting with 12 DropConnect networks pro-duces an error rate of 932 significantly beating thestate-of-the-art

process the images using local contrast normalizationas in (Zeiler and Fergus 2013) The feature extractoris the same as the larger CIFAR-10 experiment butwe instead use a larger 512 unit fully connected layerwith relu activations between the softmax layer andthe feature extractor After contrast normalizing thetraining data is randomly cropped to 28 times 28 pixelsand is rotated and scaled We do not do horizontalflips Table 6 shows the classification performance for5 models trained with an initial learning rate of 0001for a 100-50-10 epoch schedule

Due to the large training set size both Dropout andDropConnect achieve nearly the same performance asNo-Drop However using our data augmentation tech-niques and careful annealing the per model scores eas-ily surpass the previous 280 state-of-the-art resultof (Zeiler and Fergus 2013) Furthermore our vot-ing scheme reduces the relative error of the previousstate-of-to-art by 30 to achieve 194 error

model error() 5 network votingerror()

No-Drop 226plusmn 0072 194Dropout 225plusmn 0034 196DropConnect 223plusmn 0039 194

Table 6 SVHN classification error The previous state-of-the-art is 28 (Zeiler and Fergus 2013)

64 NORB

In the final experiment we evaluate our models onthe 2-fold NORB (jittered-cluttered) dataset (LeCunet al 2004) a collection of stereo images of 3D mod-els For each image one of 6 classes appears on arandom background We train on 2-folds of 29 160images each and the test on a total of 58 320 imagesThe images are downsampled from 108times108 to 48times48as in (Ciresan et al 2012)

We use the same feature extractor as the largerCIFAR-10 experiment There is a 512 unit fully con-nected layer with relu activations placed between thesoftmax layer and feature extractor Rotation andscaling of the training data is applied but we do notcrop or flip the images as we found that to hurt per-

model error()5 network

votingerror()

No-Drop 448plusmn 078 336Dropout 396plusmn 016 303DropConnect 414plusmn 006 323

Table 7 NORM classification error for the jittered-cluttered dataset using 2 training folds The previousstate-of-art is 357 (Ciresan et al 2012)

formance on this dataset We trained with an initiallearning rate of 0001 and anneal for 100-40-10 epochs

In this experiment we beat the previous state-of-the-art result of 357 using No-Drop Dropout and Drop-Connect with our voting scheme While Dropout sur-passes DropConnect slightly both methods improveover No-Drop in this benchmark as shown in Table 7

7 Discussion

We have presented DropConnect which generalizesHinton et al rsquos Dropout (Hinton et al 2012) to the en-tire connectivity structure of a fully connected neuralnetwork layer We provide both theoretical justifica-tion and empirical results to show that DropConnecthelps regularize large neural network models Resultson a range of datasets show that DropConnect oftenoutperforms Dropout While our current implementa-tion of DropConnect is slightly slower than No-Drop orDropout in large models models the feature extractoris the bottleneck thus there is little difference in over-all training time DropConnect allows us to train largemodels while avoiding overfitting This yields state-of-the-art results on a variety of standard benchmarksusing our efficient GPU implementation of DropCon-nect

8 Appendix

81 Preliminaries

Definition 1 (DropConnect Network) Given dataset S with ` entries x1x2 x` with labelsy1 y2 y` we define the DropConnect networkas a mixture model o =

sumM p(M)f(x θM) =

EM [f(x θM)]

Each network f(x θM) has weights p(M) and net-work parameters are θ = WsWWg Ws are thesoftmax layer parameters W are the DropConnectlayer parameters and Wg are the feature extractor pa-rameters Further more M is the DropConnect layermask

Now we reformulate the cross-entropy loss on top ofthe softmax into a single parameter function that com-bines the softmax output and labels as a logistic

Regularization of Neural Networks using DropConnect

Definition 2 (Logistic Loss) The following loss func-tion defined on k-class classification is call the logisticloss function

Ay(o) = minussumi

yi lnexp oisumj exp(oj)

= minusoi + lnsumj

exp(oj)

where y is binary vector with ith bit set on

Lemma 1 Logistic loss function A has the followingproperties 1) Ay(0) = ln k 2) minus1 le Aprimey(o) le 1 and3)Aprimeprimey(o) ge 0

Definition 3 (Rademacher complexity) For a sampleS = x1 x` generated by a distribution D on setX and a real-valued function class F in domain X theempirical Rademacher complexity of F is the randomvariable

R` (F) = Eσ

[supfisinF|2`

sumi=1

σif(xi)| | x1 x`

]

where sigma = σ1 σ` are independent uniformplusmn1-valued (Rademacher) random variables The

Rademacher complexity of F is R`(F) = ES

[R` (F)

]

82 Bound Derivation

Lemma 2 ((Ledoux and Talagrand 1991)) Let Fbe class of real functions and H = [Fj ]kj=1 be a k-

dimensional function class If A Rk rarr R is a Lips-chitz function with constant L and satisfies A(0) = 0then R`(A H) le 2kLR`(F)

Lemma 3 (Classifier Generalization Bound) Gener-alization bound of a k-class classifier with logistic lossfunction is directly related Rademacher complexity ofthat classifier

E[Ay(o)] le 1`

sum`i=1Ayi(oi) + 2kR`(F) + 3

radicln(2δ)

2`

Lemma 4 For all neuron activations sigmoid tanhand relu we have R`(a F) le 2R`(F)

Lemma 5 (Network Layer Bound) Let G be the classof real functions Rd rarr R with input dimension F ieG = [Fj ]dj=1 and HB is a linear transform function

parametrized by W with W2 le B then R`(HG) leradicdBR`(F)

Proof R`(HG) = Eσ[suphisinHgisinG

∣∣∣ 2` sum`i=1 σih g(xi)

∣∣∣]= Eσ

[supgisinGWleB

∣∣∣langW 2`

sum`i=1 σig(xi)

rang∣∣∣]le BEσ

[supfjisinF

∥∥∥∥[ 2` sum`i=1 σ

ji fj(xi)

]dj=1

∥∥∥∥]= BradicdEσ

[supfisinF

∣∣∣ 2` sum`i=1 σif(xi)

∣∣∣] =radicdBR`(F)

Remark 1 Given a layer in our network we denotethe function of all layers before as G = [Fj ]dj=1 This

layer has the linear transformation function H and ac-tivation function a By Lemma 4 and Lemma 5 weknow the network complexity is bounded by

R`(H G) le cradicdBR`(F)

where c = 1 for identity neuron and c = 2 for others

Lemma 6 Let FM be the class of real functions that

depend on M then R`(EM [FM ]) le EM

[R`(FM )

]Proof R`(EM [FM ]) = R`

(sumM p (m)FM

)lesum

M R`(p(m)FM ) lesumM |p(m)|R`(FM ) = EM

[R`(FM )

]

Theorem 1 (DropConnect Network Complexity)Consider the DropConnect neural network defined inDefinition 1 Let R`(G) be the empirical Rademachercomplexity of the feature extractor and R`(F) be theempirical Rademacher complexity of the whole net-work In addition we assume

1 weight parameter of DropConnect layer |W | le Bh2 weight parameter of s ie |Ws| le Bs (L2-norm ofit is bounded by

radicdkBs)

Then we have R`(F) le p(

2radickdBsn

radicdBh

)R`(G)

Proof

R`(F) = R`(EM [f(x θM ]) le EM[R`(f(x θM)

](6)

le (radicdkBs)

radicdEM

[R`(a hM g)

](7)

= 2radickdBsEM

[R`(hM g)

](8)

where hM = (M W )v Equation (6) is based onLemma 6 Equation (7) is based on Lemma 5 andEquation (8) follows from Lemma 4

E_M[\hat{R}_\ell(h_M \circ g)] = E_{M,\sigma}\left[\sup_{h \in H, g \in G}\left|\frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i W^T D_M g(x_i)\right|\right]   (9)
= E_{M,\sigma}\left[\sup_{h \in H, g \in G}\left|\left\langle D_M W, \frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i g(x_i)\right\rangle\right|\right]
\le E_M\left[\max_W \|D_M W\|\right] E_\sigma\left[\sup_{g_j \in G}\left\|\left[\frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i g_j(x_i)\right]_{j=1}^{n}\right\|\right]   (10)
\le B_h\, p\sqrt{nd}\left(\sqrt{n}\,\hat{R}_\ell(G)\right) = p\,n\sqrt{d}\,B_h\,\hat{R}_\ell(G)

where D_M in Equation (9) is a diagonal matrix with diagonal elements given by the mask M, and inner product properties lead to Equation (10). Thus we have

\hat{R}_\ell(F) \le p\left(2\sqrt{kd}\,B_s\, n\sqrt{d}\,B_h\right)\hat{R}_\ell(G)


References

D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), CVPR '12, pages 3642–3649, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-1-4673-1226-4.

G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.

A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master's thesis, University of Toronto, 2009.

A. Krizhevsky. cuda-convnet. http://code.google.com/p/cuda-convnet, 2012.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov. 1998. ISSN 0018-9219. doi: 10.1109/5.726791.

Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, CVPR '04, pages 97–104, Washington, DC, USA, 2004. IEEE Computer Society.

M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, New York, 1991.

D. J. C. Mackay. Probable networks and plausible predictions – a review of practical Bayesian methods for supervised neural networks. In Bayesian Methods for Backpropagation Networks. Springer, 1995.

V. Nair and G. E. Hinton. Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML, 2010.

Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.

A. S. Weigend, D. E. Rumelhart, and B. A. Huberman. Generalization by weight-elimination with application to forecasting. In NIPS, 1991.

M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In ICLR, 2013.


Regularization of Neural Networks using DropConnect: Supplementary Material

1 Preliminaries

Definition 1 (DropConnect Network). Given a dataset S with \ell entries x_1, x_2, \dots, x_\ell and labels y_1, y_2, \dots, y_\ell, we define the DropConnect network as a mixture model:

o = \sum_M p(M) f(x; \theta, M) = E_M[f(x; \theta, M)]   (1)

Each network f(x; \theta, M) has weight p(M), and the network parameters are \theta = {W_s, W, W_g}: W_s are the softmax layer parameters, W are the DropConnect layer parameters, and W_g are the feature extractor parameters. Furthermore, M is the DropConnect layer mask.
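A brute-force sketch of this mixture model for a tiny layer is shown below: it enumerates every mask M of a 2x2 weight matrix, weights each sub-model output by p(M), and checks that the sum matches a Monte Carlo average over sampled masks. The layer sizes, keep probability, and identity activation are illustrative assumptions.

import itertools
import numpy as np

rng = np.random.default_rng(0)
p = 0.5
W = np.array([[0.3, -0.2], [0.1, 0.4]])       # DropConnect weights (d = 2, n = 2)
v = np.array([1.0, 2.0])                      # feature extractor output g(x; Wg)

def f(M):
    """Sub-model output for a fixed mask M (identity activation for simplicity)."""
    return (M * W) @ v

# exact mixture: sum over all 2^(d*n) masks, each weighted by p(M)
o_exact = np.zeros(2)
for bits in itertools.product([0, 1], repeat=W.size):
    M = np.array(bits, dtype=float).reshape(W.shape)
    p_M = p ** M.sum() * (1 - p) ** (M.size - M.sum())
    o_exact += p_M * f(M)

# Monte Carlo estimate of E_M[f(x; theta, M)]
o_mc = np.mean([f(rng.binomial(1, p, size=W.shape)) for _ in range(20000)], axis=0)
print(o_exact, o_mc)                          # both are close to p * W @ v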

Remark 1. When each element M_{ij} of the mask has equal probability of being on and off (p = 0.5), the mixture model places equal weight on all sub-models f(x; \theta, M); otherwise, the mixture model places larger weights on some sub-models than on others.

We reformulate the cross-entropy loss on top of the softmax into a single-parameter function that combines the softmax output and labels; this is the same as the logistic loss.

Definition 2 (Logistic Loss). The following loss function, defined on k-class classification, is called the logistic loss function:

A_y(o) = -\sum_i y_i \ln\frac{\exp(o_i)}{\sum_j \exp(o_j)} = -o_i + \ln\sum_j \exp(o_j)

where y is a binary vector with the i-th bit set on.

Lemma 1. The logistic loss function A has the following properties:

1. A_y(0) = \ln k
2. -1 \le A'_y(o) \le 1
3. A''_y(o) \ge 0

The first property says that A_y(0) is a constant that depends only on the number of labels. The second says that A is a Lipschitz function with L = 1. The third says that A is a convex function with respect to o.

Definition 3 (Rademacher complexity). For a sample S = {x_1, \dots, x_\ell} generated by a distribution D on a set X, and a real-valued function class F on domain X, the empirical Rademacher complexity of F is the random variable

\hat{R}_\ell(F) = E_\sigma\left[\sup_{f \in F}\left|\frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i f(x_i)\right| \;\middle|\; x_1, \dots, x_\ell\right]

where \sigma = {\sigma_1, \dots, \sigma_\ell} are independent uniform \pm 1-valued (Rademacher) random variables. The Rademacher complexity of F is R_\ell(F) = E_S[\hat{R}_\ell(F)].

Theorem 1 ((Koltchinskii and Panchenko, 2000)). Fix \delta \in (0, 1) and let F be a class of functions mapping from M to [0, 1]. Let (M_i)_{i=1}^{\ell} be drawn independently according to a probability distribution D. Then with probability at least 1 - \delta over random draws of samples of size \ell, every f \in F satisfies

E[f(M)] \le \hat{E}[f(M)] + R_\ell(F) + \sqrt{\frac{\ln(2/\delta)}{2\ell}}
\le \hat{E}[f(M)] + \hat{R}_\ell(F) + 3\sqrt{\frac{\ln(2/\delta)}{2\ell}}

2 Bound Derivation

Theorem 2 ((Ledoux and Talagrand, 1991)). Let F be a class of real functions. If A : R \to R is Lipschitz with constant L and satisfies A(0) = 0, then \hat{R}_\ell(A \circ F) \le 2L\,\hat{R}_\ell(F).

Lemma 2. Let F be a class of real functions and H = [F_j]_{j=1}^k be a k-dimensional function class. If A : R^k \to R is a Lipschitz function with constant L and satisfies A(0) = 0, then \hat{R}_\ell(A \circ H) \le 2kL\,\hat{R}_\ell(F).

Lemma 3 (Classifier Generalization Bound). The generalization bound of a k-class classifier with the logistic loss function is directly related to the Rademacher complexity of that classifier:

E[A_y(o)] \le \frac{1}{\ell}\sum_{i=1}^{\ell} A_{y_i}(o_i) + 2k\,\hat{R}_\ell(F) + 3\sqrt{\frac{\ln(2/\delta)}{2\ell}}

Proof. From Lemma 1, the shifted logistic loss (A - c)(x) is 1-Lipschitz, since |(A - c)'(x)| \le 1, and satisfies (A - c)(0) = 0 for a suitable constant c. By Lemma 2, \hat{R}_\ell((A - c) \circ F) \le 2k\,\hat{R}_\ell(F).



Lemma 4. For all neuron activations (sigmoid, tanh and relu) we have \hat{R}_\ell(a \circ F) \le 2\hat{R}_\ell(F).

Lemma 5 (Network Layer Bound). Let G be the class of real functions R^d \to R with input dimension F, i.e. G = [F_j]_{j=1}^d, and let H_B be a linear transform function parameterized by W with \|W\|_2 \le B. Then \hat{R}_\ell(H \circ G) \le \sqrt{d}\,B\,\hat{R}_\ell(F).

Proof.

\hat{R}_\ell(H \circ G) = E_\sigma\left[\sup_{h \in H, g \in G}\left|\frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i\, h \circ g(x_i)\right|\right]
= E_\sigma\left[\sup_{g \in G, \|W\| \le B}\left|\left\langle W, \frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i g(x_i)\right\rangle\right|\right]
\le B\, E_\sigma\left[\sup_{f_j \in F}\left\|\left[\frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i^j f_j(x_i)\right]_{j=1}^{d}\right\|\right]
= B\sqrt{d}\, E_\sigma\left[\sup_{f \in F}\left|\frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i f(x_i)\right|\right]
= \sqrt{d}\,B\,\hat{R}_\ell(F)

Remark 2. Given a layer in our network, we denote the function of all layers before it as G = [F_j]_{j=1}^d. This layer has the linear transformation function H and activation function a. By Lemma 4 and Lemma 5, we know the network complexity is bounded by

\hat{R}_\ell(H \circ G) \le c\sqrt{d}\,B\,\hat{R}_\ell(F)

where c = 1 for the identity neuron and c = 2 for the others.

Lemma 6. Let F_M be the class of real functions that depend on M; then \hat{R}_\ell(E_M[F_M]) \le E_M[\hat{R}_\ell(F_M)].

Proof.

\hat{R}_\ell(E_M[F_M]) = \hat{R}_\ell\left(\sum_M p(M) F_M\right) \le \sum_M \hat{R}_\ell(p(M) F_M) \le \sum_M |p(M)|\,\hat{R}_\ell(F_M) = E_M[\hat{R}_\ell(F_M)]

because of the standard facts that 1) \hat{R}_\ell(cF) = |c|\,\hat{R}_\ell(F) and 2) \hat{R}_\ell(\sum_i F_i) \le \sum_i \hat{R}_\ell(F_i).

Theorem 3 (DropConnect Network Complexity). Consider the DropConnect neural network defined in Definition 1. Let \hat{R}_\ell(G) be the empirical Rademacher complexity of the feature extractor and \hat{R}_\ell(F) be the empirical Rademacher complexity of the whole network. In addition, we assume:

1. the weight parameters of the DropConnect layer satisfy |W| \le B_h;
2. the weight parameters of s satisfy |W_s| \le B_s (so its L2-norm is bounded by \sqrt{dk}\,B_s).

Then we have

\hat{R}_\ell(F) \le p\left(2\sqrt{kd}\,B_s\, n\sqrt{d}\,B_h\right)\hat{R}_\ell(G)

Proof.

\hat{R}_\ell(F) = \hat{R}_\ell(E_M[f(x; \theta, M)]) \le E_M[\hat{R}_\ell(f(x; \theta, M))]   (2)
= E_M[\hat{R}_\ell(s \circ a \circ h_M \circ g)] \le (\sqrt{dk}\,B_s)\sqrt{d}\; E_M[\hat{R}_\ell(a \circ h_M \circ g)]   (3)
= 2\sqrt{kd}\,B_s\, E_M[\hat{R}_\ell(h_M \circ g)]   (4)

where h_M = (M \star W)v and \star denotes element-wise product. Equation (2) is based on Lemma 6, Equation (3) is based on Lemma 5, and Equation (4) follows from Lemma 4.

E_M[\hat{R}_\ell(h_M \circ g)] = E_{M,\sigma}\left[\sup_{h \in H, g \in G}\left|\frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i w^T D_M g(x_i)\right|\right]   (5)
= E_{M,\sigma}\left[\sup_{h \in H, g \in G}\left|\left\langle D_M w, \frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i g(x_i)\right\rangle\right|\right]
\le E_M\left[\max_w \|D_M w\|\right] E_\sigma\left[\sup_{g_j \in G}\left\|\left[\frac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i g_j(x_i)\right]_{j=1}^{n}\right\|\right]   (6)
\le B_h\, p\sqrt{nd}\left(\sqrt{n}\,\hat{R}_\ell(G)\right) = p\,n\sqrt{d}\,B_h\,\hat{R}_\ell(G)

where D_M in Equation (5) is a diagonal matrix with diagonal elements given by the mask M, and inner product properties lead to Equation (6). Thus we have

\hat{R}_\ell(F) \le p\left(2\sqrt{kd}\,B_s\, n\sqrt{d}\,B_h\right)\hat{R}_\ell(G)

Remark 3. Theorem 3 implies that p acts as an additional regularizer added to the network when we convert a standard neural network into a network with DropConnect layers. Consider the following extreme cases:

1. p = 0: the network generalization bound equals 0, which makes sense because the classifier no longer depends on the input.
2. p = 1: the bound reduces to that of a standard network.
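To visualize how the Theorem 3 factor scales with the keep probability, the sketch below evaluates p(2\sqrt{kd} B_s n\sqrt{d} B_h) for a few values of p; all constants (k, d, n, B_s, B_h) are arbitrary illustrative choices, not values estimated from a trained network.

import math

k, d, n = 10, 128, 256          # illustrative class count and layer widths
B_s, B_h = 0.1, 0.1             # illustrative per-weight bounds

def complexity_factor(p):
    """Multiplier on R_hat(G) from Theorem 3: p * 2*sqrt(k*d)*B_s * n*sqrt(d)*B_h."""
    return p * 2 * math.sqrt(k * d) * B_s * n * math.sqrt(d) * B_h

for p in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(p, complexity_factor(p))   # grows linearly in p; zero when p = 0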



Symbol | Description | Related Formula
y      | Data label; can be either an integer label or a bit vector (depending on context) |
x      | Network input data |
g()    | Feature extractor function with parameter W_g |
v      | Feature extractor output | v = g(x; W_g)
M      | DropConnect connection information (weight mask) |
h()    | DropConnect transformation function with parameters W, M |
u      | DropConnect output | u = h(v; W, M)
a()    | DropConnect activation function |
r      | DropConnect output after activation | r = a(u)
s()    | Dimension reduction layer function with parameter W_s |
o      | Dimension reduction layer output (network output) | o = s(r; W_s)
θ      | All network parameters except the weight mask | θ = {W_s, W, W_g}
f()    | Overall classifier (network) | o = f(x; θ, M)
λ      | Weight penalty |
A()    | Data loss function | A(o − y)
L()    | Overall objective function | L(x, y) = \sum_i A(o_i − y_i) + (1/2)λ‖W‖_2^2
n      | Dimension of feature extractor output |
d      | Dimension of DropConnect layer output |
k      | Number of classes | dim(y) = k

Table 1. Symbol Table
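Tying the symbols in Table 1 together, here is a minimal sketch of one stochastic forward pass and the overall objective L(x, y) for a single example; the toy linear feature extractor, layer sizes, drop probability, and λ are all illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
n, d, k, p, lam = 8, 6, 4, 0.5, 1e-4      # illustrative dimensions and hyper-parameters

Wg = 0.1 * rng.standard_normal((n, 12))   # toy linear feature extractor g(x; Wg)
W  = 0.1 * rng.standard_normal((d, n))    # DropConnect layer weights
Ws = 0.1 * rng.standard_normal((k, d))    # dimension reduction (softmax) layer weights

def forward(x, M):
    v = Wg @ x                            # v = g(x; Wg)
    u = (M * W) @ v                       # u = h(v; W, M)
    r = np.maximum(u, 0.0)                # r = a(u), relu activation
    z = Ws @ r                            # scores o before the softmax normalization
    return z - np.max(z) - np.log(np.sum(np.exp(z - np.max(z))))  # log-softmax

def objective(x, y_index, M):
    """Per-example L = A_y(o) + (1/2) * lambda * ||W||_2^2."""
    log_probs = forward(x, M)
    return -log_probs[y_index] + 0.5 * lam * np.sum(W ** 2)

x = rng.standard_normal(12)
M = rng.binomial(1, p, size=W.shape)      # sample a DropConnect mask
print(objective(x, y_index=2, M=M))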

References

V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, New York, 1991.

Regularization of Neural Networks using DropConnect

Definition 2 (Logistic Loss) The following loss func-tion defined on k-class classification is call the logisticloss function

Ay(o) = minussumi

yi lnexp oisumj exp(oj)

= minusoi + lnsumj

exp(oj)

where y is binary vector with ith bit set on

Lemma 1 Logistic loss function A has the followingproperties 1) Ay(0) = ln k 2) minus1 le Aprimey(o) le 1 and3)Aprimeprimey(o) ge 0

Definition 3 (Rademacher complexity) For a sampleS = x1 x` generated by a distribution D on setX and a real-valued function class F in domain X theempirical Rademacher complexity of F is the randomvariable

R` (F) = Eσ

[supfisinF|2`

sumi=1

σif(xi)| | x1 x`

]

where sigma = σ1 σ` are independent uniformplusmn1-valued (Rademacher) random variables The

Rademacher complexity of F is R`(F) = ES

[R` (F)

]

82 Bound Derivation

Lemma 2 ((Ledoux and Talagrand 1991)) Let Fbe class of real functions and H = [Fj ]kj=1 be a k-

dimensional function class If A Rk rarr R is a Lips-chitz function with constant L and satisfies A(0) = 0then R`(A H) le 2kLR`(F)

Lemma 3 (Classifier Generalization Bound) Gener-alization bound of a k-class classifier with logistic lossfunction is directly related Rademacher complexity ofthat classifier

E[Ay(o)] le 1`

sum`i=1Ayi(oi) + 2kR`(F) + 3

radicln(2δ)

2`

Lemma 4 For all neuron activations sigmoid tanhand relu we have R`(a F) le 2R`(F)

Lemma 5 (Network Layer Bound) Let G be the classof real functions Rd rarr R with input dimension F ieG = [Fj ]dj=1 and HB is a linear transform function

parametrized by W with W2 le B then R`(HG) leradicdBR`(F)

Proof R`(HG) = Eσ[suphisinHgisinG

∣∣∣ 2` sum`i=1 σih g(xi)

∣∣∣]= Eσ

[supgisinGWleB

∣∣∣langW 2`

sum`i=1 σig(xi)

rang∣∣∣]le BEσ

[supfjisinF

∥∥∥∥[ 2` sum`i=1 σ

ji fj(xi)

]dj=1

∥∥∥∥]= BradicdEσ

[supfisinF

∣∣∣ 2` sum`i=1 σif(xi)

∣∣∣] =radicdBR`(F)

Remark 1 Given a layer in our network we denotethe function of all layers before as G = [Fj ]dj=1 This

layer has the linear transformation function H and ac-tivation function a By Lemma 4 and Lemma 5 weknow the network complexity is bounded by

R`(H G) le cradicdBR`(F)

where c = 1 for identity neuron and c = 2 for others

Lemma 6 Let FM be the class of real functions that

depend on M then R`(EM [FM ]) le EM

[R`(FM )

]Proof R`(EM [FM ]) = R`

(sumM p (m)FM

)lesum

M R`(p(m)FM ) lesumM |p(m)|R`(FM ) = EM

[R`(FM )

]

Theorem 1 (DropConnect Network Complexity)Consider the DropConnect neural network defined inDefinition 1 Let R`(G) be the empirical Rademachercomplexity of the feature extractor and R`(F) be theempirical Rademacher complexity of the whole net-work In addition we assume

1 weight parameter of DropConnect layer |W | le Bh2 weight parameter of s ie |Ws| le Bs (L2-norm ofit is bounded by

radicdkBs)

Then we have R`(F) le p(

2radickdBsn

radicdBh

)R`(G)

Proof

R`(F) = R`(EM [f(x θM ]) le EM[R`(f(x θM)

](6)

le (radicdkBs)

radicdEM

[R`(a hM g)

](7)

= 2radickdBsEM

[R`(hM g)

](8)

where hM = (M W )v Equation (6) is based onLemma 6 Equation (7) is based on Lemma 5 andEquation (8) follows from Lemma 4

EM[R`(hM g)

]= EMσ

[sup

hisinHgisinG

∣∣∣∣∣2` sumi=1

σiWTDMg(xi)

∣∣∣∣∣]

(9)

= EMσ

[sup

hisinHgisinG

∣∣∣∣∣langDMW

2

`

sumi=1

σig(xi)

rang∣∣∣∣∣]

le EM[maxWDMW

]Eσ

supgjisinG

∥∥∥∥∥∥[

2

`

sumi=1

σigj(xi)

]nj=1

∥∥∥∥∥∥(10)

le Bhpradicnd(radic

nR`(G))

= pnradicdBhR`(G)

where DM in Equation (9) is an diagonal matrixwith diagonal elements equal to m and inner prod-uct properties lead to Equation (10) Thus we have

R`(F) le p(

2radickdBsn

radicdBh

)R`(G)

Regularization of Neural Networks using DropConnect

References

D Ciresan U Meier and J Schmidhuber Multi-column deep neural networks for image classifica-tion In Proceedings of the 2012 IEEE Confer-ence on Computer Vision and Pattern Recognition(CVPR) CVPR rsquo12 pages 3642ndash3649 WashingtonDC USA 2012 IEEE Computer Society ISBN 978-1-4673-1226-4

G E Hinton N Srivastava A KrizhevskyI Sutskever and R Salakhutdinov Improving neu-ral networks by preventing co-adaptation of featuredetectors CoRR abs12070580 2012

A Krizhevsky Learning Multiple Layers of Featuresfrom Tiny Images Masterrsquos thesis University ofToront 2009

A Krizhevsky cuda-convnet httpcodegoogle

compcuda-convnet 2012

Y LeCun L Bottou Y Bengio and P HaffnerGradient-based learning applied to document recog-nition Proceedings of the IEEE 86(11)2278 ndash2324nov 1998 ISSN 0018-9219 doi 1011095726791

Y LeCun F J Huang and L Bottou Learning meth-ods for generic object recognition with invariance topose and lighting In Proceedings of the 2004 IEEEcomputer society conference on Computer vision andpattern recognition CVPRrsquo04 pages 97ndash104 Wash-ington DC USA 2004 IEEE Computer Society

M Ledoux and M Talagrand Probability in BanachSpaces Springer New York 1991

D J C Mackay Probable networks and plausiblepredictions - a review of practical bayesian methodsfor supervised neural networks In Bayesian methodsfor backpropagation networks Springer 1995

V Nair and G E Hinton Rectified Linear Units Im-prove Restricted Boltzmann Machines In ICML2010

Y Netzer T Wang Coates A A Bissacco B Wuand A Y Ng Reading digits in natural images withunsupervised feature learning In NIPS Workshopon Deep Learning and Unsupervised Feature Learn-ing 2011 2011

J Snoek H Larochelle and R A Adams Practi-cal bayesian optimization of machine learning algo-rithms In Neural Information Processing Systems2012

A S Weigend D E Rumelhart and B A HubermanGeneralization by weight-elimination with applica-tion to forecasting In NIPS 1991

M D Zeiler and R Fergus Stochastic pooling forregualization of deep convolutional neural networksIn ICLR 2013

000001002003004005006007008009010011012013014015016017018019020021022023024025026027028029030031032033034035036037038039040041042043044045046047048049050051052053054

055056057058059060061062063064065066067068069070071072073074075076077078079080081082083084085086087088089090091092093094095096097098099100101102103104105106107108109

Regularization of Neural Networks using DropConnectSupplementary Material

1 Preliminaries

Definition 1 (DropConnect Network) Given dataset S with ` entries x1x2 x` with labelsy1 y2 y` we define DropConnect network as amixture model

o =summ

p(M)f(x θM) = Em [f(x θM)] (1)

Each network f(x θM) has weights p(M) and net-work parameters are θ = WsWWg Ws are thesoftmax layer parameters w are the DropConnectlayer parameters and Wg are the feature extractor pa-rameters Further more m is the DropConnect layermask

Remark 1 when each element of Mi has equal proba-bility of being on and off (p = 05) the mixture modelhas equal weights for all sub-models f(x θM) oth-erwise the mixture model has larger weights in somesub-models than others

Reformulate cross-entropy loss on top of soft-max intoa single parameter function that combines soft-maxoutput and labels Same as logistic

Definition 2 (Logistic Loss) The following loss func-tion defined on k-class classification is call logistic lossfunction

Ay(o) = minussumi

yi lnexp oisumj exp(oj)

= minusoi + lnsumj

exp(oj)

where y is binary vector with ith bit set on

Lemma 1 Logistic loss function A has the followingproperties

1 Ay(0) = ln k2 minus1 le Aprimey(o) le 13 Aprimeprimey(o) ge 0

The first one says A(0) is depend on some constantrelated with number of labels The second one says Ais Lipschitz function with L = 1 The third one saysA is a convex function wrt x

Definition 3 (Rademacher complexity) For a sampleS = x1 x` generated by a distribution D on set

X and a real-valued function class F in domain X theempirical Rademacher complexity of F is the randomvariable

R` (F) = Eσ

[supfisinF|2`

sumi=1

σif(xi)| | x1 x`

]

where sigma = σ1 σ` are independent uniformplusmn1-valued (Rademacher) random variables The

Rademacher complexity of F is R`(F) = ES

[R` (F)

]

Theorem 1 ((Koltchinskii and Panchenko 2000))Fix δ isin (0 1) and let F be a class of functions mappingfrom M to [0 1] Let (Mi)

`iminus1 be drawn independently

according to a probability distribution D Then withprobability at least 1minusδ over random draws of samplesof size ` every f isin F satisfies

E [f(M)] le E [f(M)] +R`(F) +

radicln (2δ)

2`

le E [f(M)] + R`(F) + 3

radicln (2δ)

2`

2 Bound Derivation

Theorem 2 ((Ledoux and Talagrand 1991)) Let Fbe class of real functions If A R rarr R is Lipschitzwith constant L and satisfies A(0) = 0 then R`(A F ) le 2LR(F)

Lemma 2 Let F be class of real functions and H =[Fj ]kj=1 be a k-dimensional function class If A Rk rarrR is a Lipschitz function with constant L and satisfiesA(0) = 0 then R`(A H) le 2kLR`(F)

Lemma 3 (Classifier Generalization Bound) Gener-alization bound of a k-class classifier with logistic lossfunction is directly related Rademacher complexity ofthat classifier

E[Ay(o)] le 1

`

sumi=1

Ayi(oi) + 2kR`(F) + 3

radicln (2δ)

2`

Proof From Lemma 1 Logistic loss function (A minusc)(x) isin A due to (A minus c)prime(x) le 1 and (A minus c)(0) = 0with some constant c By Lemma 2 R`((Aminusc)F) le2kR`(F)

110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164

165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219

Regularization of Neural Networks using DropConnect Supplementary Material

Lemma 4 For all neuron activations sigmoid tanhand relu we have R`(a F) le 2R`(F)

Lemma 5 (Network Layer Bound) Let G be the classof real functions Rd rarr R with input dimension F ieG = [Fj ]dj=1 and HB is a linear transform function

parameterized by W with W2 le B then R`(HG) leradicdBR`(F)

Proof

R`(H G)

= Eσ

[sup

hisinHgisinG

∣∣∣∣∣2` sumi=1

σih g(xi)

∣∣∣∣∣]

= Eσ

[sup

gisinGWleB

∣∣∣∣∣langW

2

`

sumi=1

σig(xi)

rang∣∣∣∣∣]

le BEσ

supfjisinF

∥∥∥∥∥∥[

2

`

sumi=1

σji fj(xi)

]dj=1

∥∥∥∥∥∥

= BradicdEσ

[supfisinF

∣∣∣∣∣2` sumi=1

σif(xi)

∣∣∣∣∣]

=radicdBR`(F)

Remark 2 Given a layer in our network we denotethe function of all layers before as G = [Fj ]dj=1 Thislayer has the linear transformation function H and ac-tivation function a By Lemma 4 and Lemma 5 weknow the network complexity is bounded by

R`(H G) le cradicdBR`(F)

where c = 1 for identity neuron and c = 2 for others

Lemma 6 Let FM be the class of real functions that

depend on m then R`(EM [FM ]) le EM

[R`(FM )

]Proof

R`(EM [FM ]) = R`

(sumM

p (M)FM

)lesumM

R`(p(M)FM )

lesumM

|p(M)|R`(FM ) = EM

[R`(FM )

]because of common fact 1) R`(cF) = |c|R`(F) and 2)R`(sumi Fi) le

sumi R`(Fi)

Theorem 3 (DropConnect Network Complexity)Consider the DropConnect neural network defined inDefinition 1 Let R`(G) be the empirical Rademachercomplexity of the feature extractor and R`(F) be theempirical Rademacher complexity of the whole net-work In addition we assume

1 weight parameter of DropConnect layer |W | le Bh2 weight parameter of s ie |Ws| le Bs (L2-norm ofit is bounded by

radicdkBs)

Then we have

R`(F) le p(

2radickdBsn

radicdBh

)R`(G)

Proof

R`(F) = R`(EM [f(x θM ])

le EM[R`(f(x θM)

](2)

= EM[R`(s a hm g)

]le (

radicdkBs)

radicdEM

[R`(a hm g)

](3)

= 2radickdBsEM

[R`(hm g)

](4)

where hm = (M W )v Equation (2) is based onLemma 6 Equation (3) is based on Lemma 5 andEquation (4) follows from Lemma 4

EM[R`(hm g)

]= Emσ

[sup

hisinHgisinG

∣∣∣∣∣2` sumi=1

σiwTDMg(xi)

∣∣∣∣∣]

(5)

= Emσ

[sup

hisinHgisinG

∣∣∣∣∣langDMw

2

`

sumi=1

σig(xi)

rang∣∣∣∣∣]

le EM[maxwDMw

]Eσ

supgjisinG

∥∥∥∥∥∥[2

`

sumi=1

σigj(xi)

]nj=1

∥∥∥∥∥∥(6)

le Bhpradicnd(radic

nR`(G))= pn

radicdBhR`(G)

where DM in Equation (5) is an diagonal matrix withdiagonal elements equal to m and inner product prop-erties lead to Equation (6) Thus we have

R`(F) le p(

2radickdBsn

radicdBh

)R`(G)

Remark 3 Theorem 3 implies that p is an additionalregularizer we have added to network when we converta normal neural network to a network with DropCon-nect layers Consider the following extreme cases

1 p = 0 the network generalization bound equals to0 which is true because classifier does not depends oninput any more

2 p = 1 reduce to normal network

220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274

275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329

Regularization of Neural Networks using DropConnect Supplementary Material

Symbol Description Related Formulay Data Label can either be integer label for bit vec-

tor(depends on context)x Network input datag() Feature extractor function with parameter Wg

v Feature extractor network output v = g(xWg)M DropConnect connection information parameter

(weight mask)h() DropConnect transformation function with parame-

ter WMu DropConnect output u = h(vWM)a() DropConnect activation functionr DropConnect after activation r = a(u)s() Dimension reduction layer function with parameter

Ws

o Dimension reduction layer output (network output) o = s(rWs)θ All parameter of network expect weight mask θ = WsWWgf() Overall classifier(network) output o = f(x θM)λ Weight penaltyA() Data Loss Function A(ominus y)L() Over all objective function L(x y) =

sumiA(oi minus yi) + 12λW22

n Dimension of feature extractor outputd Dimension of DropConnect layer outputk number of class dim(y) = k

Table 1 Symbol Table

References

V Koltchinskii and D Panchenko Empirical margindistributions and bounding the generalization errorof combined classifiers Annals of Statistics 3020022000

M Ledoux and M Talagrand Probability in BanachSpaces Springer New York 1991

Page 7: Regularization of Neural Networks using …yann.lecun.com/exdb/publis/pdf/wan-icml-13.pdfRegularization of Neural Networks using DropConnect DropConnect weights W (d x n) b) DropConnect

Regularization of Neural Networks using DropConnect

model error() 5 network votingerror()

No-Drop 1118plusmn 013 1022Dropout 1152plusmn 018 983DropConnect 1110plusmn 013 941

Table 5 CIFAR-10 classification error using a larger fea-ture extractor Previous state-of-the-art is 95 (Snoeket al 2012) Voting with 12 DropConnect networks pro-duces an error rate of 932 significantly beating thestate-of-the-art

process the images using local contrast normalizationas in (Zeiler and Fergus 2013) The feature extractoris the same as the larger CIFAR-10 experiment butwe instead use a larger 512 unit fully connected layerwith relu activations between the softmax layer andthe feature extractor After contrast normalizing thetraining data is randomly cropped to 28 times 28 pixelsand is rotated and scaled We do not do horizontalflips Table 6 shows the classification performance for5 models trained with an initial learning rate of 0001for a 100-50-10 epoch schedule

Due to the large training set size both Dropout andDropConnect achieve nearly the same performance asNo-Drop However using our data augmentation tech-niques and careful annealing the per model scores eas-ily surpass the previous 280 state-of-the-art resultof (Zeiler and Fergus 2013) Furthermore our vot-ing scheme reduces the relative error of the previousstate-of-to-art by 30 to achieve 194 error

model error() 5 network votingerror()

No-Drop 226plusmn 0072 194Dropout 225plusmn 0034 196DropConnect 223plusmn 0039 194

Table 6 SVHN classification error The previous state-of-the-art is 28 (Zeiler and Fergus 2013)

64 NORB

In the final experiment we evaluate our models onthe 2-fold NORB (jittered-cluttered) dataset (LeCunet al 2004) a collection of stereo images of 3D mod-els For each image one of 6 classes appears on arandom background We train on 2-folds of 29 160images each and the test on a total of 58 320 imagesThe images are downsampled from 108times108 to 48times48as in (Ciresan et al 2012)

We use the same feature extractor as the largerCIFAR-10 experiment There is a 512 unit fully con-nected layer with relu activations placed between thesoftmax layer and feature extractor Rotation andscaling of the training data is applied but we do notcrop or flip the images as we found that to hurt per-

model error()5 network

votingerror()

No-Drop 448plusmn 078 336Dropout 396plusmn 016 303DropConnect 414plusmn 006 323

Table 7 NORM classification error for the jittered-cluttered dataset using 2 training folds The previousstate-of-art is 357 (Ciresan et al 2012)

formance on this dataset We trained with an initiallearning rate of 0001 and anneal for 100-40-10 epochs

In this experiment we beat the previous state-of-the-art result of 357 using No-Drop Dropout and Drop-Connect with our voting scheme While Dropout sur-passes DropConnect slightly both methods improveover No-Drop in this benchmark as shown in Table 7

7 Discussion

We have presented DropConnect which generalizesHinton et al rsquos Dropout (Hinton et al 2012) to the en-tire connectivity structure of a fully connected neuralnetwork layer We provide both theoretical justifica-tion and empirical results to show that DropConnecthelps regularize large neural network models Resultson a range of datasets show that DropConnect oftenoutperforms Dropout While our current implementa-tion of DropConnect is slightly slower than No-Drop orDropout in large models models the feature extractoris the bottleneck thus there is little difference in over-all training time DropConnect allows us to train largemodels while avoiding overfitting This yields state-of-the-art results on a variety of standard benchmarksusing our efficient GPU implementation of DropCon-nect

8 Appendix

81 Preliminaries

Definition 1 (DropConnect Network) Given dataset S with ` entries x1x2 x` with labelsy1 y2 y` we define the DropConnect networkas a mixture model o =

sumM p(M)f(x θM) =

EM [f(x θM)]

Each network f(x θM) has weights p(M) and net-work parameters are θ = WsWWg Ws are thesoftmax layer parameters W are the DropConnectlayer parameters and Wg are the feature extractor pa-rameters Further more M is the DropConnect layermask

Now we reformulate the cross-entropy loss on top ofthe softmax into a single parameter function that com-bines the softmax output and labels as a logistic

Regularization of Neural Networks using DropConnect

Definition 2 (Logistic Loss) The following loss func-tion defined on k-class classification is call the logisticloss function

Ay(o) = minussumi

yi lnexp oisumj exp(oj)

= minusoi + lnsumj

exp(oj)

where y is binary vector with ith bit set on

Lemma 1 Logistic loss function A has the followingproperties 1) Ay(0) = ln k 2) minus1 le Aprimey(o) le 1 and3)Aprimeprimey(o) ge 0

Definition 3 (Rademacher complexity) For a sampleS = x1 x` generated by a distribution D on setX and a real-valued function class F in domain X theempirical Rademacher complexity of F is the randomvariable

R` (F) = Eσ

[supfisinF|2`

sumi=1

σif(xi)| | x1 x`

]

where sigma = σ1 σ` are independent uniformplusmn1-valued (Rademacher) random variables The

Rademacher complexity of F is R`(F) = ES

[R` (F)

]

82 Bound Derivation

Lemma 2 ((Ledoux and Talagrand 1991)) Let Fbe class of real functions and H = [Fj ]kj=1 be a k-

dimensional function class If A Rk rarr R is a Lips-chitz function with constant L and satisfies A(0) = 0then R`(A H) le 2kLR`(F)

Lemma 3 (Classifier Generalization Bound) Gener-alization bound of a k-class classifier with logistic lossfunction is directly related Rademacher complexity ofthat classifier

E[Ay(o)] le 1`

sum`i=1Ayi(oi) + 2kR`(F) + 3

radicln(2δ)

2`

Lemma 4 For all neuron activations sigmoid tanhand relu we have R`(a F) le 2R`(F)

Lemma 5 (Network Layer Bound) Let G be the classof real functions Rd rarr R with input dimension F ieG = [Fj ]dj=1 and HB is a linear transform function

parametrized by W with W2 le B then R`(HG) leradicdBR`(F)

Proof R`(HG) = Eσ[suphisinHgisinG

∣∣∣ 2` sum`i=1 σih g(xi)

∣∣∣]= Eσ

[supgisinGWleB

∣∣∣langW 2`

sum`i=1 σig(xi)

rang∣∣∣]le BEσ

[supfjisinF

∥∥∥∥[ 2` sum`i=1 σ

ji fj(xi)

]dj=1

∥∥∥∥]= BradicdEσ

[supfisinF

∣∣∣ 2` sum`i=1 σif(xi)

∣∣∣] =radicdBR`(F)

Remark 1 Given a layer in our network we denotethe function of all layers before as G = [Fj ]dj=1 This

layer has the linear transformation function H and ac-tivation function a By Lemma 4 and Lemma 5 weknow the network complexity is bounded by

R`(H G) le cradicdBR`(F)

where c = 1 for identity neuron and c = 2 for others

Lemma 6 Let FM be the class of real functions that

depend on M then R`(EM [FM ]) le EM

[R`(FM )

]Proof R`(EM [FM ]) = R`

(sumM p (m)FM

)lesum

M R`(p(m)FM ) lesumM |p(m)|R`(FM ) = EM

[R`(FM )

]

Theorem 1 (DropConnect Network Complexity)Consider the DropConnect neural network defined inDefinition 1 Let R`(G) be the empirical Rademachercomplexity of the feature extractor and R`(F) be theempirical Rademacher complexity of the whole net-work In addition we assume

1 weight parameter of DropConnect layer |W | le Bh2 weight parameter of s ie |Ws| le Bs (L2-norm ofit is bounded by

radicdkBs)

Then we have R`(F) le p(

2radickdBsn

radicdBh

)R`(G)

Proof

R`(F) = R`(EM [f(x θM ]) le EM[R`(f(x θM)

](6)

le (radicdkBs)

radicdEM

[R`(a hM g)

](7)

= 2radickdBsEM

[R`(hM g)

](8)

where hM = (M W )v Equation (6) is based onLemma 6 Equation (7) is based on Lemma 5 andEquation (8) follows from Lemma 4

EM[R`(hM g)

]= EMσ

[sup

hisinHgisinG

∣∣∣∣∣2` sumi=1

σiWTDMg(xi)

∣∣∣∣∣]

(9)

= EMσ

[sup

hisinHgisinG

∣∣∣∣∣langDMW

2

`

sumi=1

σig(xi)

rang∣∣∣∣∣]

le EM[maxWDMW

]Eσ

supgjisinG

∥∥∥∥∥∥[

2

`

sumi=1

σigj(xi)

]nj=1

∥∥∥∥∥∥(10)

le Bhpradicnd(radic

nR`(G))

= pnradicdBhR`(G)

where DM in Equation (9) is an diagonal matrixwith diagonal elements equal to m and inner prod-uct properties lead to Equation (10) Thus we have

R`(F) le p(

2radickdBsn

radicdBh

)R`(G)

Regularization of Neural Networks using DropConnect

References

D Ciresan U Meier and J Schmidhuber Multi-column deep neural networks for image classifica-tion In Proceedings of the 2012 IEEE Confer-ence on Computer Vision and Pattern Recognition(CVPR) CVPR rsquo12 pages 3642ndash3649 WashingtonDC USA 2012 IEEE Computer Society ISBN 978-1-4673-1226-4

G E Hinton N Srivastava A KrizhevskyI Sutskever and R Salakhutdinov Improving neu-ral networks by preventing co-adaptation of featuredetectors CoRR abs12070580 2012

A Krizhevsky Learning Multiple Layers of Featuresfrom Tiny Images Masterrsquos thesis University ofToront 2009

A Krizhevsky cuda-convnet httpcodegoogle

compcuda-convnet 2012

Y LeCun L Bottou Y Bengio and P HaffnerGradient-based learning applied to document recog-nition Proceedings of the IEEE 86(11)2278 ndash2324nov 1998 ISSN 0018-9219 doi 1011095726791

Y LeCun F J Huang and L Bottou Learning meth-ods for generic object recognition with invariance topose and lighting In Proceedings of the 2004 IEEEcomputer society conference on Computer vision andpattern recognition CVPRrsquo04 pages 97ndash104 Wash-ington DC USA 2004 IEEE Computer Society

M Ledoux and M Talagrand Probability in BanachSpaces Springer New York 1991

D J C Mackay Probable networks and plausiblepredictions - a review of practical bayesian methodsfor supervised neural networks In Bayesian methodsfor backpropagation networks Springer 1995

V Nair and G E Hinton Rectified Linear Units Im-prove Restricted Boltzmann Machines In ICML2010

Y Netzer T Wang Coates A A Bissacco B Wuand A Y Ng Reading digits in natural images withunsupervised feature learning In NIPS Workshopon Deep Learning and Unsupervised Feature Learn-ing 2011 2011

J Snoek H Larochelle and R A Adams Practi-cal bayesian optimization of machine learning algo-rithms In Neural Information Processing Systems2012

A S Weigend D E Rumelhart and B A HubermanGeneralization by weight-elimination with applica-tion to forecasting In NIPS 1991

M D Zeiler and R Fergus Stochastic pooling forregualization of deep convolutional neural networksIn ICLR 2013

000001002003004005006007008009010011012013014015016017018019020021022023024025026027028029030031032033034035036037038039040041042043044045046047048049050051052053054

055056057058059060061062063064065066067068069070071072073074075076077078079080081082083084085086087088089090091092093094095096097098099100101102103104105106107108109

Regularization of Neural Networks using DropConnectSupplementary Material

1 Preliminaries

Definition 1 (DropConnect Network) Given dataset S with ` entries x1x2 x` with labelsy1 y2 y` we define DropConnect network as amixture model

o =summ

p(M)f(x θM) = Em [f(x θM)] (1)

Each network f(x θM) has weights p(M) and net-work parameters are θ = WsWWg Ws are thesoftmax layer parameters w are the DropConnectlayer parameters and Wg are the feature extractor pa-rameters Further more m is the DropConnect layermask

Remark 1 when each element of Mi has equal proba-bility of being on and off (p = 05) the mixture modelhas equal weights for all sub-models f(x θM) oth-erwise the mixture model has larger weights in somesub-models than others

Reformulate cross-entropy loss on top of soft-max intoa single parameter function that combines soft-maxoutput and labels Same as logistic

Definition 2 (Logistic Loss) The following loss func-tion defined on k-class classification is call logistic lossfunction

Ay(o) = minussumi

yi lnexp oisumj exp(oj)

= minusoi + lnsumj

exp(oj)

where y is binary vector with ith bit set on

Lemma 1 Logistic loss function A has the followingproperties

1 Ay(0) = ln k2 minus1 le Aprimey(o) le 13 Aprimeprimey(o) ge 0

The first one says A(0) is depend on some constantrelated with number of labels The second one says Ais Lipschitz function with L = 1 The third one saysA is a convex function wrt x

Definition 3 (Rademacher complexity) For a sampleS = x1 x` generated by a distribution D on set

X and a real-valued function class F in domain X theempirical Rademacher complexity of F is the randomvariable

R` (F) = Eσ

[supfisinF|2`

sumi=1

σif(xi)| | x1 x`

]

where sigma = σ1 σ` are independent uniformplusmn1-valued (Rademacher) random variables The

Rademacher complexity of F is R`(F) = ES

[R` (F)

]

Theorem 1 ((Koltchinskii and Panchenko 2000))Fix δ isin (0 1) and let F be a class of functions mappingfrom M to [0 1] Let (Mi)

`iminus1 be drawn independently

according to a probability distribution D Then withprobability at least 1minusδ over random draws of samplesof size ` every f isin F satisfies

E [f(M)] le E [f(M)] +R`(F) +

radicln (2δ)

2`

le E [f(M)] + R`(F) + 3

radicln (2δ)

2`

2 Bound Derivation

Theorem 2 ((Ledoux and Talagrand 1991)) Let Fbe class of real functions If A R rarr R is Lipschitzwith constant L and satisfies A(0) = 0 then R`(A F ) le 2LR(F)

Lemma 2 Let F be class of real functions and H =[Fj ]kj=1 be a k-dimensional function class If A Rk rarrR is a Lipschitz function with constant L and satisfiesA(0) = 0 then R`(A H) le 2kLR`(F)

Lemma 3 (Classifier Generalization Bound) Gener-alization bound of a k-class classifier with logistic lossfunction is directly related Rademacher complexity ofthat classifier

E[Ay(o)] le 1

`

sumi=1

Ayi(oi) + 2kR`(F) + 3

radicln (2δ)

2`

Proof From Lemma 1 Logistic loss function (A minusc)(x) isin A due to (A minus c)prime(x) le 1 and (A minus c)(0) = 0with some constant c By Lemma 2 R`((Aminusc)F) le2kR`(F)

110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164

165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219

Regularization of Neural Networks using DropConnect Supplementary Material

Lemma 4 For all neuron activations sigmoid tanhand relu we have R`(a F) le 2R`(F)

Lemma 5 (Network Layer Bound) Let G be the classof real functions Rd rarr R with input dimension F ieG = [Fj ]dj=1 and HB is a linear transform function

parameterized by W with W2 le B then R`(HG) leradicdBR`(F)

Proof

R`(H G)

= Eσ

[sup

hisinHgisinG

∣∣∣∣∣2` sumi=1

σih g(xi)

∣∣∣∣∣]

= Eσ

[sup

gisinGWleB

∣∣∣∣∣langW

2

`

sumi=1

σig(xi)

rang∣∣∣∣∣]

le BEσ

supfjisinF

∥∥∥∥∥∥[

2

`

sumi=1

σji fj(xi)

]dj=1

∥∥∥∥∥∥

= BradicdEσ

[supfisinF

∣∣∣∣∣2` sumi=1

σif(xi)

∣∣∣∣∣]

=radicdBR`(F)

Remark 2 Given a layer in our network we denotethe function of all layers before as G = [Fj ]dj=1 Thislayer has the linear transformation function H and ac-tivation function a By Lemma 4 and Lemma 5 weknow the network complexity is bounded by

R`(H G) le cradicdBR`(F)

where c = 1 for identity neuron and c = 2 for others

Lemma 6 Let FM be the class of real functions that

depend on m then R`(EM [FM ]) le EM

[R`(FM )

]Proof

R`(EM [FM ]) = R`

(sumM

p (M)FM

)lesumM

R`(p(M)FM )

lesumM

|p(M)|R`(FM ) = EM

[R`(FM )

]because of common fact 1) R`(cF) = |c|R`(F) and 2)R`(sumi Fi) le

sumi R`(Fi)

Theorem 3 (DropConnect Network Complexity)Consider the DropConnect neural network defined inDefinition 1 Let R`(G) be the empirical Rademachercomplexity of the feature extractor and R`(F) be theempirical Rademacher complexity of the whole net-work In addition we assume

1 weight parameter of DropConnect layer |W | le Bh2 weight parameter of s ie |Ws| le Bs (L2-norm ofit is bounded by

radicdkBs)

Then we have

R`(F) le p(

2radickdBsn

radicdBh

)R`(G)

Proof

R`(F) = R`(EM [f(x θM ])

le EM[R`(f(x θM)

](2)

= EM[R`(s a hm g)

]le (

radicdkBs)

radicdEM

[R`(a hm g)

](3)

= 2radickdBsEM

[R`(hm g)

](4)

where hm = (M W )v Equation (2) is based onLemma 6 Equation (3) is based on Lemma 5 andEquation (4) follows from Lemma 4

EM[R`(hm g)

]= Emσ

[sup

hisinHgisinG

∣∣∣∣∣2` sumi=1

σiwTDMg(xi)

∣∣∣∣∣]

(5)

= Emσ

[sup

hisinHgisinG

∣∣∣∣∣langDMw

2

`

sumi=1

σig(xi)

rang∣∣∣∣∣]

le EM[maxwDMw

]Eσ

supgjisinG

∥∥∥∥∥∥[2

`

sumi=1

σigj(xi)

]nj=1

∥∥∥∥∥∥(6)

le Bhpradicnd(radic

nR`(G))= pn

radicdBhR`(G)

where DM in Equation (5) is an diagonal matrix withdiagonal elements equal to m and inner product prop-erties lead to Equation (6) Thus we have

R`(F) le p(

2radickdBsn

radicdBh

)R`(G)

Remark 3 Theorem 3 implies that p is an additionalregularizer we have added to network when we converta normal neural network to a network with DropCon-nect layers Consider the following extreme cases

1 p = 0 the network generalization bound equals to0 which is true because classifier does not depends oninput any more

2 p = 1 reduce to normal network

220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274

275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320321322323324325326327328329

Regularization of Neural Networks using DropConnect Supplementary Material

Symbol Description Related Formulay Data Label can either be integer label for bit vec-

tor(depends on context)x Network input datag() Feature extractor function with parameter Wg

v Feature extractor network output v = g(xWg)M DropConnect connection information parameter

(weight mask)h() DropConnect transformation function with parame-

ter WMu DropConnect output u = h(vWM)a() DropConnect activation functionr DropConnect after activation r = a(u)s() Dimension reduction layer function with parameter

Ws

o Dimension reduction layer output (network output) o = s(rWs)θ All parameter of network expect weight mask θ = WsWWgf() Overall classifier(network) output o = f(x θM)λ Weight penaltyA() Data Loss Function A(ominus y)L() Over all objective function L(x y) =

sumiA(oi minus yi) + 12λW22

n Dimension of feature extractor outputd Dimension of DropConnect layer outputk number of class dim(y) = k

Table 1 Symbol Table

References

V Koltchinskii and D Panchenko Empirical margindistributions and bounding the generalization errorof combined classifiers Annals of Statistics 3020022000

M Ledoux and M Talagrand Probability in BanachSpaces Springer New York 1991

Page 8: Regularization of Neural Networks using …yann.lecun.com/exdb/publis/pdf/wan-icml-13.pdfRegularization of Neural Networks using DropConnect DropConnect weights W (d x n) b) DropConnect

Regularization of Neural Networks using DropConnect

Definition 2 (Logistic Loss) The following loss func-tion defined on k-class classification is call the logisticloss function

Ay(o) = minussumi

yi lnexp oisumj exp(oj)

= minusoi + lnsumj

exp(oj)

where y is binary vector with ith bit set on

Lemma 1 Logistic loss function A has the followingproperties 1) Ay(0) = ln k 2) minus1 le Aprimey(o) le 1 and3)Aprimeprimey(o) ge 0

Definition 3 (Rademacher complexity) For a sampleS = x1 x` generated by a distribution D on setX and a real-valued function class F in domain X theempirical Rademacher complexity of F is the randomvariable

R` (F) = Eσ

[supfisinF|2`

sumi=1

σif(xi)| | x1 x`

]

where sigma = σ1 σ` are independent uniformplusmn1-valued (Rademacher) random variables The

Rademacher complexity of F is R`(F) = ES

[R` (F)

]

82 Bound Derivation

Lemma 2 ((Ledoux and Talagrand 1991)) Let Fbe class of real functions and H = [Fj ]kj=1 be a k-

dimensional function class If A Rk rarr R is a Lips-chitz function with constant L and satisfies A(0) = 0then R`(A H) le 2kLR`(F)

Lemma 3 (Classifier Generalization Bound) Gener-alization bound of a k-class classifier with logistic lossfunction is directly related Rademacher complexity ofthat classifier

E[Ay(o)] le 1`

sum`i=1Ayi(oi) + 2kR`(F) + 3

radicln(2δ)

2`

Lemma 4 For all neuron activations sigmoid tanhand relu we have R`(a F) le 2R`(F)

Lemma 5 (Network Layer Bound) Let G be the classof real functions Rd rarr R with input dimension F ieG = [Fj ]dj=1 and HB is a linear transform function

parametrized by W with W2 le B then R`(HG) leradicdBR`(F)

Proof R`(HG) = Eσ[suphisinHgisinG

∣∣∣ 2` sum`i=1 σih g(xi)

∣∣∣]= Eσ

[supgisinGWleB

∣∣∣langW 2`

sum`i=1 σig(xi)

rang∣∣∣]le BEσ

[supfjisinF

∥∥∥∥[ 2` sum`i=1 σ

ji fj(xi)

]dj=1

∥∥∥∥]= BradicdEσ

[supfisinF

∣∣∣ 2` sum`i=1 σif(xi)

∣∣∣] =radicdBR`(F)

Remark 1 Given a layer in our network we denotethe function of all layers before as G = [Fj ]dj=1 This

layer has the linear transformation function H and ac-tivation function a By Lemma 4 and Lemma 5 weknow the network complexity is bounded by

R`(H G) le cradicdBR`(F)

where c = 1 for identity neuron and c = 2 for others

Lemma 6 Let FM be the class of real functions that

depend on M then R`(EM [FM ]) le EM

[R`(FM )

]Proof R`(EM [FM ]) = R`

(sumM p (m)FM

)lesum

M R`(p(m)FM ) lesumM |p(m)|R`(FM ) = EM

[R`(FM )

]

Theorem 1 (DropConnect Network Complexity)Consider the DropConnect neural network defined inDefinition 1 Let R`(G) be the empirical Rademachercomplexity of the feature extractor and R`(F) be theempirical Rademacher complexity of the whole net-work In addition we assume

1 weight parameter of DropConnect layer |W | le Bh2 weight parameter of s ie |Ws| le Bs (L2-norm ofit is bounded by

radicdkBs)

Then we have R`(F) le p(

2radickdBsn

radicdBh

)R`(G)

Proof

R`(F) = R`(EM [f(x θM ]) le EM[R`(f(x θM)

](6)

le (radicdkBs)

radicdEM

[R`(a hM g)

](7)

= 2radickdBsEM

[R`(hM g)

](8)

where hM = (M W )v Equation (6) is based onLemma 6 Equation (7) is based on Lemma 5 andEquation (8) follows from Lemma 4

EM[R`(hM g)

]= EMσ

[sup

hisinHgisinG

∣∣∣∣∣2` sumi=1

σiWTDMg(xi)

∣∣∣∣∣]

(9)

= EMσ

[sup

hisinHgisinG

∣∣∣∣∣langDMW

2

`

sumi=1

σig(xi)

rang∣∣∣∣∣]

le EM[maxWDMW

]Eσ

supgjisinG

∥∥∥∥∥∥[

2

`

sumi=1

σigj(xi)

]nj=1

∥∥∥∥∥∥(10)

le Bhpradicnd(radic

nR`(G))

= pnradicdBhR`(G)

where DM in Equation (9) is an diagonal matrixwith diagonal elements equal to m and inner prod-uct properties lead to Equation (10) Thus we have

R`(F) le p(

2radickdBsn

radicdBh

)R`(G)

Regularization of Neural Networks using DropConnect

References

D. Ciresan, U. Meier, and J. Schmidhuber. Multi-column deep neural networks for image classification. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3642–3649, Washington, DC, USA, 2012. IEEE Computer Society. ISBN 978-1-4673-1226-4.

G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.

A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. Master's thesis, University of Toronto, 2009.

A. Krizhevsky. cuda-convnet. http://code.google.com/p/cuda-convnet, 2012.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, Nov. 1998. ISSN 0018-9219. doi: 10.1109/5.726791.

Y. LeCun, F. J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'04), pages 97–104, Washington, DC, USA, 2004. IEEE Computer Society.

M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, New York, 1991.

D. J. C. Mackay. Probable networks and plausible predictions - a review of practical Bayesian methods for supervised neural networks. In Bayesian Methods for Backpropagation Networks. Springer, 1995.

V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, 2010.

Y. Netzer, T. Wang, A. Coates, A. Bissacco, B. Wu, and A. Y. Ng. Reading digits in natural images with unsupervised feature learning. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2011.

J. Snoek, H. Larochelle, and R. P. Adams. Practical Bayesian optimization of machine learning algorithms. In Neural Information Processing Systems, 2012.

A. S. Weigend, D. E. Rumelhart, and B. A. Huberman. Generalization by weight-elimination with application to forecasting. In NIPS, 1991.

M. D. Zeiler and R. Fergus. Stochastic pooling for regularization of deep convolutional neural networks. In ICLR, 2013.


Regularization of Neural Networks using DropConnectSupplementary Material

1 Preliminaries

Definition 1 (DropConnect Network). Given a dataset $S$ with $\ell$ entries $x_1, x_2, \ldots, x_\ell$ and labels $y_1, y_2, \ldots, y_\ell$, we define the DropConnect network as a mixture model:
\[
o = \sum_M p(M)\,f(x;\theta,M) = \mathbb{E}_M\big[f(x;\theta,M)\big] \qquad (1)
\]
Each sub-network $f(x;\theta,M)$ has mixture weight $p(M)$, and the network parameters are $\theta = \{W_s, W, W_g\}$: $W_s$ are the softmax layer parameters, $W$ are the DropConnect layer parameters, and $W_g$ are the feature extractor parameters. Furthermore, $M$ is the DropConnect layer mask.

Remark 1. When each element of $M$ has equal probability of being on or off ($p = 0.5$), the mixture model places equal weight on every sub-model $f(x;\theta,M)$; otherwise it places larger weight on some sub-models than on others.
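To make the mixture view concrete, the following sketch approximates Equation (1) by sampling masks and averaging the corresponding sub-network outputs. The callable `f`, its parameters `theta`, and the mask shape are placeholders we introduce for illustration; this is a minimal sketch, not the inference procedure used in the paper.

    import numpy as np

    def mixture_output(f, x, theta, mask_shape, p, n_samples=1000, rng=None):
        """Monte-Carlo approximation of o = sum_M p(M) f(x; theta, M) = E_M[f(x; theta, M)].

        Masks are sampled element-wise from Bernoulli(p), so each draw selects one
        sub-model of the mixture; averaging the outputs estimates Equation (1)."""
        rng = np.random.default_rng() if rng is None else rng
        acc = None
        for _ in range(n_samples):
            M = (rng.random(mask_shape) < p).astype(float)  # one sub-model's mask
            o = f(x, theta, M)
            acc = o if acc is None else acc + o
        return acc / n_samples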

We now reformulate the cross-entropy loss on top of the softmax as a single function of the network output that absorbs the label; this is the standard multi-class logistic loss.

Definition 2 (Logistic Loss). The following loss function, defined for $k$-class classification, is called the logistic loss function:
\[
A_y(o) = -\sum_i y_i \ln\frac{\exp(o_i)}{\sum_j \exp(o_j)} = -o_i + \ln\sum_j \exp(o_j),
\]
where $y$ is a binary vector with its $i$-th bit set on.

Lemma 1. The logistic loss function $A$ has the following properties:

1. $A_y(0) = \ln k$
2. $-1 \le A_y'(o) \le 1$
3. $A_y''(o) \ge 0$

The first property says that $A_y(0)$ is a constant depending only on the number of labels $k$. The second says that $A$ is a Lipschitz function with $L = 1$. The third says that $A$ is convex with respect to $o$.
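The first two properties are easy to check numerically. The snippet below is a small sanity check we add for illustration (the class count and label index are arbitrary choices); it is not part of the proof.

    import numpy as np

    def logistic_loss(o, i):
        """A_y(o) = -o_i + ln sum_j exp(o_j) for a one-hot label at index i (Definition 2)."""
        return -o[i] + np.log(np.sum(np.exp(o)))

    k, i = 10, 3
    o = np.zeros(k)
    print(np.isclose(logistic_loss(o, i), np.log(k)))   # property 1: A_y(0) = ln k

    # Property 2 (sketch): the gradient of A_y is softmax(o) - y, every entry of
    # which lies in [-1, 1]; this is the Lipschitz statement with L = 1.
    grad = np.exp(o) / np.sum(np.exp(o))
    grad[i] -= 1.0
    print(np.all(np.abs(grad) <= 1.0))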

Definition 3 (Rademacher complexity). For a sample $S = \{x_1, \ldots, x_\ell\}$ generated by a distribution $D$ on a set $X$ and a real-valued function class $\mathcal{F}$ with domain $X$, the empirical Rademacher complexity of $\mathcal{F}$ is the random variable
\[
\hat{R}_\ell(\mathcal{F}) = \mathbb{E}_\sigma\Bigg[\sup_{f\in\mathcal{F}}\Big|\tfrac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i f(x_i)\Big| \;\Big|\; x_1, \ldots, x_\ell\Bigg],
\]
where $\sigma = \{\sigma_1, \ldots, \sigma_\ell\}$ are independent uniform $\pm 1$-valued (Rademacher) random variables. The Rademacher complexity of $\mathcal{F}$ is $R_\ell(\mathcal{F}) = \mathbb{E}_S\big[\hat{R}_\ell(\mathcal{F})\big]$.
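For intuition, the expectation over $\sigma$ can be estimated by sampling. The sketch below is an illustrative Monte-Carlo estimator under an assumed function class for which the inner supremum has a closed form (linear functions with an L2-bounded weight vector, where Cauchy-Schwarz gives the supremum exactly); the helper names are ours.

    import numpy as np

    def empirical_rademacher(S, sup_over_class, n_sigma=500, rng=None):
        """Monte-Carlo estimate of E_sigma[ sup_f | 2/ell * sum_i sigma_i f(x_i) | ].

        `sup_over_class(sigma, S)` must return the inner supremum for one draw of sigma."""
        rng = np.random.default_rng() if rng is None else rng
        ell = len(S)
        draws = [sup_over_class(rng.choice([-1.0, 1.0], size=ell), S) for _ in range(n_sigma)]
        return float(np.mean(draws))

    # Example class: F = { x -> <w, x> : ||w||_2 <= B }.  For this class the
    # supremum equals B * || 2/ell * sum_i sigma_i x_i ||_2 by Cauchy-Schwarz.
    B = 1.0
    S = np.random.default_rng(0).normal(size=(50, 5))            # 50 samples in R^5
    sup_linear = lambda sigma, S: B * np.linalg.norm((2.0 / len(S)) * sigma @ S)
    print(empirical_rademacher(S, sup_linear))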

Theorem 1 ((Koltchinskii and Panchenko, 2000)). Fix $\delta \in (0,1)$ and let $\mathcal{F}$ be a class of functions mapping from a set $\mathcal{M}$ to $[0,1]$. Let $(M_i)_{i=1}^{\ell}$ be drawn independently according to a probability distribution $D$. Then, with probability at least $1-\delta$ over random draws of samples of size $\ell$, every $f \in \mathcal{F}$ satisfies
\[
\mathbb{E}[f(M)] \le \hat{\mathbb{E}}[f(M)] + R_\ell(\mathcal{F}) + \sqrt{\frac{\ln(2/\delta)}{2\ell}}
\le \hat{\mathbb{E}}[f(M)] + \hat{R}_\ell(\mathcal{F}) + 3\sqrt{\frac{\ln(2/\delta)}{2\ell}},
\]
where $\hat{\mathbb{E}}[f(M)]$ denotes the empirical average of $f$ over the sample.

2. Bound Derivation

Theorem 2 ((Ledoux and Talagrand, 1991)). Let $\mathcal{F}$ be a class of real functions. If $A: \mathbb{R} \to \mathbb{R}$ is Lipschitz with constant $L$ and satisfies $A(0) = 0$, then $\hat{R}_\ell(A \circ \mathcal{F}) \le 2L\,\hat{R}_\ell(\mathcal{F})$.

Lemma 2. Let $\mathcal{F}$ be a class of real functions and $\mathcal{H} = [\mathcal{F}_j]_{j=1}^{k}$ be a $k$-dimensional function class. If $A: \mathbb{R}^k \to \mathbb{R}$ is a Lipschitz function with constant $L$ and satisfies $A(0) = 0$, then $\hat{R}_\ell(A \circ \mathcal{H}) \le 2kL\,\hat{R}_\ell(\mathcal{F})$.

Lemma 3 (Classifier Generalization Bound). The generalization bound of a $k$-class classifier with the logistic loss function is directly related to the Rademacher complexity of that classifier:
\[
\mathbb{E}[A_y(o)] \le \frac{1}{\ell}\sum_{i=1}^{\ell} A_{y_i}(o_i) + 2k\,\hat{R}_\ell(\mathcal{F}) + 3\sqrt{\frac{\ln(2/\delta)}{2\ell}}
\]

Proof. From Lemma 1, the shifted logistic loss $(A - c)(x)$ is Lipschitz with constant 1, since $|(A - c)'(x)| \le 1$, and satisfies $(A - c)(0) = 0$ for the constant $c = A_y(0) = \ln k$. By Lemma 2, $\hat{R}_\ell((A - c) \circ \mathcal{F}) \le 2k\,\hat{R}_\ell(\mathcal{F})$, and applying Theorem 1 gives the stated bound.


Lemma 4. For all neuron activations considered (sigmoid, tanh, and relu), we have $\hat{R}_\ell(a \circ \mathcal{F}) \le 2\hat{R}_\ell(\mathcal{F})$.

Lemma 5 (Network Layer Bound). Let $\mathcal{G} = [\mathcal{F}_j]_{j=1}^{d}$ be a $d$-dimensional function class built from $\mathcal{F}$, and let $\mathcal{H}$ be the class of linear transformations $\mathbb{R}^d \to \mathbb{R}$ parameterized by $W$ with $\|W\|_2 \le B$. Then $\hat{R}_\ell(\mathcal{H} \circ \mathcal{G}) \le \sqrt{d}\,B\,\hat{R}_\ell(\mathcal{F})$.

Proof.
\begin{align*}
\hat{R}_\ell(\mathcal{H} \circ \mathcal{G})
&= \mathbb{E}_\sigma\Bigg[\sup_{h\in\mathcal{H},\,g\in\mathcal{G}}\Big|\tfrac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i\,h(g(x_i))\Big|\Bigg]\\
&= \mathbb{E}_\sigma\Bigg[\sup_{g\in\mathcal{G},\,\|W\|\le B}\Big|\Big\langle W,\;\tfrac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i\,g(x_i)\Big\rangle\Big|\Bigg]\\
&\le B\,\mathbb{E}_\sigma\Bigg[\sup_{f_j\in\mathcal{F}}\Big\|\Big[\tfrac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i^j f_j(x_i)\Big]_{j=1}^{d}\Big\|\Bigg]\\
&= B\sqrt{d}\;\mathbb{E}_\sigma\Bigg[\sup_{f\in\mathcal{F}}\Big|\tfrac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i f(x_i)\Big|\Bigg]
= \sqrt{d}\,B\,\hat{R}_\ell(\mathcal{F})
\end{align*}

Remark 2. Given a layer in our network, we denote the function computed by all layers before it as $\mathcal{G} = [\mathcal{F}_j]_{j=1}^{d}$. This layer has the linear transformation function $H$ and activation function $a$. By Lemma 4 and Lemma 5 we know the network complexity is bounded by
\[
\hat{R}_\ell(H \circ \mathcal{G}) \le c\,\sqrt{d}\,B\,\hat{R}_\ell(\mathcal{F}),
\]
where $c = 1$ for the identity neuron and $c = 2$ for all others.

Lemma 6. Let $\mathcal{F}_M$ be the class of real functions that depend on the mask $M$; then $\hat{R}_\ell(\mathbb{E}_M[\mathcal{F}_M]) \le \mathbb{E}_M\big[\hat{R}_\ell(\mathcal{F}_M)\big]$.

Proof.
\[
\hat{R}_\ell(\mathbb{E}_M[\mathcal{F}_M]) = \hat{R}_\ell\Big(\textstyle\sum_M p(M)\,\mathcal{F}_M\Big)
\le \textstyle\sum_M \hat{R}_\ell(p(M)\,\mathcal{F}_M)
\le \textstyle\sum_M |p(M)|\,\hat{R}_\ell(\mathcal{F}_M)
= \mathbb{E}_M\big[\hat{R}_\ell(\mathcal{F}_M)\big],
\]
because of the common facts that 1) $\hat{R}_\ell(c\mathcal{F}) = |c|\,\hat{R}_\ell(\mathcal{F})$ and 2) $\hat{R}_\ell(\sum_i \mathcal{F}_i) \le \sum_i \hat{R}_\ell(\mathcal{F}_i)$.

Theorem 3 (DropConnect Network Complexity). Consider the DropConnect neural network defined in Definition 1. Let $\hat{R}_\ell(\mathcal{G})$ be the empirical Rademacher complexity of the feature extractor and $\hat{R}_\ell(\mathcal{F})$ be the empirical Rademacher complexity of the whole network. In addition, we assume:

1. the weight parameters of the DropConnect layer satisfy $|W| \le B_h$;
2. the weight parameters of $s$ satisfy $|W_s| \le B_s$ (so its L2-norm is bounded by $\sqrt{dk}\,B_s$).

Then we have
\[
\hat{R}_\ell(\mathcal{F}) \le p\,\big(2\sqrt{kd}\,B_s\big)\big(n\sqrt{d}\,B_h\big)\,\hat{R}_\ell(\mathcal{G}).
\]

Proof.
\begin{align*}
\hat{R}_\ell(\mathcal{F}) &= \hat{R}_\ell(\mathbb{E}_M[f(x;\theta,M)]) \le \mathbb{E}_M\big[\hat{R}_\ell(f(x;\theta,M))\big] && (2)\\
&= \mathbb{E}_M\big[\hat{R}_\ell(s \circ a \circ h_M \circ g)\big] \le \big(\sqrt{dk}\,B_s\big)\sqrt{d}\;\mathbb{E}_M\big[\hat{R}_\ell(a \circ h_M \circ g)\big] && (3)\\
&= 2\sqrt{kd}\,B_s\;\mathbb{E}_M\big[\hat{R}_\ell(h_M \circ g)\big] && (4)
\end{align*}
where $h_M(v) = (M \odot W)v$. Equation (2) is based on Lemma 6, Equation (3) is based on Lemma 5, and Equation (4) follows from Lemma 4.

\begin{align*}
\mathbb{E}_M\big[\hat{R}_\ell(h_M \circ g)\big]
&= \mathbb{E}_{M,\sigma}\Bigg[\sup_{h\in\mathcal{H},\,g\in\mathcal{G}}\Big|\tfrac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i\,w^{T}D_M\,g(x_i)\Big|\Bigg] && (5)\\
&= \mathbb{E}_{M,\sigma}\Bigg[\sup_{h\in\mathcal{H},\,g\in\mathcal{G}}\Big|\Big\langle D_M w,\;\tfrac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i\,g(x_i)\Big\rangle\Big|\Bigg] &&\\
&\le \mathbb{E}_M\Big[\max_{w}\|D_M w\|\Big]\;\mathbb{E}_\sigma\Bigg[\sup_{g_j\in\mathcal{G}}\Big\|\Big[\tfrac{2}{\ell}\sum_{i=1}^{\ell}\sigma_i\,g_j(x_i)\Big]_{j=1}^{n}\Big\|\Bigg] && (6)\\
&\le B_h\,p\,\sqrt{nd}\,\big(\sqrt{n}\,\hat{R}_\ell(\mathcal{G})\big) = p\,n\sqrt{d}\,B_h\,\hat{R}_\ell(\mathcal{G}),
\end{align*}
where $D_M$ in Equation (5) is a diagonal matrix with diagonal elements equal to the mask $M$, and inner-product properties lead to Equation (6). Thus we have
\[
\hat{R}_\ell(\mathcal{F}) \le p\,\big(2\sqrt{kd}\,B_s\big)\big(n\sqrt{d}\,B_h\big)\,\hat{R}_\ell(\mathcal{G}).
\]

Remark 3. Theorem 3 implies that $p$ acts as an additional regularizer introduced when we convert a standard neural network into one with DropConnect layers. Consider the following extreme cases (a numerical sketch of the intermediate regime follows below):

1. $p = 0$: the network generalization bound equals 0, which is expected because the classifier no longer depends on its input.
2. $p = 1$: the bound reduces to that of the standard network.
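The sketch below illustrates the linear dependence on $p$ empirically for the quantity bounded in Equations (5)-(6). It assumes, purely for illustration, an infinity-norm constraint $|w_j| \le B_h$ on the DropConnect weights so that the inner supremum has the closed form $B_h \|D_M v\|_1$ with $v = \tfrac{2}{\ell}\sum_i \sigma_i g(x_i)$; random data stand in for the features $g(x_i)$.

    import numpy as np

    rng = np.random.default_rng(0)
    ell, n, B_h = 40, 8, 1.0
    X = rng.normal(size=(ell, n))            # stand-in for the features g(x_i)

    def dropconnect_complexity(p, n_mc=4000):
        """Monte-Carlo estimate of E_{M,sigma} sup_w | 2/ell sum_i sigma_i w^T D_M g(x_i) |
        under the (assumed) constraint |w_j| <= B_h, for which the supremum equals
        B_h * || D_M v ||_1 with v = 2/ell sum_i sigma_i g(x_i)."""
        total = 0.0
        for _ in range(n_mc):
            sigma = rng.choice([-1.0, 1.0], size=ell)
            v = (2.0 / ell) * sigma @ X
            M = rng.random(n) < p            # Bernoulli(p) mask on the n inputs
            total += B_h * np.sum(np.abs(v)[M])
        return total / n_mc

    # The ratio approaches p, matching the p factor in Theorem 3.
    print(dropconnect_complexity(0.5) / dropconnect_complexity(1.0))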


Symbol   Description                                                  Related Formula
y        Data label; either an integer label or a bit vector
         (depending on context)
x        Network input data
g(.)     Feature extractor function with parameters W_g
v        Feature extractor output                                     v = g(x; W_g)
M        DropConnect connection information (weight mask)
h(.)     DropConnect transformation function with parameters W, M
u        DropConnect output                                           u = h(v; W, M)
a(.)     DropConnect activation function
r        DropConnect output after activation                          r = a(u)
s(.)     Dimension-reduction layer function with parameters W_s
o        Dimension-reduction layer output (network output)            o = s(r; W_s)
θ        All parameters of the network except the weight mask         θ = {W_s, W, W_g}
f(.)     Overall classifier (network)                                 o = f(x; θ, M)
λ        Weight penalty
A(.)     Data loss function                                           A(o - y)
L(.)     Overall objective function                                   L(x, y) = Σ_i A(o_i - y_i) + (1/2) λ ||W||_2^2
n        Dimension of the feature extractor output
d        Dimension of the DropConnect layer output
k        Number of classes                                            dim(y) = k

Table 1. Symbol Table.
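For readers who prefer code to notation, the forward pass implied by Table 1 is the composition $o = s(a(h(g(x; W_g); W, M)); W_s)$. The sketch below fixes concrete, assumed choices (a linear feature extractor, ReLU for $a(\cdot)$, softmax for $s(\cdot)$) that the table itself does not prescribe.

    import numpy as np

    def forward(x, Wg, W, M, Ws):
        """Single-mask DropConnect forward pass written with the symbols of Table 1."""
        v = Wg @ x                      # v = g(x; Wg): feature extractor output
        u = (M * W) @ v                 # u = h(v; W, M): masked DropConnect layer
        r = np.maximum(u, 0.0)          # r = a(u): activation (ReLU assumed here)
        z = Ws @ r                      # dimension-reduction layer s(r; Ws) ...
        o = np.exp(z - z.max())
        return o / o.sum()              # ... with a softmax, giving o = f(x; θ, M)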

References

V. Koltchinskii and D. Panchenko. Empirical margin distributions and bounding the generalization error of combined classifiers. Annals of Statistics, 30, 2002.

M. Ledoux and M. Talagrand. Probability in Banach Spaces. Springer, New York, 1991.
