Enhanced Neural Network Training Using Selective
Backpropagation and Forward Propagation
Shiri Bendelac
Thesis submitted to the Faculty of the
Virginia Polytechnic Institute and State University
in partial fulfillment of the requirements for the degree of
Master of Science
in
Computer Engineering
Joseph M. Ernst, Co-chair
Jia-Bin Huang, Co-chair
Christopher L. Wyatt
William C. Headley
May 7, 2018
Blacksburg, Virginia
Keywords: Machine Learning, Neural Networks, Convolutional Neural Networks,
Backpropagation, Selective Training
Copyright 2018, Shiri Bendelac
Enhanced Neural Network Training Using Selective
Backpropagation and Forward Propagation
Shiri Bendelac
(ABSTRACT)
Neural networks are making headlines every day as the tool of the future, powering artificial
intelligence programs and supporting technologies never seen before. However, the training
of neural networks can take days or even weeks for bigger networks, and requires the use of
supercomputers and GPUs in academia and industry in order to achieve state-of-the-art
results. This thesis discusses employing selective measures to determine when to backpropagate
and forward propagate in order to reduce training time while maintaining classification
performance. This thesis tests these new algorithms on the MNIST and CASIA datasets,
and achieves successful results with both algorithms on the two datasets. The selective
backpropagation algorithm shows a reduction of up to 93.3% in backpropagations completed, and
the selective forward propagation algorithm shows a reduction of up to 72.90% in forward
propagations and backpropagations completed, compared to baseline runs of always forward
propagating and backpropagating. This work also discusses employing the selective
backpropagation algorithm on a modified dataset with disproportional under-representation of
some classes compared to others.
Enhanced Neural Network Training Using Selective
Backpropagation and Forward Propagation
Shiri Bendelac
(GENERAL AUDIENCE ABSTRACT)
Neural Networks are some of the most commonly used and best performing tools in machine
learning. However, training them to perform well is a tedious task that can take days or even
weeks, since bigger networks perform better but take exponentially longer to train. What
can be done to reduce training time? Imagine a student studying for a test. The student
likely solves practice problems spanning the different topics that may appear on the test.
The student then evaluates which topics he/she knew well, and forgoes extensive practice
and review on those in favor of focusing on topics he/she missed or was not as confident
on. This thesis discusses following a similar approach in training neural networks in order
to reduce the training time needed to achieve desired performance levels.
Acknowledgments
I would like to thank my committee, Dr. Ernst, Dr. Huang, Dr. Wyatt, and Dr. Headley,
for supporting me in my work and guiding me throughout my undergraduate and graduate
career. Thank you to everyone at the Hume Center for being a valuable resource and
encouraging me to take on new challenges. Thank you to the OPM for supporting my
academic endeavors through the SFS program. Thank you to my friends for the great
memories from the past five years at VT. Thank you to everyone I’ve worked and interacted
with in the past few years who helped me get to this point. And finally, a big thank you to
my family for their endless love and support.
Contents

List of Figures
List of Tables

1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Related Work
1.4 Organization of Paper

2 Datasets
2.1 MNIST
2.2 CASIA
2.2.1 Preprocessing

3 NN Library
3.1 Motivation for Developing a New Library
3.1.1 Understanding GPUs
3.2 Library Qualification
3.3 Library API
3.3.1 NeuralNet
3.3.2 ImageIn
3.4 MLP
3.5 CNN
3.6 Pooling
3.7 Activation Functions
3.8 Training
3.9 dE/dw
3.10 Validation, Testing, and Weights Logging
3.11 Augmentation
3.12 Input File Parameters

4 Selective Backpropagation
4.1 Motivation
4.2 Procedure
4.3 Results
4.3.1 MNIST
4.3.2 CASIA

5 Selective Forward Propagation
5.1 Motivation
5.2 Procedure
5.3 Results
5.3.1 MNIST
5.3.2 CASIA

6 Future Work
6.1 Future Work

7 Conclusions
7.1 Conclusions

Bibliography

Appendices
A Mathematical Derivations
A.1 Forward and Backpropagation
A.2 Softmax and Cross Entropy
List of Figures

2.1 Sample digits from the MNIST dataset. The training set includes 6,000 pictures from each category, each of which is 28x28 grayscale pixels.
2.2 Two images of the same Chinese character, before and after preprocessing to resize, center, and increase contrast.
3.1 Sample multilayer perceptron neural network with four inputs, one hidden layer with six neurons, and three outputs.
3.2 A neural network neuron j calculates its output based on its inputs, weights, bias, and activation function.
3.3 Two types of pooling: max pooling and average pooling, which may be added after a convolutional layer in order to downsample its input.
3.4 Commonly used activation functions. In recent years, ReLU has become the most popular.
3.5 Augmentation visualization tool, showing the original image on the left and the augmented version on the right.
4.1 Number of backpropagations vs. epoch duration (s) with a 0.9 BP filter.
4.2 Epoch duration during training with BP 1.0 under different environment conditions.
4.3 Architecture of the CNN used to classify MNIST.
4.4 MNIST testing accuracy with different BP thresholds. When plotting accuracy as a function of epochs passed, the curves show no significant variation between them.
4.5 MNIST testing accuracy with different BP thresholds. BP 1.0, the baseline of always backpropagating, takes longer (more BPs) to achieve the same performance as the other curves on the graph. It catches up but does not show better performance than runs with a lower BP threshold in the long run.
4.6 Zooming in on the initial relevant section of MNIST testing accuracy with different BP thresholds. BP 1.0, the baseline of always backpropagating, initially significantly underperforms runs that selectively backpropagate, and takes time to catch up to them after they plateau.
4.7 Plotting both the performance on the testing set and the number of BPs performed in each epoch shows the rapid decrease in BPs performed, dropping below 10,000 after only 7 epochs from the initial 54,000, which would remain constant for a baseline test. By that point, performance reaches 96.04% on the testing set. It goes on to pass 98.7%, with the majority of that time spent completing fewer than 5,000 BPs per epoch.
4.8 Testing accuracy over BPs on the disproportional MNIST dataset.
4.9 Error matrices early in the training process, training a neural network on a disproportional MNIST dataset.
4.10 Histogram of total misses per class early in the training process.
4.11 Error matrices at the end of the training process, training a neural network on a disproportional MNIST dataset.
4.12 Histogram of total misses per class at the end of the training process.
4.13 Architecture of the CNN used to classify CASIA.
4.14 CASIA testing accuracy with different BP thresholds. When plotting accuracy as a function of epochs passed, the curves show no significant variation between them.
4.15 CASIA testing accuracy with different BP thresholds. BP 1.0, the baseline of always backpropagating, stagnates behind all other curves, taking at least 2.2 times as long to reach 92% testing accuracy.
4.16 Zooming in on the relevant section of CASIA testing accuracy with different BP thresholds. BP 1.0, the baseline of always backpropagating, significantly underperforms runs that selectively backpropagate, and shows no sign of catching up to the rate at which they improve.
5.1 Number of forward propagations vs. epoch duration (s) at FP threshold = 0.5 and max delay = 15, showing a positive linear relationship between FPs and epoch duration.
5.2 Testing accuracy over FPs on the standard MNIST dataset with different FP max delays. In all runs, the FP threshold is set to 0.5. Over time, all runs converge to approximately the same level of performance.
5.3 Zooming in on the initial relevant section of MNIST testing accuracy with different FP max delays, with the FP threshold set to 0.5. FP 1, the baseline of always forward propagating and always backpropagating, lags behind the other curves.
5.4 Training on different subsets of MNIST shows that using the selective BP and FP algorithms results in increased generalization variability over the baseline.
5.5 Testing accuracy over FPs on the CASIA subset dataset with various FP max delays, with the FP threshold set to 0.5. Over time, runs converge to approximately the same performance on the testing set.
5.6 Zooming in on the initial relevant section of Fig. 5.5. FP max delay of 1, the baseline, is seen lagging behind the other curves before they all plateau.

List of Tables

3.1 Library verification results showing the integration process for features and validation against another framework.
5.1 Analysis of BP and FP combinations, showing potential benefits in performance and time reduction.
7.1 Summary of time improvements achieved with selective BP and selective FP, including on the modified imbalanced MNIST.
Chapter 1
Introduction
1.1 Motivation
Artificial Intelligence has become one of the biggest fields in industry and academia, with
neural networks becoming a tool widely deployed to solve all sorts of different problems,
from medical applications such as detecting cancer [1, 2], to self-driving cars [3, 4, 5], Optical
Character Recognition (OCR) [6, 7], cyber security [8, 9], face recognition [10], and cognitive
radios and spectrum sensing [11, 12, 13], to name a few.
Training a neural network (NN), however, takes time. The more data available to train the
network, and the bigger the network, the longer the training takes. This can result in training
routines that take weeks [14, 15, 16]. While letting the training phase take that long may
lead to record-breaking network performance, it may be an impractical constraint that
takes a toll on the ability to train networks quickly. Furthermore, it is the reason that tech
giants invest in farms of Graphics Processing Units (GPUs), an expensive technology, to try
and reduce the training time, while others are testing other specialized hardware designed
for these computations, as well as for faster inference [17, 18, 19, 20].
This paper proposes changes to the training routine by taking a closer look at the forward
and backpropagation, and introducing a new decision process for whether or not to train
on certain input vectors. The results discussed show that training with serial routines
takes significantly less time. Runs that employ the selective BP algorithm
show a reduction of up to 93.3% of backpropagations completed, and runs with the selective
forward propagation algorithm show a reduction of up to 72.90% in forward propagations
and backpropagations completed compared to baseline runs of always forward propagating
and backpropagating. Employing selective algorithms to convergence shows they plateau
at approximately the same level of performance as baseline, although they do show more
variability in generalization. The selective BP algorithm is also examined on imbalanced
data and shows equal promise in reducing training time.
1.2 Contributions
This research seeks to discuss an opportunity to reduce training time without hindering per-
formance by introducing conditional forward propagation and backpropagation, as opposed
to current methods of forward propagating and backpropagating every single input vector
iteratively until performance ceases to improve. In doing so, a faster training algorithm is
developed that can be used on lower-cost and lower-resource platforms, such as low-end CPUs,
embedded devices, in-the-field devices, and time-sensitive applications where the speed of
the learning curve is critical. The work also shows potential improvement in training time
on imbalanced datasets.
Additionally, a significant effort in this work went into the development of a framework that
supports these algorithms. The source code is expected to be released for public use in the
near future.
The work discussed in Chapter 4 of this thesis will also be submitted as a conference paper.
1.3 Related Work
Different algorithms have been proposed under the name of selective training or selective
backpropagation, or otherwise resembling some of the work in this thesis. Engelbrecht pro-
posed two modifications to neural network training routines: incremental learning and selec-
tive learning [21]. Both are centered around the idea of training the network to convergence
on small subsets of the training set before exposing the network to other data, either by
adding the new subset to the subsets already used in training (incremental learning), or by
switching to the new subset and keeping the training subset used for any given epoch small.
These algorithms are both different from the work discussed here, which uses the network’s
performance as a guide in selecting which vectors to use for training.
Some work has been done to introduce another form of selective backpropagation on RBF
networks [22]. In this work, regular backpropagation is used for the majority of the training,
then when the loss function is minimal, binary backpropagation is used based on
correct/incorrect classification of vectors. This was done primarily to prevent overtraining of RBF
networks, and learn the last few unlearned vectors in the training set to reach better perfor-
mance.
Listprop [23] is another algorithm that shares resemblance with the selective BP algorithm
described here. This algorithm builds a list of which output neurons should be included in
the backpropagation, hence its name. Little work exists that fully tests Listprop's potential,
though it largely targeted weight saturation. Listprop showed some reduction in training
time and increased peak performance, though this work was done with a 3-layered MLP, and
the benefits may decrease with deeper networks.
The work discussed in this thesis is performed on classification problems using CNNs. How-
ever, in the object detection field specifically, focal loss [24] and hard example mining [25]
follow similar ideas. Focal loss modifies the equation for cross entropy loss to better overcome
class imbalance in a one-stage object detector. Hard example mining is an example-bootstrapping
technique for neural networks on detection problems, with some resemblance to the incremental
learning algorithm discussed earlier. This also helps overcome class imbalance, which is
common in object detection problems with regards to balancing foreground and background
regions of interest (RoIs). Shrivastava’s work, similarly to the work discussed here, only
backpropagates difficult vectors, which it picks out as training takes place. The hard exam-
ple mining happens in two stages: first, the dataset is forward passed through the network,
and the difficult vectors are picked out. Then, once enough vectors have been marked as
problematic, the training occurs, which involves repeatedly backpropagating all those vec-
tors. Once this is done, the process returns to iterating through the training set to
collect another subset of problematic vectors. This differs from the work in this thesis, where
each batch is only backpropagated once before the network continues to iterate through the
training set. Repeatedly using the vectors that have already been selected as hard allows the
network to spend more time training on difficult examples, rather than forward propagating
the majority of the training set that it already performs well on. However, the work in this
thesis achieves similar theoretical performance when combined with selective forward
propagation: after one epoch of examining all vectors, the network would know to skip the easy
examples, meaning the next few epochs would consist only of forward and backpropagating
the challenging vectors that the hard example mining algorithm would iterate through.
1.4 Organization of Paper
Chapter 2 provides an introduction to the datasets used in this work, MNIST and CASIA.
The chapter explains why these two datasets were chosen, and what pre-processing was
completed prior to training the network on this data, as well as the motivations for these
choices.
Chapter 3 discusses the NN library developed for this research. The chapter opens with a
survey of existing frameworks, as well as factors that were considered for this application,
explaining why the ultimate decision was to develop a new framework. This is followed by a
deeper examination of the library, including its structure and public API, as well as a discus-
sion of different features implemented in the library, including MLP and convolution, types
of pooling, activation functions, training validation and testing procedures, dE/dw normal-
ization, weights logging, data augmentation, and explanation of the input configuration file
format.
Chapter 4 discusses the selective backpropagation portion of this work, including an in-depth
discussion of the motivation for this method of training, the approach used, and analysis
of the results achieved on both the MNIST and CASIA datasets, as well as a modified
imbalanced MNIST subset.
Chapter 5 delves into the other portion of this work, that is, selective forward propagation.
Following a similar structure, it discusses the motivation and rationale for applying a selective
algorithm on the forward propagation element of training neural networks, the approach
used, and discusses results achieved on MNIST and CASIA. This chapter also includes an
analysis of the algorithms’ generalization.
Chapter 6 suggests future work that may be completed in order to further develop ideas and
results displayed in this work.
Finally, Chapter 7 concludes this paper with a summary of the work completed, its motiva-
tion, and the results achieved.
Chapter 2
Datasets
For deep neural networks to work well, large amounts of data are needed for training. This
chapter discusses the two datasets used for this thesis: MNIST and CASIA, including infor-
mation about them, as well as why they were chosen.
2.1 MNIST
First published in [26], MNIST [27] has become one of the most popular databases to test
neural networks [28]. MNIST consists of images of handwritten digits, as seen in Fig. 2.1. The
images are in grayscale, so each pixel is represented as an unsigned byte with values ranging
from 0 to 255, and each image is composed of 784 pixels (28x28). Images were preprocessed to
normalize around their center of mass. MNIST provides a training set of 60,000 pictures,
and a testing set of an additional 10,000 pictures. There is no overlap of authors between
the two sets, to ensure the handwriting in each is unique. MNIST is frequently referred
to as a toy problem [29, 30] due to the ease of solving it compared to other datasets with
more classes, more samples, and larger inputs. Because of this, performance on MNIST is
well documented and it is often used as a ‘Hello World’ for ML and a test case for new
algorithms.
Figure 2.1: Sample digits from the MNIST dataset. The training set includes 6,000 pictures from each category, each of which is 28x28 grayscale pixels.
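As background, the published MNIST files use the simple IDX binary layout: a big-endian header (magic number, image count, rows, columns as 32-bit integers) followed by one unsigned byte per pixel. The following is a minimal C++ reading sketch; the IDX layout comes from the public MNIST distribution, not from this thesis, and the function names are illustrative.

// Sketch: loading an MNIST idx3 image file (big-endian header, then one
// unsigned byte per pixel in 0-255). Illustrative only.
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

static uint32_t readBigEndian32(std::ifstream &in) {
    unsigned char b[4];
    in.read(reinterpret_cast<char *>(b), 4);
    return (uint32_t(b[0]) << 24) | (uint32_t(b[1]) << 16) |
           (uint32_t(b[2]) << 8) | uint32_t(b[3]);
}

std::vector<std::vector<uint8_t>> loadMnistImages(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    if (readBigEndian32(in) != 2051)               // idx3 magic number
        throw std::runtime_error("not an MNIST image file");
    uint32_t count = readBigEndian32(in);
    uint32_t rows = readBigEndian32(in);           // 28 for MNIST
    uint32_t cols = readBigEndian32(in);           // 28 for MNIST
    std::vector<std::vector<uint8_t>> images(count,
        std::vector<uint8_t>(rows * cols));
    for (auto &img : images)                       // one byte per pixel
        in.read(reinterpret_cast<char *>(img.data()), img.size());
    return images;
}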
2.2 CASIA
The Institute of Automation of Chinese Academy of Sciences (CASIA) offers several datasets
of handwritten Chinese characters, which were used for the ICDAR 2011 and 2013 competi-
tions [31]. The data used for this work was of offline handwriting, where inputs are pictures
of the handwritten characters, as opposed to online OCR which includes chronological in-
formation of how the character was drawn, offering information on the direction, speed, and
order of strokes the author used [32]. This set consists of 3755 classes (or characters) with
200 samples per class [33, 34]. Images are grayscale, and initially each image is a different
size. Similarly to the work in [14], images were preprocessed to be 48x48 pixels. Data is
encoded in one file, one image at a time, each prefixed by a 10 byte header, summing to a
constant 2314 bytes after the preprocessing takes place. The header contains the image’s
total size, its tag, and its width and height. This is followed by the image, each pixel taking
one byte. Training on the dataset as-is would have taken a significant amount of time.
Instead, a subset was created of the first 100 labels. The training set used included all samples
from the original training set of those 100 labels. Likewise, the new testing set used included
(a) Chinese character 61111 sample 88 (b) Chinese character 61111 sample 8197
Figure 2.2: Two images of the same Chinese character, before and after preprocessing to resize, center, and increase contrast.
all samples of those 100 labels from the original testing set.
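As a sketch of how one record of the file format described above might be parsed, the following assumes the 10-byte header is laid out as a 4-byte total size, a 2-byte tag, and 2-byte width and height fields (an assumption consistent with the stated 10-byte header and the constant 2314-byte record, since 10 + 48*48 = 2314), read on a little-endian host. All names are illustrative.

// Sketch: reading one preprocessed CASIA record. The field widths are an
// assumption that fits the 10-byte header described in the text.
#include <cstdint>
#include <fstream>
#include <vector>

struct CasiaSample {
    uint16_t tag;                 // character label code
    uint16_t width, height;       // 48x48 after preprocessing
    std::vector<uint8_t> pixels;  // one grayscale byte per pixel
};

bool readCasiaSample(std::ifstream &in, CasiaSample &out) {
    uint32_t totalSize;           // whole record size, header included
    if (!in.read(reinterpret_cast<char *>(&totalSize), 4)) return false;
    in.read(reinterpret_cast<char *>(&out.tag), 2);
    in.read(reinterpret_cast<char *>(&out.width), 2);
    in.read(reinterpret_cast<char *>(&out.height), 2);
    out.pixels.resize(size_t(out.width) * out.height);  // 2304 bytes here
    in.read(reinterpret_cast<char *>(out.pixels.data()), out.pixels.size());
    return bool(in);
}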
CASIA is a much more challenging dataset for a NN than MNIST. The images are larger, and
there are more classes, even with a subset of only 100 labels. Additionally, Chinese characters
have more strokes and features, and handwriting styles offer greater variability [14], including
due to changes in handwriting over time and frequency of characters [31]. CASIA also has
only a few hundred samples per class, compared to thousands in MNIST. For all of these
reasons, CASIA is a far more challenging dataset to train on than MNIST, and it is
used in this thesis to test the algorithms discussed on more difficult problems. On the other
hand, CASIA was chosen over other popular difficult problems to ensure that even with a
minimized dataset, debugging of the framework and training could be completed under the
set time constraints.
2.2.1 Preprocessing
The CASIA dataset was preprocessed once to create a modified dataset. This processing was
primarily performed in order to resize the pictures to a constant size of 48x48 pixels from
their original varying sizes. A further motive, however, was to alter the image contrast and
center the image around its center of gravity, to facilitate feature extraction by the neural
network later on. In order to resize images, their
center of gravity (COG) was computed, as well as their variance in the X and Y directions.
Images were then centered around their COG and a scaled bitmap was computed to stretch
the image to 48x48 pixels while maintaining previous proportions. New pixels (mostly created
by shifting the COG) were filled with the detected background color. Finally, the picture's
contrast was increased by finding the darkest pixel in the image and using it as a reference
point to scale all the pixels' darkness between 0 and 255. This preprocessing can be seen in
Fig. 2.2, which shows two different samples of the same classification. In each subfigure, the
image on the left is the original from the database, drawn with a black rectangle around its
margins and centered at the calculated COG, which is marked by a red dot. The images on the
right are the output of the postprocessing, which includes centering around the COG, stretching
proportionally according to the standard deviation to fit 4.5 sigmas into the frame, and
increasing the contrast of the image.
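A condensed sketch of the two simpler steps, computing the center of gravity and stretching the contrast so the darkest pixel maps to full darkness, might look as follows. The proportional 4.5-sigma rescaling is omitted for brevity, and the assumption that darker ink means a lower pixel value on a light background is mine, not stated in the text.

// Sketch: center-of-gravity and contrast-stretch steps of the CASIA
// preprocessing, under the assumptions noted above.
#include <cstdint>
#include <vector>

// Intensity-weighted center of gravity, weighting by "ink" (darkness).
void centerOfGravity(const std::vector<uint8_t> &img, int w, int h,
                     double &cx, double &cy) {
    double mass = 0.0; cx = cy = 0.0;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            double ink = 255.0 - img[y * w + x];   // darkness as weight
            mass += ink; cx += ink * x; cy += ink * y;
        }
    if (mass > 0.0) { cx /= mass; cy /= mass; }
}

// Stretch contrast so the darkest pixel becomes 0 and background stays 255.
void stretchContrast(std::vector<uint8_t> &img) {
    uint8_t darkest = 255;
    for (uint8_t p : img) if (p < darkest) darkest = p;
    if (darkest == 255) return;                    // blank image, nothing to do
    for (uint8_t &p : img)
        p = uint8_t(255.0 * (p - darkest) / (255.0 - darkest));
}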
Chapter 3
NN Library
3.1 Motivation for Developing a New Library
One of the first steps in this research was to identify a machine learning framework that could
be modified to test the effectiveness of the algorithmic modifications developed. While several
frameworks were investigated, ultimately the decision was made to create a new framework
to facilitate this investigation. This section details the motivation for the creation of this
framework.
Early on, the problem statement was established: trying to expedite the training routine
of neural networks. The common approach to training involves iterating through the
training set, passing the training vectors cyclically through the network and backpropagating
to adjust the network until validation stops improving. The experiment of this thesis was to try to
accelerate this process with two methods: first, only backpropagate, or update the network,
based on training inputs that are misclassified, or classified with a low confidence; second,
try to detect which vectors the network is performing well enough on that they may be
skipped for a few iterations, and predict how long these vectors can be skipped. These ideas
are explained in much more detail, along with their achieved results, in Chapters 4-5. The
intention was to be able to train a neural network with these changes in place, and see their
effect on the training speed as well as on the peak performance achieved.
From prior experience with neural networks, as well as from literature, it was clear early on
that the computation required for training neural networks, especially with large datasets
and large networks, is significant. Additionally, this work had a time restriction of one
academic year. With this in mind, a state-of-the-art computer within budget was purchased
for this project, which included the most powerful ‘desktop’ GPU on the market at the time
of purchase. This computer is equipped with a 4.2GHz Intel Core i7-7700K processor, and
an NVIDIA GeForce GTX 1080 Ti GPU with 11GB GDDR5X SDRAM, 3584 CUDA cores,
and a memory bandwidth of 484GBps, with the intention to use an existing framework that
is optimized to take advantage of this GPU.
There are many existing deep learning open-source frameworks, with TensorFlow [35], Keras [36],
Caffe [37], Theano [38], and PyTorch [39] being among the most popular, though there are
other, lesser-known platforms as well [40]. As these frameworks were being studied
as potentials for this work, the algorithms for selective BP and FP were being refined, which
clarified the type of flexibility required of any framework that may be used. Specifically,
forward propagating and then sometimes deciding not to backpropagate, and a system in
place to decide whether or not to forward propagate. None of the frameworks that were
evaluated appeared to offer the option to make these decisions while using the network as a
black-box, which meant that the chosen framework’s source code would need to be modified
in a hopefully controlled and minimal fashion. Two libraries were studied more in depth
than others as this refining process was taking place: Caffe [37], and TinyDNN [41], though
others were also considered. The majority of the open source libraries are written in C++
at their core, though many also offer a Python API. When using an open-source framework,
GPU support was essential, since it can offer a theoretical speed increase of up to 10 times
over a pure CPU implementation [42, 43]. While there are two main manufacturers, NVIDIA
and AMD [19], NVIDIA has a wide lead in the market [44] and its CUDA architecture is
widely supported by many open-source frameworks, including Caffe. However, examining
GPU characteristics more closely revealed a major issue, which requires some background
understanding of GPUs.
3.1.1 Understanding GPUs
GPUs are able to offer such impressive results compared to CPUs thanks to their colossal
parallelization ability [42]. Modern GPUs have over 5,000 independent cores in some models;
the GTX 1080 Ti has 3,584. Each core is a RISC processor capable of running generic C/C++ code. The
NVIDIA GTX 1080 Ti GPU has GDDR5X SDRAM with a memory speed of 11 Gbps and
interface width of 352 bits, resulting in a 484 GBps memory bandwidth [45]. The CPU, for
comparison, has DDR4-3000 memory with a 64-bit interface, resulting in 24 GBps, meaning
the GPU’s memory speed is 20 times faster. The problem is that this is shared across the
3584 cores, as opposed to 8 cores for the CPU (4 physical cores with hyperthreading). The
bottom line is that the GPU’s limited bandwidth serves as its main bottleneck, both to
external memory and to the host. The GPU grid is made up of an array of blocks, each
able to run up to 1024 threads, with one Shared Memory (SM) for each block. The block
runs on a physical warp, which is composed of 32 cores. [46, 47]
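For reference, both quoted bandwidth figures follow directly from the stated memory speeds and interface widths:

$$484\ \mathrm{GBps} = \frac{11\ \mathrm{Gbps} \times 352\ \mathrm{bits}}{8\ \mathrm{bits/byte}}, \qquad 24\ \mathrm{GBps} = \frac{3000\ \mathrm{MT/s} \times 64\ \mathrm{bits}}{8\ \mathrm{bits/byte}}, \qquad 484/24 \approx 20.$$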
It has been shown that convolutional layers take the most time during training of neural
networks, both on CPU and GPU [48, 49], taking as much as 90% of the time spent on a
forward pass. While GPUs are capable of running generic C++ code on independent cores,
the way NN libraries have overcome the memory bandwidth bottleneck to achieve accelerated
learning on a GPU is by parallelizing the batch computations in a way that takes advantage of
the functionality most optimized for GPUs, which is matrix multiplication. General Matrix
Multiplication (GEMM) is part of the Basic Linear Algebra Subprograms (BLAS), which has
been optimized to run on NVIDIA's GPUs using the cuBLAS library.
A fully connected layer can be represented as a vector by matrix multiplication, where the
input to the layer is a $1 \times k$ vector, and the layer of n neurons is a $k \times n$ matrix of weights,
where each column holds one neuron's weights. The layer's output is then a single
$1 \times n$ vector [48]. A convolutional layer can be represented in the same manner, where in the
case of images there may be a 3D matrix as the input, and the weights of each kernel also
form a 3D matrix [48, 50]. In order for the forward propagation in this case to be represented
as matrix multiplication, both are turned into 2D matrices. In the input matrix, each row
represents the inputs of a single kernel, and in the weights matrix each kernel’s weights are
a single column. Since the stride in convolutional layers is typically less than the kernel size,
there is an overlap between the inputs, meaning that during this process there is a large
redundancy of memory. For example, in a 5x5 kernel with stride of 1, a single input point
would be represented in 25 different rows of the input matrix. However, the architecture
of GPUs requires some level of such a redundancy at any rate, due to shared memories
being unique to each block, and the redundancy in memory is outweighed by the time
reduction offered by the optimization of matrix multiplications. These unrolling conversions
from image to matrix format are completed using CUDA’s im2col function. Once in matrix
format, the forward passes can be completed in a much faster fashion using the optimized
cuBLAS library [43, 49]. To get peak performance from unrolling data to matrix format and using
matrix multiplication, libraries combine a mini-batch of images into one matrix, allowing the
whole batch to be computed in parallel one layer at a time. Using larger minibatches shows
increased speeds [51, 52, 53]. However, this method of completing the batch in parallel one
layer at a time does not align with the platform architecture that was wanted for testing
the algorithms discussed in this thesis, so the decision was made to test the algorithms on
a CPU platform, which allows for a simpler integration of these ideas. A study of such
existing frameworks found that most were no longer maintained and had documented and
undocumented bugs, or that their developers had switched to contributing to the popular
platforms instead of maintaining their lightweight frameworks. Ultimately, the decision was made
to develop a new CNN library to run on a CPU without parallelization, in order to assess the
effect of these two algorithms on training time and performance. This library is written in
C++ and designed to allow future integration on GPU platforms. It is discussed in more detail
in the remainder of this chapter.

Table 3.1: Library verification results showing the integration process for features and validation against another framework.

Train (%) | Validation (%) | Testing (%) | Description
98.45     | 98.45          | 98.24       | Ciresan's original code (double, 29x29, scaled hypertan)
98.52     | 98.52          | 98.43       | Ciresan's modified code (double, 29x29, scaled hypertan)
98.51     | 98.517         | 98.45       | Ciresan's modified code (float, 29x29, scaled hypertan)
98.51     | 98.503         | 98.5        | Ciresan's modified code (float, 28x28, scaled hypertan)
99.7      | 98.43          | 98.43       | ShiriNet (float, 28, relu, BP threshold 0.5, LR 0.01, momentum 0, adaptive LR 1.0)
99.89     | 99.55          | 98.41       | ShiriNet (float, 28, relu, BP threshold 0.5, LR 0.005, momentum 0, adaptive LR 1.0)
99.78     | 98.57          | 98.55       | ShiriNet (float, 28, relu, BP threshold 0.5, LR 0.005, momentum 0.9, adaptive LR 1.0)
99.89     | 98.33          | 98.39       | ShiriNet (float, 28, relu, BP threshold 0.5, LR 0.005, momentum 0, adaptive LR 0.999)
99.89     | 98.02          | 97.99       | ShiriNet (float, 28, htan, BP threshold 0.5, LR 0.01, momentum 0.9, adaptive LR 1.0)
99.73     | 98.45          | 98.39       | ShiriNet (float, 28, relu, BP threshold 1, LR 0.01, momentum 0.9, adaptive LR 1.0)
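To make the im2col unrolling discussed earlier in this section concrete, here is a minimal single-channel sketch. It illustrates the general technique only; the thesis library deliberately avoids this batched-GEMM structure, and the function name is illustrative.

// Sketch: im2col for a single-channel input, so that convolution becomes one
// matrix multiply (one row per kernel application, one column per weight).
#include <vector>

std::vector<float> im2col(const std::vector<float> &in, int H, int W,
                          int k, int stride) {
    int outH = (H - k) / stride + 1;
    int outW = (W - k) / stride + 1;
    std::vector<float> cols;                       // (outH*outW) x (k*k)
    cols.reserve(size_t(outH) * outW * k * k);
    for (int oy = 0; oy < outH; ++oy)
        for (int ox = 0; ox < outW; ++ox)          // one row per output pixel
            for (int ky = 0; ky < k; ++ky)
                for (int kx = 0; kx < k; ++kx)     // note the redundancy: with
                    cols.push_back(                // stride < k, a pixel lands
                        in[(oy * stride + ky) * W  // in many rows
                           + (ox * stride + kx)]);
    return cols;
}
// The convolution output is then this (outH*outW x k*k) matrix times the
// kernel weight matrix (k*k x number of maps), computed with an optimized
// GEMM such as cuBLAS.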
3.2 Library Qualification
In an effort to verify that the library works as expected, its performance on MNIST was compared
to documented performance. Additionally, each implemented feature was tested separately,
and tests with multiple features enabled were approached incrementally to verify stability.
Features that are binary were tested for both modes, and variable parameters (e.g., learning
rate, momentum, etc.) were tested via scan tests with ranging values. MLP performance was
tested in comparison to performance documented in [26]. Behavior on CNNs was compared
to performance of open source code published by Ciresan [54], which was gradually modified
to match the library by using floats instead of doubles and an input of 28x28 instead of
29x29. These tests are documented in Table 3.1.
3.3 Library API
The code is designed to only have two public elements: an ImageIn object (for CPU or
GPU in the future) and a NeuralNet object. All other classes are private and need not be
interfaced with by the end user.
3.3.1 NeuralNet
The NeuralNet class offers the following public interface:
class NeuralNet {
public:
    NeuralNet(ImageIn *_imageIn);
    NeuralNet(ImageIn *_imageIn, std::string configString);
    ~NeuralNet();
    int buildNet(std::string configString);
    int train(endOfEpochCallback statisticsCallback);
    void classify(const float *inputs, int &classification, float &confidence);
    int testAnalysis(float *errorMatrix);
};
where calling the second constructor is equivalent to calling the first followed by buildNet.
The train method takes as input an endOfEpochCallback function pointer, defined as:
typedef void (endOfEpochCallback)(unsigned int epoch, float totalError,
    float stepSize, float trainingPercentage, float saturationPercentage,
    float limitPercentage, float validationPercentage,
    float testingPercentage, int numOfBackprops, int numOfForwardprops,
    float epochDur, std::string logFileName);
This endOfEpochCallback function will be called at the end of every epoch, passing
parameters to the user regarding the training process. The user can then log these to
a file, print to the screen, or perform any other analysis and/or logging of their choosing.
The classify method can be used to perform inference on a single vector input.
The testAnalysis method, when called, iterates through the testing set and fills an error
matrix indicating how many vectors of each classification were misclassified, and what the
network classified them as.
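A minimal usage sketch of this interface might look as follows. It assumes the library's public header, a hypothetical CpuImageIn subclass of ImageIn, and an abridged configuration string in the format of Section 3.12; the callback simply prints a few of the reported statistics.

// Sketch of driving the public API. CpuImageIn and the header name are
// hypothetical; only the documented NeuralNet interface is used.
#include <cstdio>
#include <string>
// #include "NeuralNet.h"   // the library's public header (name assumed)

void printStats(unsigned int epoch, float totalError, float stepSize,
                float trainingPercentage, float saturationPercentage,
                float limitPercentage, float validationPercentage,
                float testingPercentage, int numOfBackprops,
                int numOfForwardprops, float epochDur,
                std::string logFileName) {
    std::printf("epoch %u: validation %.2f%%, %d BPs, %.1fs\n",
                epoch, validationPercentage, numOfBackprops, epochDur);
}

int main() {
    CpuImageIn imageIn("train-images", "train-labels");  // hypothetical
    std::string config =                                 // abridged example
        "LEARNING_RATE 0.01\n"
        "BATCH_SIZE 10\n"
        "BP_THRESHOLD 0.9\n"
        "FULLY_CONNECTED {\n"
        "LAYER_SIZE 10\n"
        "ACTIVATION_FUNCTION softmax\n"
        "}\n";
    NeuralNet net(&imageIn, config);   // builds the net from the config
    net.train(printStats);             // trains until the stop criterion

    float input[28 * 28] = {0};        // one vector to classify
    int label; float confidence;
    net.classify(input, label, confidence);
    return 0;
}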
3.3.2 ImageIn
ImageIn is a virtual class, designed to be inherited by implementations for different plat-
forms, including CPU and GPU. The main application passes a reference of its ImageIn to
the NeuralNet via its constructor (as seen in the documentation above), which then directly
interacts with it to get input vectors during training. ImageIn handles shuffling of the
training vectors, as well as input normalization and augmentation. Input normalization includes
the following settings:
enum scalingMode_t { SCALING_NONE, SCALING_DEFAULT, SCALING_INDIVIDUAL, SCALING_GLOBAL };
SCALING_NONE, of course, provides the NeuralNet with the input as is, without performing
any additional normalization on it. SCALING_DEFAULT translates data from being between
0 and 255 to between -0.5 and 0.5. With SCALING_INDIVIDUAL, the training set is scanned
to find the mean and standard deviation of each feature across the set, which are used to
normalize the samples individually. This may be useful for datasets such as the NSL-KDD
dataset [55], where some features are continuous and some discrete, with no predefined range
that can be used for normalization; it is therefore not as applicable to pictures. Finally,
SCALING_GLOBAL scans the entire training set to find the global mean and standard deviation,
which are then used to normalize inputs.
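A sketch of the SCALING_GLOBAL computation, under the straightforward interpretation of the description above, would be:

// Sketch: SCALING_GLOBAL-style normalization. One mean and standard
// deviation are computed over every value in the training set, then
// applied to each input. Interpretation of the description above.
#include <cmath>
#include <vector>

void globalNormalize(std::vector<std::vector<float>> &trainingSet) {
    double sum = 0.0, sumSq = 0.0, n = 0.0;
    for (const auto &vec : trainingSet)
        for (float v : vec) { sum += v; sumSq += double(v) * v; n += 1.0; }
    double mean = sum / n;
    double stdDev = std::sqrt(sumSq / n - mean * mean);
    if (stdDev == 0.0) stdDev = 1.0;               // guard against flat data
    for (auto &vec : trainingSet)
        for (float &v : vec) v = float((v - mean) / stdDev);
}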
3.4 MLP
An Artificial Neural Network (ANN) is composed of a collection of objects called neurons. In
a Deep Neural Network (DNN), many neurons are connected in parallel in a single layer, and
the output of each layer is used as the input to another layer [56]. In Multilayer Perceptron
(MLP) Neural Networks (NNs) specifically, all the outputs of a given layer are connected as
inputs to each of the neurons in the next layer (Fig. 3.1), and each such connection has its
own weight. Each neuron acts as a multiplier and adder: it takes several inputs, multiplies
each by a corresponding weight, and adds those products plus a bias $b_j$:
$net_j := \sum_i x_i w_{ij} + b_j$.
This sum is then typically passed through a non-linear activation function, so the output
of the neuron is $O_j = \varphi(net_j)$ (Fig. 3.2).
Figure 3.1: Sample multilayer perceptron neural network with four inputs, one hidden layer with six neurons, and three outputs.

Figure 3.2: A neural network neuron j calculates its output based on its inputs, weights, bias, and activation function.
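In code, a single neuron's forward computation reduces to a few lines; here is a minimal sketch (using ReLU as the activation φ, one of the options the library supports), with illustrative names:

// Sketch: one neuron's forward pass, net_j = sum_i x_i * w_ij + b_j,
// followed by an activation function (ReLU here as an example).
#include <algorithm>
#include <vector>

float neuronOutput(const std::vector<float> &x,   // inputs x_i
                   const std::vector<float> &w,   // weights w_ij
                   float bias) {
    float net = bias;
    for (size_t i = 0; i < x.size(); ++i)
        net += x[i] * w[i];                        // accumulate x_i * w_ij
    return std::max(net, 0.0f);                    // O_j = phi(net_j), ReLU
}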
3.5 CNN
An MLP architecture’s full connectivity is a simple structure that is easy to expand or
reduce in size for varying applications. Its disadvantages, however, compared to the more
complex Convolutional Neural Network (CNN) are that it requires more weights and hence
more memory, and in cases of image processing as discussed in this paper, an MLP loses
the spatial information of its input, that is, any two dimensional information about relations
between neighboring pixels [57, 58]. CNNs, by contrast, are designed to require far fewer
weights by having neurons share weights.
In a CNN, a single layer is composed of several feature maps. Each feature map has a
specified kernel size, such as $n \times n$, which uses $n^2 \cdot m_{x-1}$ weights, where $m_{x-1}$ is the number
of feature maps in the previous layer. All these weights are multiplied by their inputs to
calculate the net, similarly to MLPs. Kernels may overlap with each other, sharing their
inputs, as they form the new feature maps. The key feature of CNNs, however, is that the
neurons, or kernels, in a given feature map share their weights. This allows for a feature
map to train to recognize a certain feature in the input, then scan for said feature anywhere
in the input without needing the redundancy of training separate weights for every neuron
to recognize that pattern. This enables CNNs to better handle spatial information, while
reducing the number of weights needed.
3.6 Pooling
A pooling layer is frequently added after a convolutional layer. Pooling layers offer a form
of nonlinear downsampling of their inputs by reducing a block of adjacent pixels into a
single datapoint. This allows for a smaller network, and helps detect features in approximate
subregions. There are several possible types of pooling, and this library offers both
max pooling and average pooling. In max pooling, a single output neuron is assigned the
maximum value of its inputs. Average pooling assigns each neuron's output the average of
its inputs. These methods are illustrated in Fig. 3.3. In this example, the input layer is 4 by
4, or 16 datapoints overall, and a pooling layer with a kernel size of 2 results in a reduction
by $1/2^n$, where n is the number of dimensions; this evaluates here to 1/4, leading to
an output with only 4 datapoints [59].

Figure 3.3: Two types of pooling, (a) max pooling and (b) average pooling, which may be added after a convolutional layer in order to downsample its input.
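A sketch of the max-pooling case, for a square input and a non-overlapping kernel of size k, is shown below; average pooling would replace the max with a mean over the same window. Names are illustrative.

// Sketch: non-overlapping max pooling over a square W x W input with a
// k x k kernel (e.g., W = 4, k = 2 reduces 16 datapoints to 4).
#include <algorithm>
#include <vector>

std::vector<float> maxPool(const std::vector<float> &in, int W, int k) {
    int outW = W / k;
    std::vector<float> out(size_t(outW) * outW);
    for (int oy = 0; oy < outW; ++oy)
        for (int ox = 0; ox < outW; ++ox) {
            float best = in[(oy * k) * W + (ox * k)];
            for (int dy = 0; dy < k; ++dy)         // scan the k x k window
                for (int dx = 0; dx < k; ++dx)
                    best = std::max(best,
                        in[(oy * k + dy) * W + (ox * k + dx)]);
            out[oy * outW + ox] = best;            // keep the window maximum
        }
    return out;
}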
3.7 Activation Functions
So that the network is not simply a weighted sum, neurons pass their calculated net
value through a nonlinear activation function to compute their outputs, as seen in Fig. 3.2.
There are several activation functions
referenced in literature. Traditionally, the popular activation functions included hyperbolic
tangent (a sigmoid-shaped curve between -1 and 1), defined as
$f(net) = \frac{2}{1 + e^{-2 \cdot net}} - 1$, and logistic
(a sigmoid-shaped curve between 0 and 1), defined as $f(net) = \frac{1}{1 + e^{-net}}$ [58, 60]. In recent years,
however, the Rectified Linear Unit (ReLU) has become the most popular activation function [30,
61], thanks to its proven performance as well as the minimal mathematical computation it
requires. ReLU is defined as $f(net) = \max(net, 0)$. These activation
functions are plotted in Fig. 3.4. Note that the activation function must be differentiable for
the backpropagation algorithm. In the case of ReLU, a piecewise function, the derivative of
each piece is used, regardless of the discontinuity.

Figure 3.4: Commonly used activation functions. In recent years, ReLU has become the most popular.
Another type of activation function is the Softmax function. Softmax can be used after the
final layer of the network in order to normalize the outputs so that all the confidences add up
to 1. This functionality is also implemented in the library. Softmax is defined in more detail
in Section A.2, along with its Cross Entropy loss function. The section also discusses how
using Softmax as an activation function affects backpropagation and derives the appropriate
mathematical equations.
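The three activations above, and the derivatives used during backpropagation, each reduce to a line or two of code; a minimal sketch (function names are illustrative):

// Sketch: the activation functions above and the derivatives used during
// backpropagation. For ReLU, the derivative of each piece is used.
#include <cmath>

float hyperTan(float net)      { return 2.0f / (1.0f + std::exp(-2.0f * net)) - 1.0f; }
float hyperTanDeriv(float out) { return 1.0f - out * out; }          // in terms of the output

float logistic(float net)      { return 1.0f / (1.0f + std::exp(-net)); }
float logisticDeriv(float out) { return out * (1.0f - out); }        // in terms of the output

float relu(float net)          { return net > 0.0f ? net : 0.0f; }
float reluDeriv(float net)     { return net > 0.0f ? 1.0f : 0.0f; }  // derivative of each piece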
3.8 Training
Before the network can be used to classify inputs, it must first be trained. When the training
begins, the network initializes weights randomly within a given range such that the net for
each neuron will begin in the activation function’s active range. Then, the weights must be
tuned to provide better results. In supervised learning, this involves forward propagating
the training input vectors to get the network’s output. This output is then compared to
the target output in order to calculate an error using some loss function such as (A.1) or
(A.19). Then backpropagation is used to calculate each weight’s contribution to the error,
and gradient descent is used to adjust the weights in an effort to minimize said contribution.
In a simplistic implementation of batch mode, all the training vectors would be forward
propagated, their error would be computed, and the weights would be updated at the end of
an epoch, or an iteration over the entire set. However, this would result in training taking
far too long, and would not always yield the best results [62]. If the weights are updated
after every input vector (a method known as stochastic gradient descent), the weights are
updated much more frequently, but the gradient would be noisier. The library instead offers
a variable mini-batch size, which can be set to 1 for stochastic mode, or otherwise set to any
number of patterns, or input vectors, that should be used to get an average gradient for a
single update of the weights.
The math needed for the forward pass and backpropagation steps of training a neural network
is derived in Appendix A.
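Schematically, one epoch of the mini-batch procedure described above looks like the following sketch, where forwardPass, backpropagate, and applyGradients are placeholders standing in for the library internals rather than its actual functions:

// Sketch of one epoch of mini-batch training; batchSize = 1 gives
// stochastic gradient descent.
#include <vector>

struct Sample { std::vector<float> input; int target; };

// Placeholder declarations standing in for the library internals.
std::vector<float> forwardPass(const std::vector<float> &input);
void backpropagate(const std::vector<float> &output, int target);
void applyGradients();  // applies the averaged gradient, then clears it

void trainOneEpoch(std::vector<Sample> &trainingSet, int batchSize) {
    int inBatch = 0;
    for (Sample &s : trainingSet) {
        std::vector<float> output = forwardPass(s.input);
        backpropagate(output, s.target);   // accumulate dE/dw for this sample
        if (++inBatch == batchSize) {
            applyGradients();              // one weight update per mini-batch
            inBatch = 0;
        }
    }
    if (inBatch > 0) applyGradients();     // flush a final partial batch
}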
3.9 dE/dw
During the training process, a source of variability in common implementations of neural
networks is the rate at which weights are changed. When using gradient descent, weights
are updated iteratively using the equation

$$w_{ij}(k + 1) := w_{ij}(k) - \eta \frac{\partial E}{\partial w_{ij}} \tag{A.5}$$

where $w_{ij}(k + 1)$ is the weight neuron j gives input i at iteration k + 1, $w_{ij}(k)$ is the weight
neuron j gives input i at iteration k, $\eta$ is the learning rate, and $\frac{\partial E}{\partial w_{ij}}$ is the partial derivative
of the error with respect to weight $w_{ij}$. In this equation, the learning rate $\eta$, depending
on the algorithm, may be constant, decrease exponentially or linearly, or be defined as a step
function. The weights and the partial derivatives $\frac{\partial E}{\partial w_{ij}}$ can be seen as multidimensional
vectors. In theory, the learning rate controls the magnitude of the step, while $\frac{\partial E}{\partial w_{ij}}$ provides
the direction for the change. However, the magnitude of $\eta \frac{\partial E}{\partial w_{ij}}$ is not constant unless
$\frac{\partial E}{\partial w_{ij}}$ is normalized, since its magnitude is a function of the total error E. The result of not
normalizing $\frac{\partial E}{\partial w_{ij}}$ is that a larger error results in a larger magnitude of $\frac{\partial E}{\partial w_{ij}}$, and hence is
equivalent to a larger step size, while a smaller error results in a smaller step size [63]. This
vanishing gradient, while not incorrect behavior, is difficult to account for when designing
more complex functions for the learning rate. The library implemented offers an option to
normalize the $\frac{\partial E}{\partial w_{ij}}$ vector prior to updating the weights, thus eliminating an
unaccounted-for source of variability. This essentially turns the equation above into

$$\overrightarrow{w}(k + 1) := \overrightarrow{w}(k) - \eta \, \frac{\overrightarrow{\partial E / \partial w}}{\left\lVert \overrightarrow{\partial E / \partial w} \right\rVert}$$

In doing so, the learning rate entirely controls the magnitude of the step, while $\frac{\partial E}{\partial w_{ij}}$
controls the direction. This allows for more complex learning rate functions to be implemented,
which may be a function of the error, but can also take into account other parameters, such as
confidence, epoch or batch number, etc.
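A sketch of the normalized update, treating all of the weights' partial derivatives as one flat vector g (names are illustrative):

// Sketch: gradient-normalized weight update. The partial derivatives are
// treated as one flat vector g, scaled to unit length so the learning
// rate eta alone controls the step magnitude.
#include <cmath>
#include <vector>

void normalizedUpdate(std::vector<float> &w,        // all weights, flattened
                      const std::vector<float> &g,  // dE/dw, same layout
                      float eta) {
    double norm = 0.0;
    for (float gi : g) norm += double(gi) * gi;
    norm = std::sqrt(norm);
    if (norm == 0.0) return;                        // zero gradient: no update
    for (size_t i = 0; i < w.size(); ++i)
        w[i] -= eta * float(g[i] / norm);           // step magnitude is exactly eta
}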
3.10 Validation, Testing, and Weights Logging
As a neural network is training, its weights are tuned from random values to values producing
the desired output more and more often. Since the training process is long and rigorous, and
the trained network needs to be reusable for quick inference in the final applications, there
is a need to be able to save the state of the network as it is training. Specifically, this means
the weights (and biases) must be stored in order to be reused for the network to be restored
to its previous state. Different libraries and implementations of neural networks use different
approaches to storing weights. Often, the weights are recorded on a periodic basis, such
as every 5000 mini-batches. Another method is to only record the state of the network when
the network’s performance improves, in order to attempt to capture the network at its best
performing state. Therefore, it is important to assess the performance of the network during
the training routine. However, simply looking at the error on the training vectors, whether
as the percentage classified correctly or as the error defined in (A.1) and (A.2), gives a
misleading measurement: those errors assess the network's performance on inputs it has already
been exposed to and trained to fit, and would therefore optimistically extrapolate its behavior
on such (easier) vectors to vectors it has not seen, on which performance would be worse. Instead, a separate
set of testing vectors is used, which the network has not been exposed to. This set offers a
more objective assessment of the network’s performance level.
The library developed also makes use of a third set, used for validation. Like the testing
set, the validation set is composed of vectors the network has not trained on, and therefore
offers an unbiased assessment of its performance. Overfitting occurs when the network has
been exposed to a finite set of training inputs which do not accurately portray the entirety
of the true classification. When this occurs, the training error may continue to decrease,
while the error on an independent set of inputs starts to increase. Since the testing set
is used to measure performance and should not bias the network’s training routine in any
Figure 3.5: Augmentation visualization tool, showing the original image on the left and the augmented version on the right.
way, the validation set is used to monitor for such cases. The library implemented saves
the network’s weights to a binary file in order to be used later for continued training or for
inference purposes. The training routine was designed to use the validation performance
as an indicator of when the network’s performance is increasing, and only save the weights
then, in order to ensure the best state of the network is the one saved for future use.
3.11 Augmentation
Overlearning, or overfitting, as described above, is a phenomenon that is likely to occur if the
training set is too small, or does not represent the true categories well. For example, when
training on MNIST, if most pictures of a 3 had the top right pixel turned on, the network
training may notice this pixel and give it a large weight, since it highly correlates with the
output being 3. At first, as the weights move from random, which gives approximately a 10%
accuracy, and converge towards better results, the training, validation, and testing would all
improve. However, a 3 does not really have a dot at the top right corner. Eventually, training
accuracy would continue improving as the network picks up on fine-detail patterns in the
training set that are not truly characteristic of the classes, while the validation and testing
sets, which do not share these finer details, would begin to deviate from the performance
of the training set. One solution to avoiding this result is to use data augmentation in order
to artificially expand the training set and expose the network to more possibilities. The
library adds support for this in the form of three optional transformations. The first is
translation, meaning slightly moving the picture randomly along the X and Y axes. This
allows the network to recognize the images even if they are slightly displaced. The second
transformation is rotation, which rotates the images randomly a few degrees clockwise or
counterclockwise. The final transformation is shearing the image, which slightly stretches
and compresses different aspects of it, exposing the network to different slants.
Since these transformations are not easy to compute and can introduce bugs, a visual utility
was developed with the Qt framework to demonstrate an input vector before and after its
transformation. The tool, seen in Fig. 3.5, allows the user to select image and label files, and
step through the different vectors or input an index to jump to using the number window
on the right and the up and down arrows. It displays the label of the figure in the window
at the top left, and using the right and left arrow boxes the user can seek the next or
previous vector with the same label, allowing the user to quickly browse images of the same
category and see how similar they look. Finally, three check-boxes are available on the right
to select which augmentations should be used. The image is then augmented and displayed
on the right, allowing the user to see the before and after input vectors.
3.12 Input File Parameters
The library is configured via the string passed to its public buildNet function. The main
application can read this string from a file or other input. The file format allows for comments
using the # sign. The parser is case insensitive and ignores white spaces. The expected
format is a key-value pair on every line, separated by tabs or spaces. For example, a line in the
file meant to control the learning rate for training the network may read:
LEARNING_RATE 0.01 # Configuring initial learning rate
Or configuring the file base name to be used for the weights and log can be done using:
File_name MNIST_run_1    # weights will be saved in weights_MNIST_run_1.bin and the output log in log_MNIST_run_1.csv
Certain parameters, however, require a more complex configuration than a key-value pair.
Configuring a convolutional layer, for example, requires specifying the number of maps,
the kernel size, the kernel stride, and the activation function. To do this, we use the word
convolutional as key, and braces are used to provide a block of key-value parameters as
convolutional’s value:
CONVOLUTIONAL {
NUMBER_OF_MAPS 5
KERNEL_SIZE 3
KERNEL_STRIDE 1
ACTIVATION_FUNCTION relu
}
The following are all the parameter options for the configuration file, with sample values:
LEARNING_RATE 0.01             # initial learning rate
BATCH_SIZE 10                  # batch size
MAX_ITERATIONS 2000000         # max number of epochs to train, 0 to disable (stop only when validation score deteriorates by MIN_ERR_DELTA)
MIN_ERR_DELTA 1.20             # current validation error / best validation error ratio to stop training at (to prevent further deterioration)
MOMENTUM_ALPHA 0.9             # momentum alpha, 0 to disable momentum and only use current dE/dw
ADAPTIVE_LEARNING_RATE 1.0     # adaptive learning rate (> 0, <= 1.0), 1 to disable adaptive learning rate
WEIGHT_DECAY 0.0               # weight decay coefficient, 0 to disable weight decay
BP_THRESHOLD 0.9               # backprop threshold: 0 to update weights whenever wrong, 1 to always update weights (i.e. disable this mechanism), 0 < x < 1 to update if confidence < x
MAX_FP_DELAY 1                 # max epochs to go without a forward propagation, 1 to disable this feature
AUTO_NORMALIZATION 2           # normalize input vector: 0 == default (-128, /256), 1 == individual index normalization, 2 == total energy scaling, 3 == off
DERIVATIVE_NORMALIZATION 0     # normalize dE/dw vectors: 0 == off (default), 1 == on
DATA_AUGMENTATION 1            # use library's data augmentation: 0 == off, 1 == on
WEIGHT_NORMALIZATION 0         # normalize weights: 0 == off, otherwise limit to positive value
DROP_OUT 0                     # dropout: 0 == off, 1 == on (only on fully connected layers; 0.5 drop probability for all neurons except the input layer, which uses 0.2)
FILE_NAME demo                 # base file name to use for log and weights
DATA_INPUT {                   # input data size
    DIM_X 28                   # can specify DIM_X, DIM_Y, and NUMBER_OF_MAPS; unspecified dimensions will default to 1
    DIM_Y 28
    NUMBER_OF_MAPS 1
}
CONVOLUTIONAL {                # convolutional layer: number of maps, kernel size, kernel stride, and activation function
    NUMBER_OF_MAPS 5
    KERNEL_SIZE 3
    KERNEL_STRIDE 1
    ACTIVATION_FUNCTION relu
}
POOLING {                      # pooling layer: kernel size and type of pooling
    KERNEL_SIZE 2
    POOL_TYPE max              # max or average, defaults to max pool
}
CONVOLUTIONAL {
    NUMBER_OF_MAPS 10
    KERNEL_SIZE 5
    KERNEL_STRIDE 1
    ACTIVATION_FUNCTION relu
}
POOLING {
    KERNEL_SIZE 3
    POOL_TYPE max              # max or average
}
FULLY_CONNECTED {              # fully connected layer: layer size and activation function (logistic, hypertan, relu, softmax)
    LAYER_SIZE 50
    ACTIVATION_FUNCTION relu
}
FULLY_CONNECTED {
    LAYER_SIZE 10
}
Chapter 4
Selective Backpropagation
4.1 Motivation
The training process for a neural network typically involves forward passing a batch of
inputs, calculating an error, and backpropagating to adjust weights based on this error.
The backpropagation process requires the same amount of computation regardless of how
small or large the error is. In some cases the potential benefit of backpropagating a certain
input may be minimal, and cost-benefit analysis of the potential improvement and the time
required to backpropagate may lead to the conclusion that backpropagating would not yield
a productive improvement compared to backpropagating other vectors. If a certain input
is classified correctly and the confidence of the network is high, meaning the error is small,
then completing all the computation for a backpropagation will cause little change in the
network, and likely offer only a minimal improvement. It would be more beneficial to spend
that training time backpropagating vectors the network performs poorly on.
This may especially play a role when training on datasets where some classes are far more
represented than others. Say a network was trained on a subset of MNIST with only 10
images of the digit 3, 200 images of an 8, and 100 images of every other digit. Then on the
few occasions the network gets to train on a 3, it modifies the weights to better recognize a
3. But if an 8 looks like a 3, and is far more common and therefore backpropagated on, then
at some point the potential benefit of further learning a certain 8 may come at the expense
of worsening performance on recognizing 3’s.
4.2 Procedure
If the network only backpropagates in cases where its performance is unsatisfactory, as de-
fined by certain criteria, then training time needed to arrive at a desired level of performance
can be reduced. In this work, the filter used for deciding whether or not to backpropagate
is stateless, relying only on the output of the forward propagation. A more complex system
could be designed, such as one tracking the overall performance on each class, previous de-
cisions made about a given vector, etc. However, the stateless system is simpler, requiring
less computation and less memory.
If the categorical classification is wrong, or the confidence is below a certain threshold, then the network backpropagates on the given input. Therefore, for the network to decide not to backpropagate following a forward propagation, the classification must have been correct and with a high confidence, as defined by the user. The confidence threshold is set in the configuration file using the BP_THRESHOLD flag, followed by a decimal between 0.0 and 1.0 inclusive. When this parameter is set to 1.0, the filter is effectively disabled and baseline behavior is achieved, since the network will always backpropagate, as confidence is always less than or equal to 1. If the parameter is set to 0, then the network backpropagates only when it was wrong, and never when it was right, since confidence is always greater than 0. For any value in between, the network backpropagates when the classification was wrong (i.e. the highest confidence was not on the correct output) or when the classification was correct but the confidence was lower than the threshold set by the user.
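In code, this filter reduces to a few lines. The sketch below is not the library's implementation; it assumes the forward pass yields per-class confidences (e.g., softmax outputs), and the names are hypothetical:

#include <algorithm>
#include <vector>

// Decide whether to backpropagate a vector, given the forward-pass output.
// confidences: per-class outputs (e.g., softmax), trueLabel: correct class,
// threshold: the BP_THRESHOLD value in [0, 1].
bool shouldBackprop(const std::vector<float>& confidences,
                    int trueLabel, float threshold) {
    auto top = std::max_element(confidences.begin(), confidences.end());
    int predicted = static_cast<int>(top - confidences.begin());
    if (predicted != trueLabel) return true;   // wrong: always backprop
    // Correct but not confident enough. At threshold 1.0 this always holds
    // (the baseline of always backpropagating); at 0 it never does.
    return *top <= threshold;
}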
The results discussed in this thesis, both for MNIST and CASIA, were collected using the same network architectures, displayed in Fig. 4.3 and Fig. 4.13 respectively, unless otherwise specified.
Figure 4.1: Number of Backpropagations vs. Epoch Duration (s) at 0.9 BP filter.
4.3 Results
The experiment was designed to investigate what effect backpropagating only under certain conditions, rather than for every training input in a given epoch, has on the training performance curve. One logical measure, therefore, would be the performance on the testing set as a function of training time. However, training time is nondeterministic and therefore may not be the most accurate measure of performance: the exact same test can be performed on the same system and take a different amount of time to execute. This duration can be influenced by the quantity and nature of other programs running on the system and sharing its resources, as well as by variables such as the operating system, its scheduler, and the host system's resources.
Figure 4.2: Epoch duration during training with BP 1.0 under different environment conditions.
Data discussed here was all collected on the same machine, equipped with a 4.2GHz Intel Core i7-7700K processor and an NVIDIA GeForce GTX 1080 Ti GPU with 11GB of GDDR5X SDRAM, 3584 CUDA cores, and a memory bandwidth of 484GB/s. Additionally, it has 16GB of DDR4-3000 RAM, a 2TB HDD operating at 5,400 RPM, and a 500GB SSD with a SATA 6Gbps interface, and runs Windows 10 Pro.
Graphs in Fig. 4.1 and Fig. 4.2 show several runs of the same network structure on the same data. For a given network structure, weights were always initialized to the same values, since the random generator used to produce initial weights was not seeded for the tests discussed in this thesis and therefore produced the same sequence in every run. In the tests for Fig. 4.1, a small convolutional neural network was trained on MNIST using the following configuration:
LEARNING_RATE 0.01
BATCH_SIZE 10
MAX_ITERATIONS 2000000
MIN_ERR_DELTA 1.20
MOMENTUM_ALPHA 0.9
ADAPTIVE_LEARNING_RATE 1.0   # adaptive learning rate disabled
WEIGHT_DECAY 0.0             # weight decay disabled
BP_THRESHOLD 0.9             # BP if wrong or confidence below 0.9
MAX_FP_DELAY 1               # always forward propagate
AUTO_NORMALIZATION 2         # total energy scaling
DERIVATIVE_NORMALIZATION 0   # no derivative normalization
DATA_AUGMENTATION 1          # data augmentation on
WEIGHT_NORMALIZATION 0       # weight normalization off
DROP_OUT 0                   # drop out off
FILE_NAME MNIST_SingleRunAffinity
DATA_INPUT {
    DIM_X 28
    DIM_Y 28
    NUMBER_OF_MAPS 1
}
CONVOLUTIONAL {
    NUMBER_OF_MAPS 5
    KERNEL_SIZE 3
    KERNEL_STRIDE 1
    ACTIVATION_FUNCTION relu
}
POOLING {
    KERNEL_SIZE 2
    POOL_TYPE max
}
CONVOLUTIONAL {
    NUMBER_OF_MAPS 10
    KERNEL_SIZE 5
    KERNEL_STRIDE 1
    ACTIVATION_FUNCTION relu
}
POOLING {
    KERNEL_SIZE 3
    POOL_TYPE max
}
FULLY_CONNECTED {
    LAYER_SIZE 50
    ACTIVATION_FUNCTION relu
}
FULLY_CONNECTED {
    LAYER_SIZE 10
    ACTIVATION_FUNCTION softmax
}
In Fig. 4.1, since BP_THRESHOLD is not set to 1, the number of BPs varies between epochs, allowing a comparison of the BPs completed and the training durations of different epochs. Since MAX_FP_DELAY is set to 1, every epoch forward propagates 54,000 times, eliminating any unwanted variability. Each epoch performs 54,000 forward propagations because the full training set is composed of 60,000 vectors, of which 10%, or 6,000 vectors, are set aside for validation, leaving 54,000 in the training set.
The only differences between the various runs were in the environment running the tests, including whether the program's affinity was set to run on all the cores (the default) or on a single CPU core (marked as 'affinity' in the key), and whether this run was the only one taking place (aside from background programs on the computer, marked as 'Single run') or other runs were executing simultaneously ('Multiple runs'). Other than these system condition changes, the tests themselves were of the same network structure with identical setup and inputs.
The output from all the runs is therefore identical, except for the training time. For each run, every epoch was captured as a data point, with the backpropagations performed as its X value and the epoch's training duration as its Y value. This graph shows that in different runs of the same network, while the BPs performed in a given epoch are deterministic, the duration is not, although there is a positive linear relationship between the two.
The initial epochs are the ones with the most BPs, starting at 54,000 as every image is backpropagated, and over time the number of BPs performed decreases. This means that chronologically, the first epochs are the ones plotted on the right of Fig. 4.1, and as training continued new epochs were plotted to the left. Noticeably, when the CPU was running only one run at a time, an epoch took approximately 40-45s, as opposed to 70s with multiple simultaneous runs. This decrease in efficiency can be attributed to the CPU being quad-core: instead of the program having a core entirely to itself, as in the single-run captures, four cores are being shared between 7 runs. The slowdown is not quite a factor of two, however, which is likely due to optimizations in the system, primarily Intel's hyper-threading in the i7-7700K CPU. There is also some variation visible between different runs where the platform was under similar workload, such as the three single runs. 'Single run' and 'Single run 2' had no known differences in their setup, yet 'Single run' shows noise when BPs are below 5,000 and has a slightly larger slope, meaning BPs took slightly longer. 'Single run with affinity' appears to take slightly less time than 'Single run 2', showing that setting the affinity can yield a slight improvement in this case, likely thanks to more cache hits. Some of the runs exhibit noise for epochs with few BPs. This is not consistent across all runs and is likely due to the OS scheduler. It was observed, for example, that printing to the console every so often during training decreased run time per epoch, supporting the hypothesis that the scheduler may prioritize certain types of routines over others in a manner that causes this noise in the durations of short epochs. This noise affects the R² values of the plots, as one may expect. For example, 'Single Run' has much more noise up to 5,600 BPs, and because of this its R² is only 0.727, compared to 'Single Run 2', an identical run that did not show such noise for low BP counts and has an R² value of 0.997, much closer to its fitted regression line.
In Fig. 4.2, the only change made from Fig. 4.1 was setting BP_THRESHOLD to 1 instead of 0.9, meaning the network always backpropagates. In this scenario, since the network always computes the same number of forward propagations and backpropagations, there is no variation in the amount of computation performed per epoch, and therefore the runtime of the different epochs should be constant. This is the observed behavior in each run, as seen in the graph. However, running multiple runs simultaneously does increase the amount of time needed to complete a single epoch, nearly doubling the epoch duration from around 41s to about 75s. Additionally, with multiple runs happening in the background, the epoch duration is not nearly as constant, lasting as long as 88s, while the single run shows far less variation in epoch durations. The plot of multiple simultaneous runs has a standard deviation of 2.125s and an R² value of 5.883 · 10⁻⁵, while the single run has a standard deviation of 0.321s and an R² value of 0.211.
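The thesis does not show how these regression statistics were computed; a standard ordinary least-squares fit over the (BPs, duration) points, as sketched below, yields the kind of R² values quoted here:

#include <cstddef>
#include <vector>

// Ordinary least-squares fit y = a*x + b over paired samples, returning R^2.
// Here x would be the BPs performed in an epoch and y that epoch's duration (s).
double rSquared(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    const double a = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // slope
    const double b = (sy - a * sx) / n;                          // intercept
    double ssRes = 0, ssTot = 0;
    const double mean = sy / n;
    for (std::size_t i = 0; i < n; ++i) {
        const double r = y[i] - (a * x[i] + b);  // residual from the fit
        ssRes += r * r;
        ssTot += (y[i] - mean) * (y[i] - mean);
    }
    return 1.0 - ssRes / ssTot;  // R^2: 1.0 == perfect linear fit
}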
Due to these observations, it can be noted that there is a high correlation between epoch
duration and BPs performed, and while measurements of epoch duration are nondetermin-
istic, a count of BPs performed in a given epoch is deterministic and therefore serves as a
better measurement of the effectiveness of the experiments described here.
Figure 4.3: Architecture of the CNN used to classify MNIST.
4.3.1 MNIST
The results discussed were obtained by training the network illustrated in Fig. 4.3.
Fig. 4.4 shows the results of training this neural network on MNIST with different selective BP threshold values. The learning rate used for these runs was 0.1, the batch size 10, and the momentum 0.9. Inputs were normalized and augmented as described in Ch. 3. Fig. 4.4 plots the testing accuracy from these runs as a function of epochs passed. The various curves show no significant variation between them: the baseline plateaus at 98.75%, the others land around it between 98.6% and 99.0% with oscillations of around 0.1%, and the curves overlap each other. This shows that ignoring certain inputs does not cause an overall decrease in performance. Additionally, none of the curves show signs of overlearning, which might have been a concern given the reduced exposure of the network to inputs; this may be attributed to the augmentation of inputs. If overlearning were occurring, the testing performance would eventually begin to decrease after reaching its peak, but instead it appears to plateau.
Figure 4.4: MNIST testing accuracy with different BP thresholds. When plotting accuracy as a function of epochs passed, the curves show no significant variation between them.
Fig. 4.5 shows the same runs as Fig. 4.4, plotted as a function of BPs performed rather than epochs. As discussed earlier, this is a good indication of training time. The graph
shows that when the selective BP threshold is set to 1.0, i.e. always backpropagate, which
is the baseline for these experiments, the curve takes longer to achieve the same testing
classification accuracy as when the BP threshold is reduced. After approximately 15M BPs,
this curve catches up to the others’ performance, but this is long after they plateau, and
the 1.0 BP threshold does not exceed the other curves’ performance. The initial part of
the graph is zoomed in on in Fig. 4.6, where it is clear that all the curves plateau after 1M
to 2.5M backpropagations, at a performance that the baseline begins to reach around 15M.
In these runs, reducing the BP threshold from 1.0 to a value between 0.4 and 0.9 reduces the BPs performed to 6.67%-16.67% of the baseline. In these figures, a data point was sampled at the end of every epoch. In runs where the BP threshold is not set to 1.0, epochs are composed of fewer BPs, so completing a given number of BPs takes more epochs and thus contains more sample points. This is why graphs with BP ≠ 1.0 appear thicker.
Figure 4.5: MNIST testing accuracy with different BP thresholds. BP 1.0, the baseline of always backpropagating, takes longer (more BPs) to achieve the same performance as the other curves on the graph. It catches up, but does not show better performance in the long run than runs with a lower BP threshold.
Figure 4.6: Zooming in on the initial relevant section of MNIST testing accuracy with different BP thresholds. BP 1.0, the baseline of always backpropagating, initially significantly underperforms runs that selectively backpropagate, and takes time to catch up to them after they plateau.
Figure 4.7: Plotting both the performance on the testing set and the number of BPs performed in each epoch shows the rapid decrease in BPs performed, dropping below 10,000 after only 7 epochs from the initial 54,000, which would remain constant for a baseline test. By that point, performance reaches 96.04% on the testing set. It goes on to pass 98.7%, with the majority of that time spent completing fewer than 5,000 BPs per epoch.
Fig. 4.7 shows the number of backpropagations computed when the threshold is set to 0.7,
along with the performance on the testing set, both as a function of epochs completed. This graph shows the immediate and rapid decrease in BPs performed during training. For a baseline experiment, the BPs curve would be constant at 54,000. Instead, the 0.7 BP threshold decreases in BPs immediately, dropping below 10,000 BPs/epoch after only 7 epochs, when accuracy is at 96.04%, and continuing to drop, ultimately reaching below 500 BPs/epoch. The accuracy achieved on the testing set continues to improve as fewer BPs are completed, reaching 99% for the first time by epoch 355, at which point only around 1,300 BPs are completed per epoch.
Figure 4.8: Testing accuracy over BPs on disproportional MNIST dataset.
Disproportional Dataset
One of the advantages of employing this technique is that it essentially closes the feedback loop on which vectors should be reinforced and which the network is already good at classifying. When an effort is put into making sure datasets represent classes equally, this is less necessary, but that can be hard to do under real-world conditions. For example, when trying to create a dataset to help predict rare events such as earthquakes, medical emergencies, or cyber attacks, data for when the event is occurring may be much more difficult to collect in large quantities than data for when the event is not happening. In these scenarios, imbalanced datasets may be generated. The most common methods to train on such imbalanced datasets include down-sizing, or under-sampling, the more common classes, or alternatively over-sampling the underrepresented classes [64, 65, 66, 67]. This behavior may also be valuable in applications where the network is being trained on live data as it is being collected, rather than on a dataset composed offline, since in such scenarios it is not possible to ensure equal representation of the different classes.
MNIST is composed of 60,000 training images, 6,000 for each class, and a testing set of 10,000 images, 1,000 for each label. To test less ideal conditions, a subset of the MNIST training set was created with an unequal distribution of labels: 100% of images with labels 0-8 were kept, but only 1% of images labeled 9 were randomly selected for the training set. The testing set was not altered, so as to measure the true performance of the network. The network structure discussed earlier was then trained with this new dataset. The learning curves with and without the selective BP algorithm are plotted in Fig. 4.8, which shows BP 0.7 first passing 96% accuracy on the unbiased testing set after 372k BPs, compared to BP 1.0, which takes 3.6M BPs, or 9.68 times as many (an 89.67% improvement), before they plateau to similar final results.
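The exact selection procedure is not listed in the thesis; a minimal sketch of constructing such a subset, keeping every image labeled 0-8 and a random 1% of the 9's (the names here are assumed), might be:

#include <random>
#include <vector>

struct Sample {
    std::vector<float> pixels;
    int label;
};

// Keep all samples except those of class `rareLabel`, of which only
// `keepFraction` (e.g., 0.01 for 1%) are randomly retained.
std::vector<Sample> makeImbalanced(const std::vector<Sample>& data,
                                   int rareLabel, double keepFraction,
                                   std::mt19937& rng) {
    std::bernoulli_distribution keep(keepFraction);
    std::vector<Sample> out;
    for (const auto& s : data)
        if (s.label != rareLabel || keep(rng))
            out.push_back(s);
    return out;
}

// Usage: makeImbalanced(mnistTrain, /*rareLabel=*/9, /*keepFraction=*/0.01, rng);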
The overall improvement in the time it takes to train the network does not paint the full picture, however. Since the training set was biased to include far fewer pictures of 9's than of any other class, the distribution of errors between the classes offers valuable insight. Fig. 4.9 shows the error matrix early in the training process for each of the two runs. The Y axis marks the true class, the X axis marks the neural network's classification, and the Z axis marks the number of vectors from the testing set that were classified incorrectly. On the diagonal, where the X value equals the Y value, the errors are always 0, of course, since there the NN classification is the true classification and therefore cannot be an error. The two subfigures show a similar distribution of errors, and it is clear that the overwhelming majority of errors in both tests are in class 9, which is to be expected as the network had far fewer examples of 9's to study. The most common error by far is classifying a picture of a 9 as a 4, an understandable mistake given their similar shape; the second most common is classifying a 9 as a 7, another error that makes sense when the 7 is written with a horizontal line crossing its center. Fig. 4.10 shows the sum of errors in each classification for the two runs, and clearly shows, similarly to Fig. 4.9, that BP 0.7 made fewer than half as many mistakes as BP 1.0. This agrees with Fig. 4.8 in showing that using the BP threshold results in training converging much faster than when backpropagating every input.
(a) Disproportional MNIST error matrix early in the training process, with 0.7 BP threshold.
(b) Disproportional MNIST error matrix early in the training process, with 1.0 BP threshold.
Figure 4.9: Error matrices early in the training process, training a neural network on a disproportional MNIST dataset.
Figure 4.10: Histogram of total misses per class early in the training process.
As Fig. 4.8 shows, however, if the network is given enough training time, the two methods converge to similar performance. Fig. 4.11 shows a similar error matrix evaluated after the results plateau, and shows that errors still came overwhelmingly from misclassified 9's. While BP 0.7 seems to have improved at differentiating 4's and 9's, reducing that type of error by more than half, it made little improvement in telling apart an 8 from a 9, and BP 1.0's primary mistake is still classifying 9's as 4's; overall, the two networks' performance is equal, as is also seen in Fig. 4.12.
(a) Disproportional MNIST error matrix at the end of the training process, with 0.7 BP threshold.
(b) Disproportional MNIST error matrix at the end of the training process, with 1.0 BP threshold.
Figure 4.11: Error matrices at the end of the training process, training a neural network on a disproportional MNIST dataset.
Figure 4.12: Histogram of total misses per class at the end of the training process.
While this summation does show BP 0.7
slightly outperforming BP 1.0 in classifying 9's, this difference is likely negligible and could probably be reversed, or at least further reduced, if training continued for another couple of epochs. The fact that the BP 1.0 network was still able to converge at over 96% accuracy, compared to around 98.8% achieved when training on the regular set, is a rather impressive feat that speaks to the ability of convolutional networks. It is possible that if this system were pushed further to its limits, with a smaller network with fewer weights and/or a more extreme dataset, not only would the difference in speed to reach peak performance widen between the two, but the non-1.0 BP threshold would potentially outperform the 1.0 BP threshold. Additionally, it is likely that with more variety in the augmentation, either as additional forms of augmentation or a larger range of parameters (e.g., larger rotations), the peak performance achieved could be even higher than that achieved here.
Figure 4.13: Architecture of the CNN used to classify CASIA.
Figure 4.14: CASIA testing accuracy with different BP thresholds. When plotting accuracy as a function of epochs passed, the curves show no significant variation between them.
4.3.2 CASIA
The same tests described above were performed on the CASIA dataset described in Section 2.2, using the network architecture shown in Fig. 4.13. Similarly to the MNIST performance, Fig. 4.14 shows that results plotted over epochs exhibit no significant variation in performance. However, epochs composed of fewer BPs complete faster. Fig. 4.15 shows the performance of these runs on the testing set, plotted as a function of BPs performed instead of epochs. This graph shows the 1.0 BP threshold curve lagging behind the other curves, with the gap appearing to grow as training continues. Fig. 4.16 offers a closer look at the relevant section of this graph. It can be seen that while the curves with BP thresholds below 1.0 cross 90% accuracy after 1M BPs, the baseline curve does so only after 1.64M BPs. Additionally, while the 1.0 BP threshold curve only reaches 92% accuracy for the first time after 2.88M BPs, the other curves reach that performance after 1.1M to 1.3M BPs, showing a 54.86%-61.81% decrease in BPs performed to achieve that level of accuracy on the testing set. Furthermore, the trend lines appear to have an expanding gap between them, indicating that this margin would likely continue to grow with further BPs. This data agrees with the overall performance seen on MNIST in Section 4.3.1, showing that any BP threshold below 1.0 outperforms the baseline in training time needed to achieve a given level of accuracy. While the extent of the time improvement is a function of the threshold and differs across the two datasets, this shows that the algorithm reduces training time on small problems such as MNIST, with 28x28 inputs and only 10 outputs, as well as on bigger networks such as the one used here for CASIA, with 48x48 inputs and 100 outputs.
Figure 4.15: CASIA testing accuracy with different BP thresholds. BP 1.0, the baseline of always backpropagating, lags behind all other curves, taking at least 2.2 times as long to reach 92% testing accuracy.
Figure 4.16: Zooming in on the relevant section of CASIA testing accuracy with different BP thresholds. BP 1.0, the baseline of always backpropagating, significantly underperforms runs that selectively backpropagate, and shows no sign of catching up to the rate at which they improve.
Chapter 5
Selective Forward Propagation
5.1 Motivation
Chapter 4 discusses backpropagating only when the network is wrong or has low confidence. Doing so avoids computation that would alter the network only minimally, and can reduce the computation on a given input by approximately one half. However, there is still potential to further reduce training time. An argument could be made that when an input vector is forward propagated but not backpropagated, the state of the network remains unchanged and therefore the forward propagation is redundant. If there were a way to predict when an input vector will be classified correctly and not backpropagated, then those forward propagations could be avoided. This could reduce training time by a much larger margin, and the training time saved on those forward propagations can be better used on the more challenging inputs. The reason this can potentially reduce training time is as follows: normally, for N vectors in a training set, an epoch would cost

E_c = N · FP_c + N · BP_c    (5.1)
where E_c, FP_c, and BP_c represent the computation required for a single epoch, a forward propagation, and a backpropagation, respectively. When the BP threshold was reduced from 1.0 in Chapter 4, this became

E_c = N · FP_c + a · N · BP_c    (5.2)

for some a ≤ 1.0, where a represents the fraction of forward-propagated vectors that were then also backpropagated. Introducing the filtered FP, however, gives

E_c = b · N · FP_c + a · b · N · BP_c    (5.3)

for some a, b ≤ 1.0, where b represents the fraction of the entire set that was forward propagated, and therefore considered for backpropagation. The reduction offered by introducing b thus lowers both the number of BPs and the number of FPs performed, unlike before, when a only affected the number of BPs. Consider, for instance, a massive training set where only a few vectors are still misclassified. If the network always forward propagates and only selectively backpropagates, E_c approaches N · FP_c. If, however, the FP estimation works well, the cost may be reduced even further by also skipping FPs, since forward propagation would by then take up the majority of training time.
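For intuition, consider assumed illustrative values of a = 0.1 (10% of forward-propagated vectors are also backpropagated) and b = 0.3 (30% of the set is forward propagated), and take FP_c ≈ BP_c ≈ C:

\[
E_c = b N \, FP_c + a b N \, BP_c = 0.3\,NC + 0.03\,NC = 0.33\,NC,
\]

compared to $2NC$ for the baseline of Eq. 5.1 and $1.1\,NC$ for selective BP alone in Eq. 5.2, roughly a 6x and 3.3x reduction respectively.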
It is important to note that when a backpropagation is skipped, there is definitive proof that the network is well trained on the given vector, hence reason to believe that backpropagating it would not be worth the expected return on investment of training time and resources compared to other potential vectors. In the case of skipping a forward propagation, however, the decision is based solely on a prediction. Additionally, in certain scenarios, skipping backpropagation may offer an improvement in absolute performance compared to always backpropagating, such as when many inputs of the same class move the network weights away from the training completed by a much less frequent class. Therefore, the method discussed in Chapter 4 may improve both training time and, potentially, final performance on a test set. Avoiding forward propagating certain input vectors based on predictions offers no further potential improvement of classification performance over employing the BP filter alone, but does offer such improvement over the baseline of always forward propagating and backpropagating. It also offers a potentially greater reduction of training time than always forward propagating but selectively backpropagating, which was already shown to reduce training time over the baseline. These scenarios are summarized in Table 5.1.
Table 5.1: Analysis of BP and FP combinations, showing potential benefits in performance and time reduction.
Always BP, Always FP: the standard method (baseline).
Always BP, Selective FP: better potential time improvement; minimal potential peak-performance improvement.
Selective BP, Always FP: some potential time improvement; better potential peak-performance improvement (discussed in Chapter 4).
Selective BP, Selective FP: best potential time improvement; peak performance same as Selective BP with Always FP.
5.2 Procedure
Similarly to Chapter 4, the confidence from forward propagating a given vector is used here to determine whether future computation can be reduced. The model implemented here determines, for each training input, how many epochs can likely be skipped before the given input vector needs to be re-examined. This requires storing N values, where N is the number of training vectors. Initially, D_n, the delay for vector n, measured in epochs, is set to 0, so all vectors forward propagate at least once. Then, based on their performance, a delay can be calculated. Two variables control this. First, a maximum FP delay (in epochs) is defined as the maximum number of epochs that may pass before a vector is re-examined. Additionally, an FP threshold fraction is defined, similarly to the BP threshold: if the confidence is below the threshold, the delay is set so the vector forward propagates again during the next epoch, since performance on it is currently unsatisfactory. The same happens if the classification is incorrect, regardless of the confidence. If the classification is correct and the confidence is above the threshold, then the delay is calculated as

D_n = D_max · (C_n − T_fp) / (1 − T_fp)    (5.4)
where D_n is the new delay for vector n, D_max is the maximum FP delay measured in epochs, C_n is the confidence on vector n, which was correctly classified, and T_fp is the FP threshold fraction. After the first epoch, in which all training vectors were propagated, some vectors may have delays greater than 1. When training, if the network encounters such a vector, its delay is decremented by one and it is skipped. When the network encounters a vector whose delay is 1, it forward propagates it again, calculates a new delay, and backpropagates it if the appropriate conditions are met.
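A minimal sketch of this delay update per Eq. 5.4 follows; the function name is assumed, and maxDelay and fpThreshold correspond to MAX_FP_DELAY and the FP threshold fraction:

#include <algorithm>
#include <cmath>

// Compute the FP delay (in epochs) for a vector after a forward pass, per
// Eq. 5.4: D_n = D_max * (C_n - T_fp) / (1 - T_fp). A wrong classification
// or a confidence below T_fp yields a delay of 1, i.e. the vector is
// re-examined in the very next epoch.
int fpDelay(bool correct, double confidence, int maxDelay, double fpThreshold) {
    if (!correct || confidence < fpThreshold)
        return 1;
    double d = maxDelay * (confidence - fpThreshold) / (1.0 - fpThreshold);
    return std::max(1, static_cast<int>(std::lround(d)));
}

For example, with D_max = 15 and T_fp = 0.5, a correct classification at confidence 0.9 yields a delay of 15 · 0.4 / 0.5 = 12 epochs before re-examination.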
A baseline criterion for this setup, therefore, is setting D_max to 1, meaning every vector is re-examined at the next epoch, so no vector is ever skipped. When this is the case, T_fp becomes irrelevant. By defining some finite D_max, however, we ensure that every vector will eventually be forward propagated again and is not taken out of the training pool after being classified correctly once, since the weights continue to change and in the future the vector may no longer be classified correctly, or may yield a lower confidence.
A potential issue with this algorithm is its clash with data augmentation. Data augmentation creates new training vectors from existing ones in order to expand the variety of inputs the network is exposed to during training. However, when a new delay is calculated, it is based on the particular input that was propagated, not on other augmentations of it that may be produced in the future and result in different classifications or levels of confidence. This means that estimating a delay based only on the confidence on a single vector may be less effective in training routines with augmentation.
Another point to consider is that delays calculated in epochs represent how many epochs may pass before a vector is forward propagated again. However, depending on the size of the training set, the batch size, and how many input vectors result in a forward propagation and/or backpropagation during those epochs, the same number of epochs may correspond to a different number of weight updates in different applications. This could be addressed by calculating the delay not in epochs but in the number of weight updates that may be skipped before re-examining the vector.
When a vector is skipped during training, a forward propagation is avoided as well as a backpropagation, hence the increased value in not forward propagating. If the BP threshold is not set to 1.0 while the FP selection algorithm is enabled, then some vectors may be forward propagated and backpropagated, some may be avoided entirely, and some may be forward propagated but not backpropagated. The baseline in these experiments is always forward propagating and always backpropagating. Chapter 4 introduced and examined the scenario where all vectors are forward propagated and only some backpropagated. This chapter studies the scenario of forward propagating only some vectors but backpropagating all of those, in order to measure this algorithm separately from the work discussed in Ch. 4. The two algorithms can, however, be used jointly, as the sketch below illustrates.
Figure 5.1: Number of Forward Propagations vs. Epoch Duration (s) at FP threshold 0.5 and max delay 15, showing a positive linear relationship between FPs and epoch duration.
5.3 Results
Similarly to Fig. 4.1, which showed the relationship between BPs and epoch duration, Fig. 5.1 shows that the duration in seconds of each epoch of a training routine corresponds linearly to the forward propagations performed in that epoch, with an R² value of 0.993. Following the same reasoning as before, data plotted in this chapter is also shown as a function of propagations completed, rather than time.
5.3.1 MNIST
Fig. 5.2 shows the testing accuracy curves for different FP max delays while training on the MNIST dataset. An FP max delay of 1, in this case, means every vector is always forward propagated, and as such serves as the baseline. It can be seen that the plots plateau around 98.6% accuracy. Zooming in on the initial section of this plot in Fig. 5.3, the FP 1 curve is seen lagging behind all other curves. While those pass 98% testing accuracy after 451k-775k FPs and BPs, the FP 1 curve does so only after 1.664M FPs and BPs. The FP plots shown display a reduction of FPs (and therefore also BPs) of as much as 73%.
Figure 5.2: Testing accuracy over FPs on the standard MNIST dataset with different FP max delays. In all runs, the FP threshold is set to 0.5. Over time, all runs converge to approximately the same level of performance.
Figure 5.3: Zooming in on the initial relevant section of MNIST testing accuracy with different FP max delays, with the FP threshold set to 0.5. FP 1, the baseline of always forward propagating and always backpropagating, lags behind the other curves.
Figure 5.4: Training on different subsets of MNIST shows that the selective BP and FP algorithms result in increased generalization variability over the baseline.
Generalization
Fig. 5.4 shows the variation in generalization when using the selective BP and FP algorithms. The MNIST training set of 60,000 images was split into 10 subsets of 6,000 images each. Of those, 5,400 were used for training and 600 to objectively determine when to stop training. The network was then used to measure performance on the independent set of 10,000 testing images. This way, 10 data points were generated for each threshold value. The plot shows that the baseline, where the BP threshold is 1.0 and the FP D_max is 1 (i.e., always forward propagating and always backpropagating), has less variability than the other configurations, with a standard deviation of 0.184, while the others range from 0.198 (FP D_max 5) to 0.344 (FP D_max 3).
5.3.2 CASIA
The same experiment of varying the maximum FP delay was conducted on the CASIA subset described in Section 2.2. The results are plotted in Fig. 5.5, which shows the plots all plateauing around 92.5% accuracy. The key specifies both the FP threshold and the FP max delay, so FP0.8,5, for example, means the FP threshold is set to 0.8 and the FP max delay to 5. Fig. 5.6 shows a closeup of the critical part of this graph, where the baseline, FP 0.8,1, is seen lagging behind the other curves before they all plateau. Specifically, this curve passes 91% testing accuracy after 2.47M FPs, while the other curves do so between 1.47M and 2.0M FPs, showing a reduction of 19.0%-40.5% in FPs and BPs performed.
Figure 5.5: Testing accuracy over FPs on the CASIA subset dataset with various FP max delays, with the FP threshold set to 0.5. Over time, runs converge to approximately the same performance on the testing set.
Figure 5.6: Zooming in on the initial relevant section of Fig. 5.5. FP max delay of 1, the baseline, is seen lagging behind the other curves before they all plateau.
Chapter 6
Future Work
6.1 Future Work
The work discussed shows promise, but warrants further study of possible improvements, and there is a multitude of scenarios still to be examined.
The selective backpropagation and forward propagation algorithms are designed to accelerate the rate at which the network trains, and were tested on CNNs trained from random initial weights. These algorithms, however, can also be applied to other architectures that use backpropagation and forward propagation in their training, such as recurrent neural networks. They may also be beneficial for transfer learning. These scenarios have not been tested and warrant future work. The algorithms can also be tested on additional datasets, including problems outside image classification.
In the tests performed, the selective thresholds for FP and BP were kept constant. However, many elements and variables of neural network training are modified during the training process, including pruning and growing networks [60, 68, 69], altering the learning rate [70, 71], and using dropout [29] and weight decay [58, 72, 73]. Similarly, it may be possible to gain a higher level of testing accuracy by altering the BP and FP thresholds during training, for example by using the algorithms for the first 50 epochs, or until the learning curve appears to plateau, and then changing both thresholds to 1.0, thus reverting to 'normal' training.
More work can be done to further examine the improvement these algorithms offer. This includes further fine-tuning of adjustable variables (e.g. momentum, learning rate, weight decay), as well as applying these methods to additional datasets. Additionally, testing different-sized networks with these algorithms could yield interesting results. Bigger (and deeper) networks will likely achieve better results, but employing these algorithms may offer a way to train smaller networks to the same performance by essentially stress-testing how many weights are needed to reach a certain level of performance. This may be valuable, for example, in embedded applications with limited memory and computational resources. The work in this thesis tested each algorithm separately, so as to independently assess the improvement offered by each; however, they can also be combined for a further reduction of training time.
Other algorithms could be designed based on the principles discussed in this thesis. While the work here relied largely on the confidence of a classification to determine whether to backpropagate, or when to re-examine a given vector, other methods of making these decisions could be implemented, including gathering statistics on how well a given category is being classified, how long the network has been training, and other information available during the training routine. Creating a more robust estimation system using more parameters may also improve performance on augmented vectors, as discussed in Section 5.2. As explained there, measuring the FP delay in terms of weight updates rather than epochs may also improve the FP estimation, regardless of whether the data is augmented. In fact, the method that would yield the best results across a broad range of applications may be not to hand-design these BP and FP filters at all, but rather to train an ML algorithm, such as a small neural network, to look at all these parameters and decide whether to backpropagate, or how long to wait before forward propagating again.
Finally, an additional undertaking would be to implement these algorithms in a framework capable of running on a GPU. As discussed in Section 3.1, this was the initial intention for this work but proved more challenging than anticipated, so the algorithms were first implemented in a serial batch fashion to study their potential impact. Now that this work has been done, and it has been observed that these selective algorithms reduce the computation required in training without impacting peak performance levels, further work can go into integrating them into a GPU-based framework.
Chapter 7
Conclusions
7.1 Conclusions
Neural network training can take days or even weeks. This thesis discusses altering the training routine in order to reduce the time to train a NN, with some impact to the generalization variability. The modifications proposed are twofold. The first, discussed in Ch. 4, is to backpropagate on a given vector only if it was classified incorrectly or if the confidence was below a certain threshold, as opposed to the common method of backpropagating every vector. The reasoning behind this change is that the network already performs well on such a vector and would see a greater improvement from spending that training time backpropagating other vectors on which it does not perform as well. This offers a closed feedback loop that may especially facilitate training on datasets that do not have an equal representation of all classes, or where some classes prove harder to classify than others.
Table 7.1: Summary of time improvements achieved with selective BP and selective FP, including on the modified imbalanced MNIST dataset.
Test              | % Accuracy at comparison | Baseline propagations | Modified propagations | % Reduction in propagations
BP MNIST          | 98.8% | 15M    | 1-2.5M    | 83.3-93.3%
BP Modified MNIST | 96.0% | 3.6M   | 372k      | 89.67%
BP CASIA          | 92.0% | 2.88M  | 1.1-1.3M  | 54.86-61.81%
FP MNIST          | 98.0% | 1.664M | 451-775k  | 53.43-72.90%
FP CASIA          | 91%   | 2.47M  | 1.47-2.0M | 19.03-40.49%
The second algorithm
proposed is discussed in Ch. 5, and involves predicting when a certain vector should be forward propagated again. This idea stems from the fact that with the selective BP algorithm, when a vector skips a BP, it does not change the state of the network, and hence the FP that takes place in order to decide whether or not to BP spends training time without improving the NN. By predicting how many epochs will likely pass before the specific vector would be backpropagated again, the network can skip its forward propagations until that time, avoiding both the FP and BP time, which can be better spent on other inputs. Both algorithms were tested on the MNIST and CASIA datasets. The BP algorithm was also tested on a modified MNIST dataset in which not all classes are equally represented in the training set, and the CASIA dataset was used to create a subset of 100 labels so as to reduce data collection time. Results from all of these tests are analyzed in the appropriate chapters and collectively summarized in Table 7.1. The BP algorithm showed an 83.3-93.3% reduction in backpropagations completed to achieve a given level of accuracy on the classic MNIST dataset, an 89.67% reduction on the modified MNIST dataset, and a 54.86-61.81% reduction on the CASIA subset. The selective FP algorithm, in which every avoided FP also avoids a BP, showed a reduction in propagations completed of 53.43-72.90% on the MNIST dataset and 19.03-40.49% on CASIA.
Bibliography
[1] D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber, “Mitosis Detection
in Breast Cancer Histology Images using Deep Neural Networks,” Proc Medical Image
Computing and Computer-Assisted Intervention (MICCAI), pp. 411–418, 2013.
[2] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun,
“Dermatologist-level classification of skin cancer with deep neural networks,” Nature,
vol. 542, no. 7639, pp. 115–118, 2017.
[3] Waymo, “On the road to Fully Self-driving,” Waymo Safety Report, p. 43, 2017.
[4] “Tesla Autopilot.” https://www.tesla.com/autopilot, 2016.
[5] Y. Tian, K. Pei, S. Jana, and B. Ray, “DeepTest: Automated Testing of Deep-Neural-
Network-driven Autonomous Cars,” 2017.
[6] D. C. Ciresan, U. Meier, and J. Schmidhuber, "Transfer Learning for Latin and Chinese
Characters with Deep Neural Networks," in The 2012 International Joint Conference on Neural Networks (IJCNN), pp. 1–6, 2012.
[7] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet, “Multi-digit Number
Recognition from Street View Imagery using Deep Convolutional Neural Networks,”
pp. 1–13, 2013.
[8] A. Buczak and E. Guven, “A survey of data mining and machine learning methods
for cyber security intrusion detection,” IEEE Communications Surveys & Tutorials,
vol. PP, no. 99, p. 1, 2015.
[9] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, “Network Anomaly Detection:
Methods, Systems and Tools,” Communications Surveys & Tutorials, IEEE, vol. 16,
no. 1, pp. 303–336, 2014.
[10] Y. Taigman, M. Yang, and M. Ranzato, "Deepface: Closing the gap to human-level
performance in face verification," CVPR IEEE Conference, pp. 1701–1708, 2014.
[11] C. Clancy, J. Hecker, E. Stuntebeck, and T. O’Shea, “Applications of Machine Learning
to Cognitive Radio Networks,” IEEE Wireless Communications, vol. 14, no. 4, pp. 47–
52, 2007.
[12] T. Yucek and H. Arslan, "A Survey of Spectrum Sensing Algorithms for Cognitive
Radio Applications," Proceedings of the IEEE, vol. 97, no. 5, pp. 805–823, 2009.
[13] M. Bkassiny, Y. Li, and S. K. Jayaweera, “A survey on machine-learning techniques in
cognitive radios,” IEEE Communications Surveys and Tutorials, vol. 15, no. 3, pp. 1136–
1159, 2013.
[14] D. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for im-
age classification,” in Computer Vision and Pattern Recognition (CVPR), no. February,
pp. 3642–3649, 2012.
[15] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-
level performance on imagenet classification,” Proceedings of the IEEE International
Conference on Computer Vision, vol. 2015 Inter, pp. 1026–1034, 2015.
[16] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, “Deep Big Simple
Neural Nets Excel on Handwritten Digit Recognition,” pp. 1–14, 2010.
[17] K. Ovtcharov, O. Ruwase, J.-y. Kim, J. Fowers, K. Strauss, and E. S. Chung, “Accel-
erating Deep Convolutional Neural Networks Using Specialized Hardware,” Microsoft
Research Whitepaper, pp. 3–6, 2015.
[18] A. Ling, D. Capalija, and G. Chiu, “Accelerating Deep Learning with the OpenCL
Platform and Intel Stratix 10 FPGAs,” tech. rep., Intel, 2015.
[19] Jon Peddie Research, "GPU Developments 2017," tech. rep., 2018.
[20] N. P. Jouppi, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley,
M. Dau, J. Dean, B. Gelb, C. Young, T. V. Ghaemmaghami, R. Gottipati, W. Gulland,
R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, N. Patil, A. Jaf-
fey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy,
J. Laudon, J. Law, D. Patterson, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacK-
ean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, G. Agrawal, R. Narayanaswami,
R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross,
A. Salek, R. Bajwa, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter,
D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, S. Bates, H. Toma, E. Tuttle,
V. Vasudevan, R. Walter, W. Wang, E. Wilcox, D. H. Yoon, S. Bhatia, and N. Boden,
“In-Datacenter Performance Analysis of a Tensor Processing Unit,” ACM SIGARCH
Computer Architecture News, vol. 45, no. 2, pp. 1–12, 2017.
[21] A. P. Engelbrecht, “Sensitivity analysis for selective learning by feedforward neural
networks,” Fundamenta Informaticae, vol. 46, no. 3, pp. 219–252, 2001.
[22] M. T. Vakil-Baghmisheh and N. Pavesic, “Training RBF networks with selective back-
propagation,” Neurocomputing, vol. 62, no. 1-4, pp. 39–64, 2004.
[23] M. P. Craven, “A Faster Learning Neural Network Classifier Using Selective Back-
propagation,” Proceedings of the Fourth IEEE International Conference on Electronics,
Circuits and Systems, vol. 1, pp. 254–258, 1997.
[24] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, "Focal Loss for Dense Object
Detection," Proceedings of the IEEE International Conference on Computer Vision,
vol. 2017-October, pp. 2999–3007, 2017.
[25] A. Shrivastava, A. Gupta, and R. Girshick, “Training Region-based Object Detectors
with Online Hard Example Mining,” 2016.
[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to
document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2323, 1998.
[27] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010.
[28] I. Witten, E. Frank, M. Hall, and C. Pal, “Data mining: Practical machine learning
tools and techniques,” 2016.
[29] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout:
A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine
Learning Research, vol. 15, pp. 1929–1958, 2014.
[30] Q. V. Le, N. Jaitly, and G. E. Hinton, “A Simple Way to Initialize Recurrent Networks
of Rectified Linear Units,” pp. 1–9, 2015.
[31] C. L. Liu, F. Yin, D. H. Wang, and Q. F. Wang, “CASIA online and offline Chi-
nese handwriting databases,” Proceedings of the International Conference on Document
Analysis and Recognition, ICDAR, pp. 37–41, 2011.
[32] Dalbir and S. K. Singh, “Review of Online & Offline Character Recognition,” Interna-
tional Journal Of Engineering And Computer Science, vol. 4, no. 5, pp. 11729–11732,
2015.
[33] C. L. Liu, F. Yin, D. H. Wang, and Q. F. Wang, “Chinese handwriting recognition con-
test 2010,” 2010 Chinese Conference on Pattern Recognition, CCPR 2010 - Proceedings,
no. November, pp. 1100–1104, 2010.
[34] D. Ciresan and J. Schmidhuber, "Multi-Column Deep Neural Networks for Offline Hand-
written Chinese Character Classification," 2013.
[35] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado,
A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Is-
ard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga,
S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Tal-
war, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wat-
tenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning
on heterogeneous systems,” 2015.
[36] F. Chollet and others, “Keras.” https://keras.io, 2015.
[37] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama,
and T. Darrell, “Caffe: Convolutional Architecture for Fast Feature Embedding,” arXiv
preprint arXiv:1408.5093, 2014.
[38] The Theano Development Team, R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller,
D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, A. Belopolsky, Y. Ben-
gio, A. Bergeron, J. Bergstra, V. Bisson, J. B. Snyder, N. Bouchard, N. Boulanger-
Lewandowski, X. Bouthillier, A. de Brebisson, O. Breuleux, P.-L. Carrier, K. Cho,
J. Chorowski, P. Christiano, T. Cooijmans, M.-A. Cote, M. Cote, A. Courville, Y. N.
Dauphin, O. Delalleau, J. Demouth, G. Desjardins, S. Dieleman, L. Dinh, M. Ducoffe,
V. Dumoulin, S. E. Kahou, D. Erhan, Z. Fan, O. Firat, M. Germain, X. Glorot,
I. Goodfellow, M. Graham, C. Gulcehre, P. Hamel, I. Harlouchet, J.-P. Heng, B. Hidasi,
S. Honari, A. Jain, S. Jean, K. Jia, M. Korobov, V. Kulkarni, A. Lamb, P. Lamblin,
E. Larsen, C. Laurent, S. Lee, S. Lefrancois, S. Lemieux, N. Leonard, Z. Lin, J. A.
Livezey, C. Lorenz, J. Lowin, Q. Ma, P.-A. Manzagol, O. Mastropietro, R. T. McGib-
bon, R. Memisevic, B. van Merrienboer, V. Michalski, M. Mirza, A. Orlandi, C. Pal,
R. Pascanu, M. Pezeshki, C. Raffel, D. Renshaw, M. Rocklin, A. Romero, M. Roth,
P. Sadowski, J. Salvatier, F. Savard, J. Schluter, J. Schulman, G. Schwartz, I. V.
Serban, D. Serdyuk, S. Shabanian, . Simon, S. Spieckermann, S. R. Subramanyam,
J. Sygnowski, J. Tanguay, G. van Tulder, J. Turian, S. Urban, P. Vincent, F. Visin,
H. de Vries, D. Warde-Farley, D. J. Webb, M. Willson, K. Xu, L. Xue, L. Yao, S. Zhang,
and Y. Zhang, “Theano: A Python framework for fast computation of mathematical
expressions,” pp. 1–19, 2016.
[39] A. Paszke, G. Chanan, Z. Lin, S. Gross, E. Yang, L. Antiga, and Z. Devito, “Automatic
differentiation in PyTorch,” Advances in Neural Information Processing Systems 30,
no. Nips, pp. 1–4, 2017.
[40] J. Zacharias, M. Barz, and D. Sonntag, “A Survey on Deep Learning Toolkits and
Libraries for Intelligent User Interfaces,” 2018.
[41] T. Nomi, “tiny-dnn.” https://github.com/tiny-dnn/tiny-dnn, 2017.
[42] NVIDIA, "NVIDIA CUDA C Programming Guide, PG-02829-001 v9.1," 2018.
[43] F. Kintz, "GPU Performance Enhancement." https://wiki.tum.de/display/lfdv/GPU+Performance+Enhancement/, 2017.
[44] H. Chauhan, "Nvidia Is Running Away With the GPU Market." https://www.fool.com/investing/2017/12/06/nvidia-is-running-away-with-the-gpu-market.aspx, 2017.
[45] NVIDIA, "GEFORCE GTX 1080 Ti." https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/, 2017.
[46] NVIDIA, "NVIDIA's Next Generation CUDA Compute Architecture: Fermi," white
paper, pp. 1–22, 2009.
[47] D. Kirk, “NVIDIA cuda software and gpu parallel computing architecture,” Proceedings
of the 6th international symposium on Memory management - ISMM ’07, pp. 103–104,
2007.
[48] P. Warden, “Why GEMM is at the Heart of Deep Learning.”
https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/,
2015.
[49] X. Li, G. Zhang, H. H. Huang, Z. Wang, and W. Zheng, “Performance Analysis of
GPU-Based Convolutional Neural Networks,” 2016 45th International Conference on
Parallel Processing (ICPP), pp. 67–76, 2016.
[50] S. Hadjis, F. Abuzaid, C. Zhang, and C. Re, “Caffe con Troll,” in Proceedings of the
Fourth Workshop on Data analytics in the Cloud - DanaC’15, pp. 1–4, 2015.
[51] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shel-
hamer, “cuDNN: Efficient Primitives for Deep Learning,” arXiv preprint arXiv:1410.0759, pp. 1–9, 2014.
[52] F. Abuzaid, S. Hadjis, C. Zhang, and C. Re, “Caffe con Troll: Shallow Ideas to Speed
Up Deep Learning,” 2015.
[53] J. Keuper and F.-J. Pfreundt, “Distributed training of deep neural networks: Theoret-
ical and practical limits of parallel scalability,” Proceedings of MLHPC 2016: Machine
Learning in HPC Environments - Held in conjunction with SC 2016: The Interna-
tional Conference for High Performance Computing, Networking, Storage and Analysis,
pp. 19–26, 2017.
[54] D. C. Ciresan, “Simple C/C++ code for training and testing MLPs and CNNs.”
http://people.idsia.ch/~ciresan/data/net.zip.
[55] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed analysis of the KDD
CUP 99 data set,” in Proceedings of the IEEE Symposium on Computational Intelligence
for Security and Defense Applications (CISDA), pp. 1–6, 2009.
[56] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning Internal Representations
by Error Propagation,” 1986.
[57] D. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, “Flexible, High
Performance Convolutional Neural Networks for Image Classification,” International
Joint Conference on Artificial Intelligence (IJCAI) 2011, pp. 1237–1242, 2011.
[58] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep
Convolutional Neural Networks,” Advances in Neural Information Processing Systems
25 (NIPS 2012), pp. 1097–1105, 2012.
[59] Y. L. Boureau, J. Ponce, and Y. LeCun, “A theoretical analysis of feature pooling in
visual recognition,” 27th International Conference on Machine Learning, 2010.
[60] J. Principe, N. Euliano, and W. Lefebvre, Neural and Adaptive Systems: Fundamentals
Through Simulation: Multilayer Perceptrons. 1997.
[61] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and Accurate Deep Network
Learning by Exponential Linear Units (ELUs),” in ICLR, 2016.
[62] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On Large-
Batch Training for Deep Learning: Generalization Gap and Sharp Minima,” pp. 1–16,
2016.
[63] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Neural Networks: Tricks of the
Trade,” Springer Lecture Notes in Computer Science, 1998.
[64] M. M. Rahman and D. N. Davis, “Addressing the Class Imbalance Problem in Medical
Datasets,” International Journal of Machine Learning and Computing, pp. 224–228,
2013.
[65] F. Provost, “Machine learning from imbalanced data sets 101,” in Proceedings of the
AAAI’2000 Workshop on Imbalanced Data Sets, 2000.
[66] N. Japkowicz, “Learning from Imbalanced Data Sets: A Comparison of Various Strate-
gies,” AAAI workshop on learning from imbalanced data sets, vol. 68, pp. 10–15, 2000.
[67] Z. H. Zhou and X. Y. Liu, “Training cost-sensitive neural networks with methods ad-
dressing the class imbalance problem,” IEEE Transactions on Knowledge and Data
Engineering, vol. 18, no. 1, pp. 63–77, 2006.
[68] R. Reed, “Pruning Algorithms - A Survey,” IEEE Transactions on Neural Networks,
vol. 4, no. 5, pp. 740–747, 1993.
[69] B. Fritzke, “Growing cell structures-A self-organizing network for unsupervised and
supervised learning,” Neural Networks, vol. 7, no. 9, pp. 1441–1460, 1994.
[70] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov,
“Improving neural networks by preventing co-adaptation of feature detectors,” pp. 1–
18, 2012.
[71] R. A. Jacobs, “Increased rates of convergence through learning rate adaptation,” Neural
Networks, vol. 1, no. 4, pp. 295–307, 1988.
[72] A. Krogh and J. A. Hertz, “A Simple Weight Decay Can Improve Generalization,”
Advances in Neural Information Processing Systems, vol. 4, pp. 950–957, 1992.
[73] G. E. Hinton, “Learning translation invariant recognition in a massively parallel
network,” in Proceedings of PARLE Conference on Parallel Architectures and Languages
Europe, pp. 1–13, 1987.
Appendix A
Mathematical Derivations
A.1 Forward and Backpropagation
The math for the training works as follows: the loss function, or overall error, is defined
as the average of the errors for the individual patterns, or input vectors (A.1). The error
of each pattern is defined as the sum of squares of the differences between the measured
and target outputs for each output of the network; in the case of digit recognition, this is
the sum of 10 squares (A.2). The sum is normalized by the number of outputs so that the
error does not depend on how many outputs the network has, and the factor of two is
included simply so that it cancels when the derivative is computed, saving multiplications
later. Note that in classification problems such as those discussed in this thesis, the target
output is 1 for the correct label and 0 for every other label.
\[ E := \frac{1}{N_p} \sum_p E_p \tag{A.1} \]

\[ E_p := \frac{1}{2N_o} \sum_m (O_m - T_m)^2 \tag{A.2} \]
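To make (A.1) and (A.2) concrete, here is a minimal NumPy sketch (illustrative only, not the thesis's implementation; the function and array names are chosen for this example) that computes the per-pattern error and the overall error exactly as defined above:

```python
import numpy as np

def pattern_error(outputs, targets):
    """E_p per (A.2): sum of squared output errors, normalized by 2 * N_o."""
    n_outputs = outputs.shape[-1]
    return np.sum((outputs - targets) ** 2, axis=-1) / (2 * n_outputs)

def overall_error(all_outputs, all_targets):
    """E per (A.1): average of E_p over all N_p patterns."""
    return np.mean(pattern_error(all_outputs, all_targets))
```

For a one-hot target vector in the digit-recognition case described above, `pattern_error` reduces to the sum of 10 squared differences divided by 20.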
A given neuron’s output, as discussed earlier, is based on the activation function, and the
net is the weighted sum of the inputs plus the bias.
\[ O_j := \phi(net_j) \tag{A.3} \]

\[ net_j := \sum_i w_{ij} O_i + b_j \tag{A.4} \]
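As a concrete rendering of (A.3) and (A.4), the sketch below forward propagates one fully connected layer. It assumes a sigmoid activation for φ purely for illustration; the derivation holds for any differentiable activation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_forward(inputs, weights, biases, phi=sigmoid):
    """One layer: net_j = sum_i w_ij * O_i + b_j (A.4), then O_j = phi(net_j) (A.3).
    weights has shape (n_inputs, n_neurons); inputs holds the previous layer's outputs O_i."""
    net = inputs @ weights + biases  # (A.4)
    return phi(net)                  # (A.3)
```

Stacking calls to `layer_forward`, feeding each layer's outputs into the next, yields the full forward pass of an MLP.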
The last definition needed is the update of the weights, called gradient descent, in which
each weight is moved in the direction opposite to its effect on the computed error. The
update has a direction, given by the partial derivative of the error with respect to the
weight, and a magnitude, scaled by a constant referred to as the learning rate η.
\[ w_{ij}(k+1) := w_{ij}(k) - \eta \frac{\partial E}{\partial w_{ij}} \tag{A.5} \]
If an adaptive learning rate with decay factor 0 < a ≤ 1 is used, then after each epoch the
learning rate is updated:

\[ \eta(k+1) := a\,\eta(k) \tag{A.6} \]

To disable the adaptive learning rate, a is set to 1, keeping η constant.
If momentum α is used, then the ∂E/∂w_{ij} term in (A.5) is replaced by (∂E/∂w_{ij})*, a
weighted average of the gradient with its history:

\[ \left( \frac{\partial E}{\partial w_{ij}} \right)^{*} := \alpha \left( \frac{\partial E}{\partial w_{ij}} \right)^{*} + (1 - \alpha) \frac{\partial E}{\partial w_{ij}} \tag{A.7} \]
To disable momentum, α is simply set to 0.
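A sketch of the update rules (A.5) through (A.7) follows (hypothetical code, with illustrative names). Note that (A.7) defines momentum as an exponentially weighted average of the gradient with its own history, rather than the accumulated-velocity form used in many libraries; the sketch follows (A.7) literally:

```python
import numpy as np

class GradientDescent:
    """Implements (A.5) with the adaptive learning rate (A.6) and momentum (A.7)."""

    def __init__(self, lr=0.1, decay=1.0, momentum=0.0):
        self.lr = lr              # eta in (A.5)
        self.decay = decay        # a in (A.6); a = 1 keeps eta constant
        self.momentum = momentum  # alpha in (A.7); alpha = 0 disables momentum
        self.grad_avg = None      # the starred, history-averaged gradient

    def step(self, weights, grad):
        if self.grad_avg is None:
            self.grad_avg = np.zeros_like(grad)
        # (A.7): weighted average of the current gradient with its history
        self.grad_avg = self.momentum * self.grad_avg + (1 - self.momentum) * grad
        # (A.5): step against the (averaged) gradient
        return weights - self.lr * self.grad_avg

    def end_epoch(self):
        # (A.6): decay the learning rate once per epoch
        self.lr *= self.decay
```

With `decay=1.0` and `momentum=0.0` this reduces to plain gradient descent, matching the disabling conditions stated above.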
Using (A.1), ∂E/∂w_{ij} can be solved for:

\[ \frac{\partial E}{\partial w_{ij}} = \frac{1}{N_p} \sum_p \frac{\partial E_p}{\partial w_{ij}} \tag{A.8} \]
And the partial derivative ∂E_p/∂w_{ij} can be solved for using the chain rule:

\[ \frac{\partial E_p}{\partial w_{ij}} = \frac{\partial E_p}{\partial O_j} \cdot \frac{\partial O_j}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ij}} \tag{A.9} \]
The last two components are easily solved for.
\[
\begin{aligned}
\frac{\partial net_j}{\partial w_{ij}} &= \frac{\partial}{\partial w_{ij}} \left( \sum_m w_{mj} O_m + b_j \right) \\
&= \frac{\partial}{\partial w_{ij}} \left( w_{0j} O_0 + w_{1j} O_1 + \dots + w_{ij} O_i + \dots + w_{mj} O_m + b_j \right) \\
&= O_i
\end{aligned} \tag{A.10}
\]
\[ \frac{\partial O_j}{\partial net_j} = \frac{\partial}{\partial net_j} \phi(net_j) \tag{A.11} \]
Note that this requires the activation function to be differentiable. Solving for the final
component, ∂E_p/∂O_j, is more complex. If O_j is in the output layer, then its effect on E_p
is easy to calculate using (A.2):
\[
\begin{aligned}
\frac{\partial E_p}{\partial O_j} &= \frac{\partial}{\partial O_j} \left( \frac{1}{2N_o} \sum_m (O_m - T_m)^2 \right) \\
&= \frac{1}{2N_o} \frac{\partial}{\partial O_j} \left( (O_0 - T_0)^2 + (O_1 - T_1)^2 + \dots + (O_j - T_j)^2 + \dots + (O_m - T_m)^2 \right) \\
&= \frac{1}{2N_o} \cdot 2(O_j - T_j) \\
&= \frac{1}{N_o} (O_j - T_j)
\end{aligned} \tag{A.12}
\]
However, if O_j is not in the output layer, then its contribution to the error is the sum of
its contributions through each of the neurons in the layer above that use its output as one
of their inputs, which in the case of an MLP means all the neurons in the layer above:
\[ \frac{\partial E_p}{\partial O_j} = \sum_m \frac{\partial E_p}{\partial O_m} \cdot \frac{\partial O_m}{\partial O_j} \tag{A.13} \]
where m iterates through all the neurons of the layer above. Solving for these components,
the chain rule can again be used to obtain
\[ \frac{\partial O_m}{\partial O_j} = \frac{\partial O_m}{\partial net_m} \cdot \frac{\partial net_m}{\partial O_j} \tag{A.14} \]
where ∂O_m/∂net_m can be solved using (A.11). ∂net_m/∂O_j was defined with the output O_j of
neuron j as an input to neuron m (as set up in (A.13)), so it can easily be solved for by
expanding the sum:
\[
\begin{aligned}
\frac{\partial net_m}{\partial O_j} &= \frac{\partial}{\partial O_j} \left( \sum_n w_{nm} O_n + b_m \right) \\
&= \frac{\partial}{\partial O_j} \left( w_{0m} O_0 + w_{1m} O_1 + \dots + w_{jm} O_j + \dots + w_{nm} O_n + b_m \right) \\
&= w_{jm}
\end{aligned} \tag{A.15}
\]
where n iterates through all the inputs to neuron m, effectively iterating through all the
neurons in the layer of neuron j.
Combining (A.13), (A.14), (A.11), and (A.15):
\[ \frac{\partial E_p}{\partial O_j} = \sum_m \frac{\partial E_p}{\partial O_m} \cdot \frac{\partial \phi(net_m)}{\partial net_m} \cdot w_{jm} \tag{A.16} \]
Combining the equations above, we get the following:
\[ \frac{\partial E_p}{\partial w_{ij}} = \frac{\partial E_p}{\partial O_j} \cdot \frac{\partial O_j}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ij}} \tag{A.9} \]

\[ \frac{\partial E_p}{\partial O_j} = \begin{cases} \dfrac{1}{N_o}(O_j - T_j) & \text{if } O_j \text{ is in the output layer} \\[2ex] \displaystyle\sum_m \frac{\partial E_p}{\partial O_m} \cdot \frac{\partial O_m}{\partial net_m} \cdot w_{jm} & \text{otherwise} \end{cases} \tag{A.17} \]

\[ \frac{\partial O_j}{\partial net_j} = \frac{\partial}{\partial net_j} \phi(net_j) \tag{A.11} \]

\[ \frac{\partial net_j}{\partial w_{ij}} = O_i \tag{A.10} \]
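This summary translates almost line for line into code. The following sketch (again illustrative, not the thesis's implementation) computes the weight gradients of a two-layer MLP for a single pattern, assuming sigmoid activations so that ∂O/∂net = O(1 − O) per (A.11):

```python
import numpy as np

def backprop_two_layer(x, o_hidden, o_out, target, w_out):
    """Per-pattern weight gradients via (A.17), (A.11), and (A.10).
    x: input vector; o_hidden, o_out: layer outputs saved from the forward pass;
    w_out: hidden-to-output weights of shape (n_hidden, n_out).
    Bias gradients are omitted for brevity."""
    n_out = o_out.shape[0]

    # Output layer: top case of (A.17), times the sigmoid derivative (A.11)
    dE_dO_out = (o_out - target) / n_out
    delta_out = dE_dO_out * o_out * (1 - o_out)

    # Hidden layer: bottom case of (A.17), summing over the layer above
    dE_dO_hid = w_out @ delta_out
    delta_hid = dE_dO_hid * o_hidden * (1 - o_hidden)

    # (A.10): dnet_j/dw_ij = O_i, so each gradient is an outer product
    grad_w_out = np.outer(o_hidden, delta_out)
    grad_w_hid = np.outer(x, delta_hid)
    return grad_w_hid, grad_w_out
```

Averaging these per-pattern gradients over a batch and passing the result to the update rule (A.5) completes one training step.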
A.2 Softmax and Cross Entropy
When using softmax as the final activation function, a different error function from the one
defined in (A.2) is used. With softmax, each output is defined as:

\[ O_i := \frac{e^{net_i}}{\sum_j e^{net_j}} \tag{A.18} \]
with j iterating through all the neurons in the output layer. The error function replacing
(A.2) is the cross-entropy:

\[ E_p := -\sum_j t_j \log O_j \tag{A.19} \]
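A short sketch of (A.18) and (A.19) follows (hypothetical code; the max-subtraction inside `softmax` is a standard numerical-stability trick that is not part of the derivation, and it leaves (A.18) unchanged because it scales numerator and denominator by the same constant):

```python
import numpy as np

def softmax(net):
    """(A.18): O_i = exp(net_i) / sum_j exp(net_j), shifted by max(net) for stability."""
    e = np.exp(net - np.max(net))
    return e / np.sum(e)

def cross_entropy(outputs, targets, eps=1e-12):
    """(A.19): E_p = -sum_j t_j * log(O_j); eps guards against log(0)."""
    return -np.sum(targets * np.log(outputs + eps))
```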
Of the error as derived in (A.9), only the ∂E_p/∂O_j · ∂O_j/∂net_j part changes in this case,
so we will derive that product again with the new error function:
\[ \frac{\partial E_p}{\partial net_i} = \sum_j \frac{\partial E_p}{\partial O_j} \cdot \frac{\partial O_j}{\partial net_i} \tag{A.20} \]
where j iterates through all the neurons in the output layer. Solving for those two parts:
\[
\frac{\partial E_p}{\partial O_j} = \frac{\partial}{\partial O_j} \left( -\sum_k t_k \log O_k \right)
= -t_j \cdot \frac{\partial \log O_j}{\partial O_j}
= \frac{-t_j}{O_j}
\tag{A.21}
\]
where O_j is a specific output and k iterates through all the neurons in the output layer.
Using the quotient rule:
\[
\begin{aligned}
\frac{\partial O_j}{\partial net_i} &= \frac{\partial}{\partial net_i} \left( \frac{e^{net_j}}{\sum_k e^{net_k}} \right) \\
&= \frac{\left( \frac{\partial e^{net_j}}{\partial net_i} \right) \left( \sum_k e^{net_k} \right) - e^{net_j} \left( \frac{\partial \sum_k e^{net_k}}{\partial net_i} \right)}{\left( \sum_k e^{net_k} \right)^2} \\
&= \frac{\left( \frac{\partial e^{net_j}}{\partial net_i} \right) \left( \sum_k e^{net_k} \right) - e^{net_j} e^{net_i}}{\left( \sum_k e^{net_k} \right)^2} \\
&= \begin{cases}
\dfrac{e^{net_i} \left( \sum_k e^{net_k} \right) - \left( e^{net_i} \right)^2}{\left( \sum_k e^{net_k} \right)^2} = O_i - O_i^2 & i = j \\[2.5ex]
\dfrac{0 \cdot \left( \sum_k e^{net_k} \right) - e^{net_j} e^{net_i}}{\left( \sum_k e^{net_k} \right)^2} = -O_j O_i & i \neq j
\end{cases}
\end{aligned} \tag{A.22}
\]
where the simplification in each case uses the definition of the outputs in (A.18).
Next, plugging (A.21) and (A.22) into (A.20) gives:
\[
\begin{aligned}
\frac{\partial E_p}{\partial net_i} &= \sum_j \left( \frac{-t_j}{O_j} \right) \frac{\partial O_j}{\partial net_i} \\
&= \overbrace{\left[ \sum_j \left( \frac{-t_j}{O_j} \right) (-O_j O_i) \right]}^{\text{assumes } i \neq j \text{ for all terms}} - \overbrace{\left( \frac{-t_i}{O_i} \right) (-O_i O_i)}^{\text{subtract wrong } i = j \text{ term}} + \overbrace{\left( \frac{-t_i}{O_i} \right) (O_i - O_i^2)}^{\text{add correct } i = j \text{ term}} \\
&= \left[ \sum_j t_j O_i \right] - t_i O_i - t_i (1 - O_i) \\
&= O_i \underbrace{\sum_j t_j}_{\text{sum of targets is } 1} - \; t_i O_i - t_i + t_i O_i \\
&= O_i - t_i
\end{aligned} \tag{A.23}
\]
So when using softmax as the activation function on the output layer, the loss function
changes from (A.2) to (A.19), and when backpropagating, ∂E_p/∂net_i for a neuron i in the
output layer changes from the product of (A.12) and (A.11) to the result in (A.23). This
concludes the derivation of the math required to train a DNN.
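The clean form of (A.23) is easy to verify numerically. The following self-contained sketch (illustrative code, not from the thesis, redefining the softmax and cross-entropy helpers so the snippet stands alone) compares the analytic gradient O_i − t_i against a central finite-difference estimate of ∂E_p/∂net_i; it uses a one-hot target, consistent with the assumption in (A.23) that the targets sum to 1:

```python
import numpy as np

def softmax(net):
    e = np.exp(net - np.max(net))
    return e / np.sum(e)

def cross_entropy(outputs, targets):
    return -np.sum(targets * np.log(outputs))

def check_gradient(net, targets, h=1e-6):
    """Largest discrepancy between the analytic gradient (A.23) and finite differences."""
    analytic = softmax(net) - targets  # (A.23)
    numeric = np.zeros_like(net)
    for i in range(net.size):
        up, down = net.copy(), net.copy()
        up[i] += h
        down[i] -= h
        numeric[i] = (cross_entropy(softmax(up), targets)
                      - cross_entropy(softmax(down), targets)) / (2 * h)
    return np.max(np.abs(analytic - numeric))

print(check_gradient(np.array([1.0, -0.5, 2.0]), np.array([0.0, 0.0, 1.0])))
# prints a tiny value (on the order of 1e-10), consistent with (A.23)
```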