Enhanced Neural Network Training Using Selective
Backpropagation and Forward Propagation
Shiri Bendelac
Thesis submitted to the Faculty of the
Virginia Polytechnic Institute and State University
in partial fulfillment of the requirements for the degree of
Master of Science
in
Computer Engineering
Joseph M. Ernst, Co-chair
Jia-Bin Huang, Co-chair
Christopher L. Wyatt
William C. Headley
May 7, 2018
Blacksburg, Virginia
Keywords: Machine Learning, Neural Networks, Convolutional Neural Networks,
Backpropagation, Selective Training
Copyright 2018, Shiri Bendelac
Enhanced Neural Network Training Using Selective
Backpropagation and Forward Propagation
Shiri Bendelac
(ABSTRACT)
Neural networks are making headlines every day as the tool of the future, powering artificial
intelligence programs and supporting technologies never seen before. However, the training
of neural networks can take days or even weeks for bigger networks, and requires the use of
supercomputers and GPUs in academia and industry in order to achieve state-of-the-art
results. This thesis discusses employing selective measures to determine when to backpropagate
and forward propagate in order to reduce training time while maintaining classification
performance. This thesis tests these new algorithms on the MNIST and CASIA datasets,
and achieves successful results with both algorithms on the two datasets. The selective
backpropagation algorithm shows a reduction of up to 93.3% in backpropagations completed, and
the selective forward propagation algorithm shows a reduction of up to 72.90% in forward
propagations and backpropagations completed, compared to baseline runs of always forward
propagating and backpropagating. This work also discusses employing the selective
backpropagation algorithm on a modified dataset with disproportional under-representation of
some classes compared to others.
Enhanced Neural Network Training Using Selective
Backpropagation and Forward Propagation
Shiri Bendelac
(GENERAL AUDIENCE ABSTRACT)
Neural Networks are some of the most commonly used and best performing tools in machine
learning. However, training them to perform well is a tedious task that can take days or even
weeks, since bigger networks perform better but take exponentially longer to train. What
can be done to reduce training time? Imagine a student studying for a test. The student
likely solves practice problems spanning the different topics that may appear on the test.
The student then evaluates which topics he/she knew well, and forgoes extensive practice
and review on those in favor of focusing on topics he/she missed or was not as confident
on. This thesis discusses following a similar approach in training neural networks in order
to reduce the training time needed to achieve desired performance levels.
Acknowledgments
I would like to thank my committee, Dr. Ernst, Dr. Huang, Dr. Wyatt, and Dr. Headley,
for supporting me in my work and guiding me throughout my undergraduate and graduate
career. Thank you to everyone at the Hume Center for being a valuable resource and
encouraging me to take on new challenges. Thank you to the OPM for supporting my
academic endeavors through the SFS program. Thank you to my friends for the great
memories from the past five years at VT. Thank you to everyone I’ve worked and interacted
with in the past few years who helped me get to this point. And finally, a big thank you to
my family for their endless love and support.
Contents

List of Figures
List of Tables

1 Introduction
1.1 Motivation
1.2 Contributions
1.3 Related Work
1.4 Organization of Paper

2 Datasets
2.1 MNIST
2.2 CASIA
2.2.1 Preprocessing

3 NN Library
3.1 Motivation for Developing a New Library
3.1.1 Understanding GPUs
3.2 Library Qualification
3.3 Library API
3.3.1 NeuralNet
3.3.2 ImageIn
3.4 MLP
3.5 CNN
3.6 Pooling
3.7 Activation Functions
3.8 Training
3.9 dE/dw
3.10 Validation, Testing, and Weights Logging
3.11 Augmentation
3.12 Input File Parameters

4 Selective Backpropagation
4.1 Motivation
4.2 Procedure
4.3 Results
4.3.1 MNIST
4.3.2 CASIA

5 Selective Forward Propagation
5.1 Motivation
5.2 Procedure
5.3 Results
5.3.1 MNIST
5.3.2 CASIA

6 Future Work
6.1 Future Work

7 Conclusions
7.1 Conclusions

Bibliography

Appendices
A Mathematical Derivations
A.1 Forward and Backpropagation
A.2 Softmax and Cross Entropy
List of Figures

2.1 Sample digits from the MNIST dataset. The training set includes 6,000 pictures from each category, each of which is 28x28 grayscale pixels.
2.2 Two images of the same Chinese character, before and after preprocessing to resize, center, and increase contrast.
3.1 Sample multilayer perceptron neural network with four inputs, one hidden layer with six neurons, and three outputs.
3.2 A neural network neuron j calculates its output based on its inputs, weights, bias, and activation function.
3.3 Two types of pooling: max pooling and average pooling, which may be added after a convolutional layer in order to downsample its input.
3.4 Commonly used activation functions. In recent years, ReLU has become the most popular.
3.5 Augmentation visualization tool, showing the original image on the left and the augmented version on the right.
4.1 Number of backpropagations vs. epoch duration (s) with a 0.9 BP filter.
4.2 Epoch duration during training with BP 1.0 under different environment conditions.
4.3 Architecture of the CNN used to classify MNIST.
4.4 MNIST testing accuracy with different BP thresholds. When plotting accuracy as a function of epochs passed, the curves show no significant variation between them.
4.5 MNIST testing accuracy with different BP thresholds. BP 1.0, the baseline of always backpropagating, takes longer (more BPs) to achieve the same performance as the other curves on the graph. It catches up but does not show better performance than runs with a lower BP threshold in the long run.
4.6 Zooming in on the initial relevant section of MNIST testing accuracy with different BP thresholds. BP 1.0, the baseline of always backpropagating, initially significantly underperforms runs that selectively backpropagate, and takes time to catch up to them after they plateau.
4.7 Plotting both the performance on the testing set and the number of BPs performed in each epoch shows the rapid decrease in BPs performed, dropping below 10,000 after only 7 epochs from the initial 54,000, which would remain constant for a baseline test. By that point, performance reaches 96.04% on the testing set. It goes on to pass 98.7%, with the majority of that time spent completing fewer than 5,000 BPs per epoch.
4.8 Testing accuracy over BPs on the disproportional MNIST dataset.
4.9 Error matrices early in the training process, training a neural network on a disproportional MNIST dataset.
4.10 Histogram of total misses per class early in the training process.
4.11 Error matrices at the end of the training process, training a neural network on a disproportional MNIST dataset.
4.12 Histogram of total misses per class at the end of the training process.
4.13 Architecture of the CNN used to classify CASIA.
4.14 CASIA testing accuracy with different BP thresholds. When plotting accuracy as a function of epochs passed, the curves show no significant variation between them.
4.15 CASIA testing accuracy with different BP thresholds. BP 1.0, the baseline of always backpropagating, stagnates behind all other curves, taking at least 2.2 times as long to reach 92% testing accuracy.
4.16 Zooming in on the relevant section of CASIA testing accuracy with different BP thresholds. BP 1.0, the baseline of always backpropagating, significantly underperforms runs that selectively backpropagate, and shows no sign of catching up to the rate at which they improve.
5.1 Number of forward propagations vs. epoch duration (s) at FP threshold = 0.5 and max delay = 15, showing a positive linear relationship between FPs and epoch duration.
5.2 Testing accuracy over FPs on the standard MNIST dataset with different FP max delays. In all runs, the FP threshold is set to 0.5. Over time, all runs converge to approximately the same level of performance.
5.3 Zooming in on the initial relevant section of MNIST testing accuracy with different FP max delays, with the FP threshold set to 0.5. FP 1, the baseline of always forward propagating and always backpropagating, lags behind the other curves.
5.4 Training on different subsets of MNIST shows that using the selective BP and FP algorithms results in increased generalization variability over the baseline.
5.5 Testing accuracy over FPs on the CASIA subset dataset with various FP max delays, with the FP threshold set to 0.5. Over time, runs converge to approximately the same performance on the testing set.
5.6 Zooming in on the initial relevant section of Fig. 5.5. FP max delay of 1, the baseline, is seen lagging behind the other curves before they all plateau.

List of Tables

3.1 Library verification results showing the integration process for features and validation against another framework.
5.1 Analysis of BP and FP combinations, showing potential benefits in performance and time reduction.
7.1 Summary of time improvements achieved with selective BP and selective FP, including on the modified imbalanced MNIST.
Chapter 1
Introduction
1.1 Motivation
Artificial Intelligence has become one of the biggest fields in industry and academia, with
neural networks becoming a tool widely deployed to solve all sorts of different problems,
from medical applications such as detecting cancer [1, 2], to self-driving cars [3, 4, 5], Optical
Character Recognition (OCR) [6, 7], cyber security [8, 9], face recognition [10], and cognitive
radios and spectrum sensing [11, 12, 13], to name a few.
Training a neural network (NN), however, takes time. The more data available to train the
network, and the bigger the network, the longer the training takes. This can result in training
routines that take weeks [14, 15, 16]. While letting the training phase take that long may
lead to record-breaking network performance, it may be an impractical constraint that
takes a toll on the ability to train networks quickly. Furthermore, it is the reason that tech
giants invest in farms of Graphics Processing Units (GPUs), an expensive technology, to try
and reduce the training time, while others are testing other specialized hardware designed
for these computations, as well as for faster inference [17, 18, 19, 20].
This paper proposes changes to the training routine by taking a closer look at the forward
and backpropagation, and introducing a new decision process for whether or not to train
on certain input vectors. The results discussed show that training with serial routines
takes significantly less time. Runs that employ the selective BP algorithm
show a reduction of up to 93.3% of backpropagations completed, and runs with the selective
forward propagation algorithm show a reduction of up to 72.90% in forward propagations
and backpropagations completed compared to baseline runs of always forward propagating
and backpropagating. Employing selective algorithms to convergence shows they plateau
at approximately the same level of performance as baseline, although they do show more
variability in generalization. The selective BP algorithm is also examined on imbalanced
data and shows equal promise in reducing training time.
1.2 Contributions
This research seeks to discuss an opportunity to reduce training time without hindering per-
formance by introducing conditional forward propagation and backpropagation, as opposed
to current methods of forward propagating and backpropagating every single input vector
iteratively until performance ceases to improve. In doing so, a faster training algorithm is
developed that can be used on lower-cost and lower-resource platforms, such as low-end CPUs,
embedded devices, in-the-field devices, and time-sensitive applications where the speed of
the learning curve is critical. The work also shows potential improvement in training time
on imbalanced datasets.
Additionally, a significant effort in this work went into the development of a framework that
supports these algorithms. The source code is expected to be released for public use in the
near future.
The work discussed in Chapter 4 of this thesis will also be submitted as a conference paper.
1.3 Related Work
Different algorithms have been proposed under the name of selective training or selective
backpropagation, or otherwise resembling some of the work in this thesis. Engelbrecht pro-
posed two modifications to neural network training routines: incremental learning and selec-
tive learning [21]. Both are centered around the idea of training the network to convergence
on small subsets of the training set before exposing the network to other data, either by
adding the new subset to the subsets already used in training (incremental learning), or by
switching to the new subset and keeping the training subset used for any given epoch small.
These algorithms are both different from the work discussed here, which uses the network’s
performance as a guide in selecting which vectors to use for training.
Some work has been done to introduce another form of selective backpropagation on RBF
networks [22]. In this work, regular backpropagation is used for the majority of the training,
then when the loss function is minimal, binary backpropagation is used based on
correct/incorrect classification of vectors. This was done primarily to prevent overtraining of RBF
networks, and learn the last few unlearned vectors in the training set to reach better perfor-
mance.
Listprop [23] is another algorithm that shares resemblance with the selective BP algorithm
described here. This algorithm builds a list of which output neurons should be included in
the backpropagation, hence its name. Little work exists that fully tests Listprop's potential,
though it largely targeted weight saturation. Listprop showed some reduction in training
time and increased peak performance, though this work was done with a 3-layered MLP, and
the benefits may decrease with deeper networks.
The work discussed in this thesis is performed on classification problems using CNNs. How-
ever, in the object detection field specifically, focal loss [24] and hard example mining [25]
follow similar ideas. Focal loss modifies the equation for cross entropy loss to better overcome
class imbalance in a one-stage object detector. Hard example mining is an example-bootstrapping
technique for neural networks on detection problems, with some resemblance to the incremental
learning algorithm discussed earlier. This also helps overcome class imbalance, which is
common in object detection problems with regards to balancing foreground and background
regions of interest (RoIs). Shrivastava’s work, similarly to the work discussed here, only
backpropagates difficult vectors, which it picks out as training takes place. The hard exam-
ple mining happens in two stages: first, the dataset is forward passed through the network,
and the difficult vectors are picked out. Then, once enough vectors have been marked as
problematic, the training occurs, which involves repeatedly backpropagating all those vec-
tors. Once this is done, the process returns to iterating through the training set to
collect another subset of problematic vectors. This differs from the work in this thesis, where
each batch is only backpropagated once before the network continues to iterate through the
training set. Repeatedly using the vectors that have already been selected as hard allows the
network to spend more time training on difficult examples, rather than forward propagating
the majority of the training set that it already performs well on. However, the work in this
thesis achieves similar theoretical performance when combined with selective forward
propagation: after one epoch of examining all vectors, the network would know to skip the easy
examples, meaning the next few epochs would consist only of forward and backpropagating
the challenging vectors that the hard example mining algorithm would iterate through.
1.4 Organization of Paper
Chapter 2 provides an introduction to the datasets used in this work, MNIST and CASIA.
The chapter explains why these two datasets were chosen, and what pre-processing was
completed prior to training the network on this data, as well as the motivations for these
choices.
Chapter 3 discusses the NN library developed for this research. The chapter opens with a
survey of existing frameworks, as well as factors that were considered for this application,
explaining why the ultimate decision was to develop a new framework. This is followed by a
deeper examination of the library, including its structure and public API, as well as a discus-
sion of different features implemented in the library, including MLP and convolution, types
of pooling, activation functions, training validation and testing procedures, dE/dw normal-
ization, weights logging, data augmentation, and explanation of the input configuration file
format.
Chapter 4 discusses the selective backpropagation portion of this work, including an in-depth
discussion of the motivation for this method of training, the approach used, and analysis
of the results achieved on both the MNIST and CASIA datasets, as well as a modified
imbalanced MNIST subset.
Chapter 5 delves into the other portion of this work, that is, selective forward propagation.
Following a similar structure, it discusses the motivation and rationale for applying a selective
algorithm on the forward propagation element of training neural networks, the approach
used, and discusses results achieved on MNIST and CASIA. This chapter also includes an
analysis of the algorithms’ generalization.
Chapter 6 suggests future work that may be completed in order to further develop ideas and
results displayed in this work.
Finally, Chapter 7 concludes this paper with a summary of the work completed, its motiva-
tion, and the results achieved.
Chapter 2
Datasets
For deep neural networks to work well, large amounts of data are needed for training. This
chapter discusses the two datasets used for this thesis: MNIST and CASIA, including infor-
mation about them, as well as why they were chosen.
2.1 MNIST
First published in [26], MNIST [27] has become one of the most popular databases to test
neural networks [28]. MNIST consists of images of handwritten digits, as seen in Fig. 2.1. The
images are in grayscale, so each pixel is represented as an unsigned byte with values ranging
from 0 to 255, and each image is composed of 784 pixels (28x28). Images were preprocessed to
normalize around their center of mass. MNIST provides a training set of 60,000 pictures,
and a testing set of an additional 10,000 pictures. There is no overlap of authors between
the two sets, to ensure the handwriting in each is unique. MNIST is frequently referred
to as a toy problem [29, 30] due to the ease of solving it compared to other datasets with
more classes, more samples, and larger inputs. Because of this, performance on MNIST is
well documented and it is often used as a ‘Hello World’ for ML and a test case for new
algorithms.
Figure 2.1: Sample digits from the MNIST dataset. The training set includes 6,000 pictures from each category, each of which is 28x28 grayscale pixels.
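As background, the published MNIST files use the simple IDX binary layout: a big-endian header (magic number, image count, rows, columns as 32-bit integers) followed by one unsigned byte per pixel. The following is a minimal C++ reading sketch; the IDX layout comes from the public MNIST distribution, not from this thesis, and the function names are illustrative.

// Sketch: loading an MNIST idx3 image file (big-endian header, then one
// unsigned byte per pixel in 0-255). Illustrative only.
#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

static uint32_t readBigEndian32(std::ifstream &in) {
    unsigned char b[4];
    in.read(reinterpret_cast<char *>(b), 4);
    return (uint32_t(b[0]) << 24) | (uint32_t(b[1]) << 16) |
           (uint32_t(b[2]) << 8) | uint32_t(b[3]);
}

std::vector<std::vector<uint8_t>> loadMnistImages(const std::string &path) {
    std::ifstream in(path, std::ios::binary);
    if (!in) throw std::runtime_error("cannot open " + path);
    if (readBigEndian32(in) != 2051)               // idx3 magic number
        throw std::runtime_error("not an MNIST image file");
    uint32_t count = readBigEndian32(in);
    uint32_t rows = readBigEndian32(in);           // 28 for MNIST
    uint32_t cols = readBigEndian32(in);           // 28 for MNIST
    std::vector<std::vector<uint8_t>> images(count,
        std::vector<uint8_t>(rows * cols));
    for (auto &img : images)                       // one byte per pixel
        in.read(reinterpret_cast<char *>(img.data()), img.size());
    return images;
}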
2.2 CASIA
The Institute of Automation of Chinese Academy of Sciences (CASIA) offers several datasets
of handwritten Chinese characters, which were used for the ICDAR 2011 and 2013 competi-
tions [31]. The data used for this work was of offline handwriting, where inputs are pictures
of the handwritten characters, as opposed to online OCR which includes chronological in-
formation of how the character was drawn, offering information on the direction, speed, and
order of strokes the author used [32]. This set consists of 3755 classes (or characters) with
200 samples per class [33, 34]. Images are grayscale, and initially each image is a different
size. Similarly to the work in [14], images were preprocessed to be 48x48 pixels. Data is
encoded in one file, one image at a time, each prefixed by a 10 byte header, summing to a
constant 2314 bytes after the preprocessing takes place. The header contains the image’s
total size, its tag, and its width and height. This is followed by the image, each pixel taking
one byte. Training on the dataset as-is would have taken a significant amount of time.
Instead, a subset was created of the first 100 labels. The training set used included all samples
from the original training set of those 100 labels. Likewise, the new testing set used included
(a) Chinese character 61111 sample 88 (b) Chinese character 61111 sample 8197
Figure 2.2: Two images of the same Chinese character, before and after preprocessing to resize, center, and increase contrast.
all samples of those 100 labels from the original testing set.
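As a sketch of how one record of the file format described above might be parsed, the following assumes the 10-byte header is laid out as a 4-byte total size, a 2-byte tag, and 2-byte width and height fields (an assumption consistent with the stated 10-byte header and the constant 2314-byte record, since 10 + 48*48 = 2314), read on a little-endian host. All names are illustrative.

// Sketch: reading one preprocessed CASIA record. The field widths are an
// assumption that fits the 10-byte header described in the text.
#include <cstdint>
#include <fstream>
#include <vector>

struct CasiaSample {
    uint16_t tag;                 // character label code
    uint16_t width, height;       // 48x48 after preprocessing
    std::vector<uint8_t> pixels;  // one grayscale byte per pixel
};

bool readCasiaSample(std::ifstream &in, CasiaSample &out) {
    uint32_t totalSize;           // whole record size, header included
    if (!in.read(reinterpret_cast<char *>(&totalSize), 4)) return false;
    in.read(reinterpret_cast<char *>(&out.tag), 2);
    in.read(reinterpret_cast<char *>(&out.width), 2);
    in.read(reinterpret_cast<char *>(&out.height), 2);
    out.pixels.resize(size_t(out.width) * out.height);  // 2304 bytes here
    in.read(reinterpret_cast<char *>(out.pixels.data()), out.pixels.size());
    return bool(in);
}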
CASIA is a much more challenging dataset for a NN than MNIST. The images are larger, and
there are more classes, even with a subset of only 100 labels. Additionally, Chinese characters
have more strokes and features, and handwriting styles offer greater variability [14], including
due to changes in handwriting over time and frequency of characters [31]. CASIA also has
only a few hundred samples per class, compared to thousands in MNIST. For all of these
reasons, CASIA is a far more challenging dataset to train on than MNIST, and it is
used in this thesis to test the algorithms discussed on more difficult problems. On the other
hand, CASIA was chosen over other popular difficult problems to ensure that even with a
minimized dataset, debugging of the framework and training could be completed under the
set time constraints.
2.2.1 Preprocessing
The CASIA dataset was preprocessed once to create a modified dataset. This processing was
primarily performed in order to resize the pictures to a constant size of 48x48 pixels from
their original varying sizes. A further motive, however, was to alter the image contrast and
center the image around its center of gravity, to facilitate feature extraction by the neural
network later on. In order to resize images, their
center of gravity (COG) was computed, as well as their variance in the X and Y directions.
Images were then centered around their COG and a scaled bitmap was computed to stretch
the image to 48x48 pixels while maintaining previous proportions. New pixels (mostly created
by shifting the COG) were filled with the detected background color. Finally, the picture's
contrast was increased by finding the darkest pixel in the image and using it as a reference
point to scale all the pixels' darkness between 0 and 255. This preprocessing can be seen in
Fig. 2.2, which shows two different samples of the same classification. In each subfigure, the
image on the left is the original from the database, drawn with a black rectangle around its
margins and centered at the calculated COG, which is marked by a red dot. The images on the
right are the output of the postprocessing, which includes centering around the COG, stretching
proportionally according to the standard deviation to fit 4.5 sigmas into the frame, and
increasing the contrast of the image.
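A condensed sketch of the two simpler steps, computing the center of gravity and stretching the contrast so the darkest pixel maps to full darkness, might look as follows. The proportional 4.5-sigma rescaling is omitted for brevity, and the assumption that darker ink means a lower pixel value on a light background is mine, not stated in the text.

// Sketch: center-of-gravity and contrast-stretch steps of the CASIA
// preprocessing, under the assumptions noted above.
#include <cstdint>
#include <vector>

// Intensity-weighted center of gravity, weighting by "ink" (darkness).
void centerOfGravity(const std::vector<uint8_t> &img, int w, int h,
                     double &cx, double &cy) {
    double mass = 0.0; cx = cy = 0.0;
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            double ink = 255.0 - img[y * w + x];   // darkness as weight
            mass += ink; cx += ink * x; cy += ink * y;
        }
    if (mass > 0.0) { cx /= mass; cy /= mass; }
}

// Stretch contrast so the darkest pixel becomes 0 and background stays 255.
void stretchContrast(std::vector<uint8_t> &img) {
    uint8_t darkest = 255;
    for (uint8_t p : img) if (p < darkest) darkest = p;
    if (darkest == 255) return;                    // blank image, nothing to do
    for (uint8_t &p : img)
        p = uint8_t(255.0 * (p - darkest) / (255.0 - darkest));
}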
Chapter 3
NN Library
3.1 Motivation for Developing a New Library
One of the first steps in this research was to identify a machine learning framework that could
be modified to test the effectiveness of the algorithmic modifications developed. While several
frameworks were investigated, ultimately the decision was made to create a new framework
to facilitate this investigation. This section details the motivation for the creation of this
framework.
Early on, the problem statement was established: trying to expedite the training routine
of neural networks. The common approach to training involves iterating through the
training set, passing the training vectors cyclically through the network and backpropagating
to adjust the network until validation stops improving. The experiment of this thesis was to try to
accelerate this process with two methods: first, only backpropagate, or update the network,
based on training inputs that are misclassified, or classified with a low confidence; second,
try to detect which vectors the network is performing well enough on that they may be
skipped for a few iterations, and predict how long these vectors can be skipped. These ideas
are explained in much more detail, along with their achieved results, in Chapters 4-5. The
intention was to be able to train a neural network with these changes in place, and see their
effect on the training speed as well as on the peak performance achieved.
From prior experience with neural networks, as well as from literature, it was clear early on
that the computation required for training neural networks, especially with large datasets
and large networks, is significant. Additionally, this work had a time restriction of one
academic year. With this in mind, a state-of-the-art computer within budget was purchased
for this project, which included the most powerful ‘desktop’ GPU on the market at the time
of purchase. This computer is equipped with a 4.2GHz Intel Core i7-7700K processor, and
an NVIDIA GeForce GTX 1080 Ti GPU with 11GB GDDR5X SDRAM, 3584 CUDA cores,
and a memory bandwidth of 484GBps, with the intention to use an existing framework that
is optimized to take advantage of this GPU.
There are many existing deep learning open-source frameworks, with TensorFlow [35], Keras [36],
Caffe [37], Theano [38], and PyTorch [39] being among the most popular, though there are
other, lesser-known platforms as well [40]. As these frameworks were being studied
as potentials for this work, the algorithms for selective BP and FP were being refined, which
clarified the type of flexibility required of any framework that may be used. Specifically,
forward propagating and then sometimes deciding not to backpropagate, and a system in
place to decide whether or not to forward propagate. None of the frameworks that were
evaluated appeared to offer the option to make these decisions while using the network as a
black-box, which meant that the chosen framework’s source code would need to be modified
in a hopefully controlled and minimal fashion. Two libraries were studied more in depth
than others as this refining process was taking place: Caffe [37], and TinyDNN [41], though
others were also considered. The majority of the open source libraries are written in C++
at their core, though many also offer a Python API. When using an open-source framework,
GPU support was essential, since it can offer a theoretical speed increase of up to 10 times
over a pure CPU implementation [42, 43]. While there are two main manufacturers, NVIDIA
and AMD [19], NVIDIA has a wide lead in the market [44] and its CUDA architecture is
widely supported by many open-source frameworks, including Caffe. However, examining
GPU characteristics more closely revealed a major issue, which requires some background
understanding of GPUs.
3.1.1 Understanding GPUs
GPUs are able to offer such impressive results compared to CPUs thanks to their colossal
parallelization ability [42]. Modern GPUs have over 5,000 independent cores in some models;
the GTX 1080 Ti has 3,584. Each core is a RISC processor capable of running generic C/C++ code. The
NVIDIA GTX 1080 Ti GPU has GDDR5X SDRAM with a memory speed of 11 Gbps and
interface width of 352 bits, resulting in a 484 GBps memory bandwidth [45]. The CPU, for
comparison, has DDR4-3000 memory with a 64-bit interface, resulting in 24 GBps, meaning
the GPU’s memory speed is 20 times faster. The problem is that this is shared across the
3584 cores, as opposed to 8 cores for the CPU (4 physical cores with hyperthreading). The
bottom line is that the GPU’s limited bandwidth serves as its main bottleneck, both to
external memory and to the host. The GPU grid is made up of an array of blocks, each
able to run up to 1024 threads, with one Shared Memory (SM) for each block. The block
runs on a physical warp, which is composed of 32 cores. [46, 47]
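For reference, both quoted bandwidth figures follow directly from the stated memory speeds and interface widths:

$$484\ \mathrm{GBps} = \frac{11\ \mathrm{Gbps} \times 352\ \mathrm{bits}}{8\ \mathrm{bits/byte}}, \qquad 24\ \mathrm{GBps} = \frac{3000\ \mathrm{MT/s} \times 64\ \mathrm{bits}}{8\ \mathrm{bits/byte}}, \qquad 484/24 \approx 20.$$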
It has been shown that convolutional layers take the most time during training of neural
networks, both on CPU and GPU [48, 49], taking as much as 90% of the time spent on a
forward pass. While GPUs are capable of running generic C++ code on independent cores,
the way NN libraries have overcome the memory bandwidth bottleneck to achieve accelerated
learning on a GPU is by parallelizing the batch computations in a way that takes advantage of
the functionality most optimized for GPUs, which is matrix multiplication. General Matrix
Multiplication (GEMM) is part of the Basic Linear Algebra Subprograms (BLAS), which has
been optimized to run on NVIDIA's GPUs using the cuBLAS library.
A fully connected layer can be represented as a vector by matrix multiplication, where the
input to the layer is a $1 \times k$ vector, and the layer of n neurons is a $k \times n$ matrix of weights,
where each column holds one neuron's weights. The layer's output is then a single
$1 \times n$ vector [48]. A convolutional layer can be represented in the same manner, where in the
case of images there may be a 3D matrix as the input, and the weights of each kernel also
form a 3D matrix [48, 50]. In order for the forward propagation in this case to be represented
as matrix multiplication, both are turned into 2D matrices. In the input matrix, each row
represents the inputs of a single kernel, and in the weights matrix each kernel’s weights are
a single column. Since the stride in convolutional layers is typically less than the kernel size,
there is an overlap between the inputs, meaning that during this process there is a large
redundancy of memory. For example, in a 5x5 kernel with stride of 1, a single input point
would be represented in 25 different rows of the input matrix. However, the architecture
of GPUs requires some level of such a redundancy at any rate, due to shared memories
being unique to each block, and the redundancy in memory is outweighed by the time
reduction offered by the optimization of matrix multiplications. These unrolling conversions
from image to matrix format are completed using CUDA’s im2col function. Once in matrix
format, the forward passes can be completed in a much faster fashion using the optimized
cuBLAS library [43, 49]. To get peak performance from unrolling data to matrix format and using
matrix multiplication, libraries combine a mini-batch of images into one matrix, allowing the
whole batch to be computed in parallel one layer at a time. Using larger minibatches shows
increased speeds [51, 52, 53]. However, this method of completing the batch in parallel one
layer at a time does not align with the platform architecture that was wanted for testing
the algorithms discussed in this thesis, so the decision was made to test the algorithms on
a CPU platform, which allows for a simpler integration of these ideas. A study of such
existing frameworks found that most were no longer maintained and had documented and
undocumented bugs, or that their developers had switched to contributing to the popular
platforms instead of maintaining their lightweight frameworks. Ultimately, the decision was made
to develop a new CNN library to run on a CPU without parallelization, in order to assess the
effect of these two algorithms on training time and performance. This library is written in
C++ and designed to allow future integration on GPU platforms. It is discussed in more detail
in the remainder of this chapter.

Table 3.1: Library verification results showing the integration process for features and validation against another framework.

Train (%) | Validation (%) | Testing (%) | Description
98.45     | 98.45          | 98.24       | Ciresan's original code (double, 29x29, scaled hypertan)
98.52     | 98.52          | 98.43       | Ciresan's modified code (double, 29x29, scaled hypertan)
98.51     | 98.517         | 98.45       | Ciresan's modified code (float, 29x29, scaled hypertan)
98.51     | 98.503         | 98.5        | Ciresan's modified code (float, 28x28, scaled hypertan)
99.7      | 98.43          | 98.43       | ShiriNet (float, 28, relu, BP threshold 0.5, LR 0.01, momentum 0, adaptive LR 1.0)
99.89     | 99.55          | 98.41       | ShiriNet (float, 28, relu, BP threshold 0.5, LR 0.005, momentum 0, adaptive LR 1.0)
99.78     | 98.57          | 98.55       | ShiriNet (float, 28, relu, BP threshold 0.5, LR 0.005, momentum 0.9, adaptive LR 1.0)
99.89     | 98.33          | 98.39       | ShiriNet (float, 28, relu, BP threshold 0.5, LR 0.005, momentum 0, adaptive LR 0.999)
99.89     | 98.02          | 97.99       | ShiriNet (float, 28, htan, BP threshold 0.5, LR 0.01, momentum 0.9, adaptive LR 1.0)
99.73     | 98.45          | 98.39       | ShiriNet (float, 28, relu, BP threshold 1, LR 0.01, momentum 0.9, adaptive LR 1.0)
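To make the im2col unrolling discussed earlier in this section concrete, here is a minimal single-channel sketch. It illustrates the general technique only; the thesis library deliberately avoids this batched-GEMM structure, and the function name is illustrative.

// Sketch: im2col for a single-channel input, so that convolution becomes one
// matrix multiply (one row per kernel application, one column per weight).
#include <vector>

std::vector<float> im2col(const std::vector<float> &in, int H, int W,
                          int k, int stride) {
    int outH = (H - k) / stride + 1;
    int outW = (W - k) / stride + 1;
    std::vector<float> cols;                       // (outH*outW) x (k*k)
    cols.reserve(size_t(outH) * outW * k * k);
    for (int oy = 0; oy < outH; ++oy)
        for (int ox = 0; ox < outW; ++ox)          // one row per output pixel
            for (int ky = 0; ky < k; ++ky)
                for (int kx = 0; kx < k; ++kx)     // note the redundancy: with
                    cols.push_back(                // stride < k, a pixel lands
                        in[(oy * stride + ky) * W  // in many rows
                           + (ox * stride + kx)]);
    return cols;
}
// The convolution output is then this (outH*outW x k*k) matrix times the
// kernel weight matrix (k*k x number of maps), computed with an optimized
// GEMM such as cuBLAS.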
3.2 Library Qualification
In an effort to verify that the library works as expected, its performance on MNIST was compared
to documented performance. Additionally, each implemented feature was tested separately,
and tests with multiple features enabled were approached incrementally to verify stability.
Features that are binary were tested for both modes, and variable parameters (e.g., learning
rate, momentum, etc.) were tested via scan tests with ranging values. MLP performance was
tested in comparison to performance documented in [26]. Behavior on CNNs was compared
to performance of open source code published by Ciresan [54], which was gradually modified
to match the library by using floats instead of doubles and an input of 28x28 instead of
29x29. These tests are documented in Table 3.1.
3.3 Library API
The code is designed to only have two public elements: an ImageIn object (for CPU or
GPU in the future) and a NeuralNet object. All other classes are private and need not be
interfaced with by the end user.
3.3.1 NeuralNet
The NeuralNet class offers the following public interface:
class NeuralNet {
public:
    NeuralNet(ImageIn *_imageIn);
    NeuralNet(ImageIn *_imageIn, std::string configString);
    ~NeuralNet();
    int buildNet(std::string configString);
    int train(endOfEpochCallback statisticsCallback);
    void classify(const float *inputs, int &classification, float &confidence);
    int testAnalysis(float *errorMatrix);
};
where calling the second constructor is equivalent to calling the first followed by buildNet.
The train method takes as input an endOfEpochCallback function pointer, defined as:
typedef void (endOfEpochCallback)(unsigned int epoch, float totalError,
    float stepSize, float trainingPercentage, float saturationPercentage,
    float limitPercentage, float validationPercentage,
    float testingPercentage, int numOfBackprops, int numOfForwardprops,
    float epochDur, std::string logFileName);
This endOfEpochCallback function will be called at the end of every epoch, passing
parameters to the user regarding the training process. The user can then log these to
a file, print to the screen, or perform any other analysis and/or logging of their choosing.
The classify method can be used to perform inference on a single vector input.
The testAnalysis method, when called, iterates through the testing set and fills an error
matrix indicating how many vectors of each classification were misclassified, and what the
network classified them as.
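A minimal usage sketch of this interface might look as follows. It assumes the library's public header, a hypothetical CpuImageIn subclass of ImageIn, and an abridged configuration string in the format of Section 3.12; the callback simply prints a few of the reported statistics.

// Sketch of driving the public API. CpuImageIn and the header name are
// hypothetical; only the documented NeuralNet interface is used.
#include <cstdio>
#include <string>
// #include "NeuralNet.h"   // the library's public header (name assumed)

void printStats(unsigned int epoch, float totalError, float stepSize,
                float trainingPercentage, float saturationPercentage,
                float limitPercentage, float validationPercentage,
                float testingPercentage, int numOfBackprops,
                int numOfForwardprops, float epochDur,
                std::string logFileName) {
    std::printf("epoch %u: validation %.2f%%, %d BPs, %.1fs\n",
                epoch, validationPercentage, numOfBackprops, epochDur);
}

int main() {
    CpuImageIn imageIn("train-images", "train-labels");  // hypothetical
    std::string config =                                 // abridged example
        "LEARNING_RATE 0.01\n"
        "BATCH_SIZE 10\n"
        "BP_THRESHOLD 0.9\n"
        "FULLY_CONNECTED {\n"
        "LAYER_SIZE 10\n"
        "ACTIVATION_FUNCTION softmax\n"
        "}\n";
    NeuralNet net(&imageIn, config);   // builds the net from the config
    net.train(printStats);             // trains until the stop criterion

    float input[28 * 28] = {0};        // one vector to classify
    int label; float confidence;
    net.classify(input, label, confidence);
    return 0;
}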
3.3.2 ImageIn
ImageIn is a virtual class, designed to be inherited by implementations for different plat-
forms, including CPU and GPU. The main application passes a reference of its ImageIn to
the NeuralNet via its constructor (as seen in the documentation above), which then directly
interacts with it to get input vectors during training. ImageIn handles shuffling of the
training vectors, as well as input normalization and augmentation. Input normalization includes
the following settings:
enum scalingMode_t { SCALING_NONE, SCALING_DEFAULT, SCALING_INDIVIDUAL, SCALING_GLOBAL };
SCALING_NONE, of course, provides the NeuralNet with the input as is, without performing
any additional normalization on it. SCALING_DEFAULT translates data from being between
0 and 255 to between -0.5 and 0.5. With SCALING_INDIVIDUAL, the training set is scanned
to find the mean and standard deviation of each feature across the set, which are used to
normalize the samples individually. This may be useful for datasets such as the NSL-KDD
dataset [55], where some features are continuous and some discrete, with no predefined range
that can be used for normalization; it is therefore not as applicable to pictures. Finally,
SCALING_GLOBAL scans the entire training set to find the global mean and standard deviation,
which are then used to normalize inputs.
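A sketch of the SCALING_GLOBAL computation, under the straightforward interpretation of the description above, would be:

// Sketch: SCALING_GLOBAL-style normalization. One mean and standard
// deviation are computed over every value in the training set, then
// applied to each input. Interpretation of the description above.
#include <cmath>
#include <vector>

void globalNormalize(std::vector<std::vector<float>> &trainingSet) {
    double sum = 0.0, sumSq = 0.0, n = 0.0;
    for (const auto &vec : trainingSet)
        for (float v : vec) { sum += v; sumSq += double(v) * v; n += 1.0; }
    double mean = sum / n;
    double stdDev = std::sqrt(sumSq / n - mean * mean);
    if (stdDev == 0.0) stdDev = 1.0;               // guard against flat data
    for (auto &vec : trainingSet)
        for (float &v : vec) v = float((v - mean) / stdDev);
}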
3.4 MLP
An Artificial Neural Network (ANN) is composed of a collection of objects called neurons. In
a Deep Neural Network (DNN), many neurons are connected in parallel in a single layer, and
the output of each layer is used as the input to another layer [56]. In Multilayer Perceptron
(MLP) Neural Networks (NNs) specifically, all the outputs of a given layer are connected as
inputs to each of the neurons in the next layer (Fig. 3.1), and each such connection has its
own weight. Each neuron acts as a multiplier and adder: it takes several inputs, multiplies
each by a corresponding weight, and adds those products plus a bias $b_j$:
$net_j := \sum_i x_i w_{ij} + b_j$.
This sum is then typically passed through a non-linear activation function, so the output
of the neuron is $O_j = \varphi(net_j)$ (Fig. 3.2).
Figure 3.1: Sample multilayer perceptron neural network with four inputs, one hidden layer with six neurons, and three outputs.

Figure 3.2: A neural network neuron j calculates its output based on its inputs, weights, bias, and activation function.
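In code, a single neuron's forward computation reduces to a few lines; here is a minimal sketch (using ReLU as the activation φ, one of the options the library supports), with illustrative names:

// Sketch: one neuron's forward pass, net_j = sum_i x_i * w_ij + b_j,
// followed by an activation function (ReLU here as an example).
#include <algorithm>
#include <vector>

float neuronOutput(const std::vector<float> &x,   // inputs x_i
                   const std::vector<float> &w,   // weights w_ij
                   float bias) {
    float net = bias;
    for (size_t i = 0; i < x.size(); ++i)
        net += x[i] * w[i];                        // accumulate x_i * w_ij
    return std::max(net, 0.0f);                    // O_j = phi(net_j), ReLU
}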
3.5 CNN
An MLP architecture’s full connectivity is a simple structure that is easy to expand or
reduce in size for varying applications. Its disadvantages, however, compared to the more
complex Convolutional Neural Network (CNN) are that it requires more weights and hence
more memory, and in cases of image processing as discussed in this paper, an MLP loses
the spatial information of its input, that is, any two dimensional information about relations
between neighboring pixels [57, 58]. CNNs, by contrast, are designed to require far fewer
weights by having neurons share weights.
In a CNN, a single layer is composed of several feature maps. Each feature map has a
specified kernel size, such as $n \times n$, which uses $n^2 \cdot m_{x-1}$ weights, where $m_{x-1}$ is the number
of feature maps in the previous layer. All these weights are multiplied by their inputs to
calculate the net, similarly to MLPs. Kernels may overlap with each other, sharing their
inputs, as they form the new feature maps. The key feature of CNNs, however, is that the
neurons, or kernels, in a given feature map share their weights. This allows for a feature
map to train to recognize a certain feature in the input, then scan for said feature anywhere
in the input without needing the redundancy of training separate weights for every neuron
to recognize that pattern. This enables CNNs to better handle spatial information, while
reducing the number of weights needed.
3.6 Pooling
A pooling layer is frequently added after a convolutional layer. Pooling layers offer a form
of nonlinear downsampling of their inputs by reducing a block of adjacent pixels into a
single datapoint. This allows for a smaller network, and helps detect features in approximate
subregions. There are several possible types of pooling, and this library offers both
max pooling and average pooling. In max pooling, a single output neuron is assigned the
maximum value of its inputs. Average pooling assigns each neuron's output the average of
its inputs. These methods are illustrated in Fig. 3.3. In this example, the input layer is 4 by
4, or 16 datapoints overall, and a pooling layer with a kernel size of 2 results in a reduction
by $1/2^n$, where n is the number of dimensions; this evaluates here to 1/4, leading to
an output with only 4 datapoints [59].

Figure 3.3: Two types of pooling, (a) max pooling and (b) average pooling, which may be added after a convolutional layer in order to downsample its input.
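A sketch of the max-pooling case, for a square input and a non-overlapping kernel of size k, is shown below; average pooling would replace the max with a mean over the same window. Names are illustrative.

// Sketch: non-overlapping max pooling over a square W x W input with a
// k x k kernel (e.g., W = 4, k = 2 reduces 16 datapoints to 4).
#include <algorithm>
#include <vector>

std::vector<float> maxPool(const std::vector<float> &in, int W, int k) {
    int outW = W / k;
    std::vector<float> out(size_t(outW) * outW);
    for (int oy = 0; oy < outW; ++oy)
        for (int ox = 0; ox < outW; ++ox) {
            float best = in[(oy * k) * W + (ox * k)];
            for (int dy = 0; dy < k; ++dy)         // scan the k x k window
                for (int dx = 0; dx < k; ++dx)
                    best = std::max(best,
                        in[(oy * k + dy) * W + (ox * k + dx)]);
            out[oy * outW + ox] = best;            // keep the window maximum
        }
    return out;
}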
3.7 Activation Functions
So that the network is not simply a weighted sum, neurons pass their calculated net
value through a nonlinear activation function to compute their outputs, as seen in Fig. 3.2.
There are several activation functions
referenced in literature. Traditionally, the popular activation functions included hyperbolic
tangent (a sigmoid-shaped curve between -1 and 1), defined as
$f(net) = \frac{2}{1 + e^{-2 \cdot net}} - 1$, and logistic
(a sigmoid-shaped curve between 0 and 1), defined as $f(net) = \frac{1}{1 + e^{-net}}$ [58, 60]. In recent years,
however, the Rectified Linear Unit (ReLU) has become the most popular activation function [30,
61], thanks to its proven performance as well as the minimal mathematical computation it
requires. ReLU is defined as $f(net) = \max(net, 0)$. These activation
functions are plotted in Fig. 3.4. Note that the activation function must be differentiable for
the backpropagation algorithm. In the case of ReLU, a piecewise function, the derivative of
each piece is used, regardless of the discontinuity.

Figure 3.4: Commonly used activation functions. In recent years, ReLU has become the most popular.
Another type of activation function is the Softmax function. Softmax can be used after the
final layer of the network in order to normalize the outputs so that all the confidences add up
to 1. This functionality is also implemented in the library. Softmax is defined in more detail
in Section A.2, along with its Cross Entropy loss function. The section also discusses how
using Softmax as an activation function affects backpropagation and derives the appropriate
mathematical equations.
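The three activations above, and the derivatives used during backpropagation, each reduce to a line or two of code; a minimal sketch (function names are illustrative):

// Sketch: the activation functions above and the derivatives used during
// backpropagation. For ReLU, the derivative of each piece is used.
#include <cmath>

float hyperTan(float net)      { return 2.0f / (1.0f + std::exp(-2.0f * net)) - 1.0f; }
float hyperTanDeriv(float out) { return 1.0f - out * out; }          // in terms of the output

float logistic(float net)      { return 1.0f / (1.0f + std::exp(-net)); }
float logisticDeriv(float out) { return out * (1.0f - out); }        // in terms of the output

float relu(float net)          { return net > 0.0f ? net : 0.0f; }
float reluDeriv(float net)     { return net > 0.0f ? 1.0f : 0.0f; }  // derivative of each piece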
3.8 Training
Before the network can be used to classify inputs, it must first be trained. When the training
begins, the network initializes weights randomly within a given range such that the net for
each neuron will begin in the activation function’s active range. Then, the weights must be
tuned to provide better results. In supervised learning, this involves forward propagating
the training input vectors to get the network’s output. This output is then compared to
the target output in order to calculate an error using some loss function such as (A.1) or
(A.19). Then backpropagation is used to calculate each weight’s contribution to the error,
and gradient descent is used to adjust the weights in an effort to minimize said contribution.
In a simplistic implementation of batch mode, all the training vectors would be forward
propagated, their error would be computed, and the weights would be updated at the end of
an epoch, or an iteration over the entire set. However, this would result in training taking
far too long, and would not always yield the best results [62]. If the weights are updated
after every input vector (a method known as stochastic gradient descent), the weights are
updated much more frequently, but the gradient would be noisier. The library instead offers
a variable mini-batch size, which can be set to 1 for stochastic mode, or otherwise set to any
number of patterns, or input vectors, that should be used to get an average gradient for a
single update of the weights.
The math needed for the forward pass and backpropagation steps of training a neural network
is derived in Appendix A.
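Schematically, one epoch of the mini-batch procedure described above looks like the following sketch, where forwardPass, backpropagate, and applyGradients are placeholders standing in for the library internals rather than its actual functions:

// Sketch of one epoch of mini-batch training; batchSize = 1 gives
// stochastic gradient descent.
#include <vector>

struct Sample { std::vector<float> input; int target; };

// Placeholder declarations standing in for the library internals.
std::vector<float> forwardPass(const std::vector<float> &input);
void backpropagate(const std::vector<float> &output, int target);
void applyGradients();  // applies the averaged gradient, then clears it

void trainOneEpoch(std::vector<Sample> &trainingSet, int batchSize) {
    int inBatch = 0;
    for (Sample &s : trainingSet) {
        std::vector<float> output = forwardPass(s.input);
        backpropagate(output, s.target);   // accumulate dE/dw for this sample
        if (++inBatch == batchSize) {
            applyGradients();              // one weight update per mini-batch
            inBatch = 0;
        }
    }
    if (inBatch > 0) applyGradients();     // flush a final partial batch
}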
3.9 dE/dw
During the training process, a source of variability in common implementations of neural
networks is the rate at which weights are changed. When using gradient descent, weights
are updated iteratively using the equation

$$w_{ij}(k + 1) := w_{ij}(k) - \eta \frac{\partial E}{\partial w_{ij}} \tag{A.5}$$

where $w_{ij}(k + 1)$ is the weight neuron j gives input i at iteration k + 1, $w_{ij}(k)$ is the weight
neuron j gives input i at iteration k, $\eta$ is the learning rate, and $\frac{\partial E}{\partial w_{ij}}$ is the partial derivative
of the error with respect to weight $w_{ij}$. In this equation, the learning rate $\eta$, depending
on the algorithm, may be constant, decrease exponentially or linearly, or be defined as a step
function. The weights and the partial derivatives $\frac{\partial E}{\partial w_{ij}}$ can be seen as multidimensional
vectors. In theory, the learning rate controls the magnitude of the step, while $\frac{\partial E}{\partial w_{ij}}$ provides
the direction for the change. However, the magnitude of $\eta \frac{\partial E}{\partial w_{ij}}$ is not constant unless
$\frac{\partial E}{\partial w_{ij}}$ is normalized, since its magnitude is a function of the total error E. The result of not
normalizing $\frac{\partial E}{\partial w_{ij}}$ is that a larger error results in a larger magnitude of $\frac{\partial E}{\partial w_{ij}}$, and hence is
equivalent to a larger step size, while a smaller error results in a smaller step size [63]. This
vanishing gradient, while not incorrect behavior, is difficult to account for when designing
more complex functions for the learning rate. The library implemented offers an option to
normalize the $\frac{\partial E}{\partial w_{ij}}$ vector prior to updating the weights, thus eliminating an
unaccounted-for source of variability. This essentially turns the equation above into

$$\overrightarrow{w}(k + 1) := \overrightarrow{w}(k) - \eta \, \frac{\overrightarrow{\partial E / \partial w}}{\left\lVert \overrightarrow{\partial E / \partial w} \right\rVert}$$

In doing so, the learning rate entirely controls the magnitude of the step, while $\frac{\partial E}{\partial w_{ij}}$
controls the direction. This allows for more complex learning rate functions to be implemented,
which may be a function of the error, but can also take into account other parameters, such as
confidence, epoch or batch number, etc.
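A sketch of the normalized update, treating all of the weights' partial derivatives as one flat vector g (names are illustrative):

// Sketch: gradient-normalized weight update. The partial derivatives are
// treated as one flat vector g, scaled to unit length so the learning
// rate eta alone controls the step magnitude.
#include <cmath>
#include <vector>

void normalizedUpdate(std::vector<float> &w,        // all weights, flattened
                      const std::vector<float> &g,  // dE/dw, same layout
                      float eta) {
    double norm = 0.0;
    for (float gi : g) norm += double(gi) * gi;
    norm = std::sqrt(norm);
    if (norm == 0.0) return;                        // zero gradient: no update
    for (size_t i = 0; i < w.size(); ++i)
        w[i] -= eta * float(g[i] / norm);           // step magnitude is exactly eta
}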
3.10 Validation, Testing, and Weights Logging
As a neural network is training, its weights are tuned from random values to values producing
the desired output more and more often. Since the training process is long and rigorous, and
the trained network needs to be reusable for quick inference in the final applications, there
is a need to be able to save the state of the network as it is training. Specifically, this means
the weights (and biases) must be stored in order to be reused for the network to be restored
to its previous state. Different libraries and implementations of neural networks use different
approaches to storing weights. Often, the weights are recorded on a periodic basis, such
as every 5000 mini-batches. Another method is to only record the state of the network when
the network’s performance improves, in order to attempt to capture the network at its best
performing state. Therefore, it is important to assess the performance of the network during
the training routine. However, simply looking at the error on the training vectors, whether
as the percentage classified correctly or as the error defined in (A.1) and (A.2), gives a
misleading measurement: those errors assess the network's performance on inputs it has already
been exposed to and trained to fit, and would therefore optimistically extrapolate its behavior
on such (easier) vectors to vectors it has not seen, on which performance would be worse. Instead, a separate
set of testing vectors is used, which the network has not been exposed to. This set offers a
more objective assessment of the network’s performance level.
The library developed also makes use of a third set, used for validation. Like the testing
set, the validation set is composed of vectors the network has not trained on, and therefore
offers an unbiased assessment of its performance. Overfitting occurs when the network has
been exposed to a finite set of training inputs which do not accurately portray the entirety
of the true classification. When this occurs, the training error may continue to decrease,
while the error on an independent set of inputs starts to increase. Since the testing set
is used to measure performance and should not bias the network’s training routine in any
Figure 3.5: Augmentation visualization tool, showing the original image on the left and the augmented version on the right.
way, the validation set is used to monitor for such cases. The library implemented saves
the network’s weights to a binary file in order to be used later for continued training or for
inference purposes. The training routine was designed to use the validation performance
as an indicator of when the network’s performance is increasing, and only save the weights
then, in order to ensure the best state of the network is the one saved for future use.
3.11 Augmentation
Overlearning, or overfitting, as described above, is a phenomenon that is likely to occur if the
training set is too small, or does not represent the true categories well. For example, when
training on MNIST, if most pictures of a 3 had the top right pixel turned on, the network
training may notice this pixel and give it a large weight, since it highly correlates with the
output being 3. At first, as the weights move from random, which gives approximately a 10%
accuracy, and converge towards better results, the training, validation, and testing would all
improve. However, a 3 does not really have a dot at the top right corner. Eventually, training
accuracy would continue improving as the network picks up on fine-detail patterns in the
training set that are not truly characteristic of the classes, while the validation and testing
sets, which do not share these finer details, would begin to deviate from the performance
of the training set. One solution to avoiding this result is to use data augmentation in order
to artificially expand the training set and expose the network to more possibilities. The
library adds support for this in the form of three optional transformations. The first is
translation, meaning slightly moving the picture randomly along the X and Y axes. This
allows the network to recognize the images even if they are slightly displaced. The second
transformation is rotation, which rotates the images randomly a few degrees clockwise or
counterclockwise. The final transformation is shearing the image, which slightly stretches
and compresses different aspects of it, exposing the network to different slants.
Since these transformations are not easy to compute and can introduce bugs, a visual utility
was developed with the Qt framework to demonstrate an input vector before and after its
transformation. The tool, seen in Fig. 3.5, allows the user to select image and label files, and
step through the different vectors or input an index to jump to using the number window
on the right and the up and down arrows. It displays the label of the figure in the window
at the top left, and using the right and left arrow boxes the user can seek the next or
previous vector with the same label, allowing the user to quickly browse images of the same
category and see how similar they look. Finally, three check-boxes are available on the right
to select which augmentations should be used. The image is then augmented and displayed
on the right, allowing the user to see the before and after input vectors.
3.12 Input File Parameters
The library is configured via the string passed to its public buildNet function. The main
application can read this string from a file or other input. The file format allows for comments
using the # sign. The parser is case insensitive and ignores white spaces. The expected
format is a key-value pair on every line, separated by tabs or spaces. For example, a line in the
file meant to control the learning rate for training the network may read:
LEARNING_RATE 0.01 # Configuring initial learning rate
Or configuring the file base name to be used for the weights and log can be done using:
File_name MNIST_run_1    # weights will be saved in weights_MNIST_run_1.bin and the output log in log_MNIST_run_1.csv
Certain parameters, however, require a more complex configuration than a key-value pair.
Configuring a convolutional layer, for example, requires specifying the number of maps,
the kernel size, the kernel stride, and the activation function. To do this, we use the word
convolutional as key, and braces are used to provide a block of key-value parameters as
convolutional’s value:
CONVOLUTIONAL {
NUMBER_OF_MAPS 5
KERNEL_SIZE 3
KERNEL_STRIDE 1
ACTIVATION_FUNCTION relu
}
The following are all the parameter options for the configuration file, with sample values:
LEARNING_RATE 0.01             # initial learning rate
BATCH_SIZE 10                  # batch size
MAX_ITERATIONS 2000000         # max number of epochs to train, 0 to disable (stop only when validation score deteriorates by MIN_ERR_DELTA)
MIN_ERR_DELTA 1.20             # current validation error / best validation error ratio to stop training at (to prevent further deterioration)
MOMENTUM_ALPHA 0.9             # momentum alpha, 0 to disable momentum and only use current dE/dw
ADAPTIVE_LEARNING_RATE 1.0     # adaptive learning rate (> 0, <= 1.0), 1 to disable adaptive learning rate
WEIGHT_DECAY 0.0               # weight decay coefficient, 0 to disable weight decay
BP_THRESHOLD 0.9               # backprop threshold: 0 to update weights whenever wrong, 1 to always update weights (i.e. disable this mechanism), 0 < x < 1 to update if confidence < x
MAX_FP_DELAY 1                 # max epochs to go without a forward propagation, 1 to disable this feature
AUTO_NORMALIZATION 2           # normalize input vector: 0 == default (-128, /256), 1 == individual index normalization, 2 == total energy scaling, 3 == off
DERIVATIVE_NORMALIZATION 0     # normalize dE/dw vectors: 0 == off (default), 1 == on
DATA_AUGMENTATION 1            # use library's data augmentation: 0 == off, 1 == on
WEIGHT_NORMALIZATION 0         # normalize weights: 0 == off, otherwise limit to positive value
DROP_OUT 0                     # dropout: 0 == off, 1 == on (only on fully connected layers; 0.5 drop probability for all neurons except the input layer, which uses 0.2)
FILE_NAME demo                 # base file name to use for log and weights
DATA_INPUT {                   # input data size
    DIM_X 28                   # can specify DIM_X, DIM_Y, and NUMBER_OF_MAPS; unspecified dimensions will default to 1
    DIM_Y 28
    NUMBER_OF_MAPS 1
}
CONVOLUTIONAL {                # convolutional layer: number of maps, kernel size, kernel stride, and activation function
    NUMBER_OF_MAPS 5
    KERNEL_SIZE 3
    KERNEL_STRIDE 1
    ACTIVATION_FUNCTION relu
}
POOLING {                      # pooling layer: kernel size and type of pooling
    KERNEL_SIZE 2
    POOL_TYPE max              # max or average, defaults to max pool
}
CONVOLUTIONAL {
    NUMBER_OF_MAPS 10
    KERNEL_SIZE 5
    KERNEL_STRIDE 1
    ACTIVATION_FUNCTION relu
}
POOLING {
    KERNEL_SIZE 3
    POOL_TYPE max              # max or average
}
FULLY_CONNECTED {              # fully connected layer: layer size and activation function (logistic, hypertan, relu, softmax)
    LAYER_SIZE 50
    ACTIVATION_FUNCTION relu
}
FULLY_CONNECTED {
    LAYER_SIZE 10
}
Chapter 4
Selective Backpropagation
4.1 Motivation
The training process for a neural network typically involves forward passing a batch of
inputs, calculating an error, and backpropagating to adjust weights based on this error.
The backpropagation process requires the same amount of computation regardless of how
small or large the error is. In some cases the potential benefit of backpropagating a certain
input may be minimal, and cost-benefit analysis of the potential improvement and the time
required to backpropagate may lead to the conclusion that backpropagating would not yield
a productive improvement compared to backpropagating other vectors. If a certain input
is classified correctly and the confidence of the network is high, meaning the error is small,
then completing all the computation for a backpropagation will cause little change in the
network, and likely offer only a minimal improvement. It would be more beneficial to spend
that training time backpropagating vectors the network performs poorly on.
This may especially play a role when training on datasets where some classes are far more
represented than others. Say a network was trained on a subset of MNIST with only 10
images of the digit 3, 200 images of an 8, and 100 images of every other digit. Then on the
few occasions the network gets to train on a 3, it modifies the weights to better recognize a
3. But if an 8 looks like a 3, and is far more common and therefore backpropagated on, then
at some point the potential benefit of further learning a certain 8 may come at the expense
of worsening performance on recognizing 3’s.
4.2 Procedure
If the network only backpropagates in cases where its performance is unsatisfactory, as de-
fined by certain criteria, then training time needed to arrive at a desired level of performance
can be reduced. In this work, the filter used for deciding whether or not to backpropagate
is stateless, relying only on the output of the forward propagation. A more complex system
could be designed, such as one tracking the overall performance on each class, previous de-
cisions made about a given vector, etc. However, the stateless system is simpler, requiring
less computation and less memory.
If the categorical classification is wrong, or the confidence is below a certain threshold, then the network backpropagates on the given input. Therefore, for the network to decide not to backpropagate following a forward propagation, the classification must have been correct and with a high confidence, as defined by the user. The confidence threshold is set in the configuration file using the BP_THRESHOLD flag, followed by a decimal between 0.0 and 1.0 inclusive. When this parameter is set to 1.0, the filter is effectively disabled and baseline behavior is achieved, since the network will always backpropagate, as confidence is always less than or equal to 1. If the parameter is set to 0, then the network backpropagates only when it was wrong, and never when it was right, since confidence is always greater than 0. For any value in between, the network backpropagates when the classification was wrong (i.e. the highest confidence was not on the correct output) or when the classification was correct but the confidence was lower than the threshold set by the user.
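In code, this filter reduces to a few lines. The sketch below is not the library's implementation; it assumes the forward pass yields per-class confidences (e.g., softmax outputs), and the names are hypothetical:

#include <algorithm>
#include <vector>

// Decide whether to backpropagate a vector, given the forward-pass output.
// confidences: per-class outputs (e.g., softmax), trueLabel: correct class,
// threshold: the BP_THRESHOLD value in [0, 1].
bool shouldBackprop(const std::vector<float>& confidences,
                    int trueLabel, float threshold) {
    auto top = std::max_element(confidences.begin(), confidences.end());
    int predicted = static_cast<int>(top - confidences.begin());
    if (predicted != trueLabel) return true;   // wrong: always backprop
    // Correct but not confident enough. At threshold 1.0 this always holds
    // (the baseline of always backpropagating); at 0 it never does.
    return *top <= threshold;
}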
The results discussed in this thesis, both for MNIST and CASIA, were collected using the same network architectures, displayed in Fig. 4.3 and Fig. 4.13 respectively, unless otherwise specified.
Figure 4.1: Number of Backpropagations vs. Epoch Duration (s) at 0.9 BP filter.
4.3 Results
The experiment was designed to investigate what effect backpropagating only under certain conditions, rather than for every training input in a given epoch, has on the training performance curve. One logical measure, therefore, would be the performance on the testing set as a function of training time. However, training time is nondeterministic and therefore may not be the most accurate measure of performance: the exact same test can be performed on the same system and take a different amount of time to execute. This duration can be influenced by the quantity and nature of other programs running on the system and sharing its resources, as well as by variables such as the operating system, its scheduler, and the host system's resources.
Figure 4.2: Epoch duration during training with BP 1.0 under different environment conditions.
Data discussed here was all collected on the same machine, equipped with a 4.2GHz Intel Core i7-7700K processor and an NVIDIA GeForce GTX 1080 Ti GPU with 11GB of GDDR5X SDRAM, 3584 CUDA cores, and a memory bandwidth of 484GB/s. Additionally, it has 16GB of DDR4-3000 RAM, a 2TB HDD operating at 5,400 RPM, and a 500GB SSD with a SATA 6Gbps interface, and runs Windows 10 Pro.
Graphs in Fig. 4.1 and Fig. 4.2 show several runs of the same network structure on the same data. For a given network structure, weights were always initialized to the same values, since the random generator used to produce initial weights was not seeded for the tests discussed in this thesis and therefore produced the same sequence in every run. In the tests for Fig. 4.1, a small convolutional neural network was trained on MNIST using the following configuration:
LEARNING_RATE 0.01
BATCH_SIZE 10
MAX_ITERATIONS 2000000
MIN_ERR_DELTA 1.20
MOMENTUM_ALPHA 0.9
ADAPTIVE_LEARNING_RATE 1.0   # adaptive learning rate disabled
WEIGHT_DECAY 0.0             # weight decay disabled
BP_THRESHOLD 0.9             # BP if wrong or confidence below 0.9
MAX_FP_DELAY 1               # always forward propagate
AUTO_NORMALIZATION 2         # total energy scaling
DERIVATIVE_NORMALIZATION 0   # no derivative normalization
DATA_AUGMENTATION 1          # data augmentation on
WEIGHT_NORMALIZATION 0       # weight normalization off
DROP_OUT 0                   # drop out off
FILE_NAME MNIST_SingleRunAffinity
DATA_INPUT {
    DIM_X 28
    DIM_Y 28
    NUMBER_OF_MAPS 1
}
CONVOLUTIONAL {
    NUMBER_OF_MAPS 5
    KERNEL_SIZE 3
    KERNEL_STRIDE 1
    ACTIVATION_FUNCTION relu
}
POOLING {
    KERNEL_SIZE 2
    POOL_TYPE max
}
CONVOLUTIONAL {
    NUMBER_OF_MAPS 10
    KERNEL_SIZE 5
    KERNEL_STRIDE 1
    ACTIVATION_FUNCTION relu
}
POOLING {
    KERNEL_SIZE 3
    POOL_TYPE max
}
FULLY_CONNECTED {
    LAYER_SIZE 50
    ACTIVATION_FUNCTION relu
}
FULLY_CONNECTED {
    LAYER_SIZE 10
    ACTIVATION_FUNCTION softmax
}
In Fig. 4.1, since BP_THRESHOLD is not set to 1, the number of BPs varies between epochs, allowing a comparison of the BPs completed and the training durations of different epochs. Since MAX_FP_DELAY is set to 1, every epoch forward propagates 54,000 times, eliminating any unwanted variability. Each epoch performs 54,000 forward propagations because the full training set is composed of 60,000 vectors, of which 10%, or 6,000 vectors, are set aside for validation, leaving 54,000 in the training set.
The only differences between the various runs were in the environment running the tests, including whether the program's affinity was set to run on all the cores (the default) or on a single CPU core (marked as 'affinity' in the key), and whether this run was the only one taking place (aside from background programs on the computer, marked as 'Single run') or other runs were executing simultaneously ('Multiple runs'). Other than these system condition changes, the tests themselves were of the same network structure with identical setup and inputs.
The output from all the runs is therefore identical, except for the training time. For each run, every epoch was captured as a data point, with the backpropagations performed as its X value and the epoch's training duration as its Y value. This graph shows that in different runs of the same network, while the BPs performed in a given epoch are deterministic, the duration is not, although there is a positive linear relationship between the two.
The initial epochs are the ones with the most BPs, starting at 54,000 as every image is backpropagated, and over time the number of BPs performed decreases. This means that chronologically, the first epochs are the ones plotted on the right of Fig. 4.1, and as training continued new epochs were plotted to the left. Noticeably, when the CPU was running only one run at a time, an epoch took approximately 40-45s, as opposed to 70s with multiple simultaneous runs. This decrease in efficiency can be attributed to the CPU being quad-core: instead of the program having a core entirely to itself, as in the single-run captures, four cores are being shared between 7 runs. The slowdown is not quite a factor of two, however, which is likely due to optimizations in the system, primarily Intel's hyper-threading in the i7-7700K CPU. There is also some variation visible between different runs where the platform was under similar workload, such as the three single runs. 'Single run' and 'Single run 2' had no known differences in their setup, yet 'Single run' shows noise when BPs are below 5,000 and has a slightly larger slope, meaning BPs took slightly longer. 'Single run with affinity' appears to take slightly less time than 'Single run 2', showing that setting the affinity can yield a slight improvement in this case, likely thanks to more cache hits. Some of the runs exhibit noise for epochs with few BPs. This is not consistent across all runs and is likely due to the OS scheduler. It was observed, for example, that printing to the console every so often during training decreased run time per epoch, supporting the hypothesis that the scheduler may prioritize certain types of routines over others in a manner that causes this noise in the durations of short epochs. This noise affects the R² values of the plots, as one may expect. For example, 'Single Run' has much more noise up to 5,600 BPs, and because of this its R² is only 0.727, compared to 'Single Run 2', an identical run that did not show such noise for low BP counts and has an R² value of 0.997, much closer to its fitted regression line.
In Fig. 4.2, the only change made from Fig. 4.1 was setting BP_THRESHOLD to 1 instead of 0.9, meaning the network always backpropagates. In this scenario, since the network always computes the same number of forward propagations and backpropagations, there is no variation in the amount of computation performed per epoch, and therefore the runtime of the different epochs should be constant. This is the observed behavior in each run, as seen in the graph. However, running multiple runs simultaneously does increase the amount of time needed to complete a single epoch, nearly doubling the epoch duration from around 41s to about 75s. Additionally, with multiple runs happening in the background, the epoch duration is not nearly as constant, lasting as long as 88s, while the single run shows far less variation in epoch durations. The plot of multiple simultaneous runs has a standard deviation of 2.125s and an R² value of 5.883 · 10⁻⁵, while the single run has a standard deviation of 0.321s and an R² value of 0.211.
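The thesis does not show how these regression statistics were computed; a standard ordinary least-squares fit over the (BPs, duration) points, as sketched below, yields the kind of R² values quoted here:

#include <cstddef>
#include <vector>

// Ordinary least-squares fit y = a*x + b over paired samples, returning R^2.
// Here x would be the BPs performed in an epoch and y that epoch's duration (s).
double rSquared(const std::vector<double>& x, const std::vector<double>& y) {
    const std::size_t n = x.size();
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t i = 0; i < n; ++i) {
        sx += x[i]; sy += y[i]; sxx += x[i] * x[i]; sxy += x[i] * y[i];
    }
    const double a = (n * sxy - sx * sy) / (n * sxx - sx * sx);  // slope
    const double b = (sy - a * sx) / n;                          // intercept
    double ssRes = 0, ssTot = 0;
    const double mean = sy / n;
    for (std::size_t i = 0; i < n; ++i) {
        const double r = y[i] - (a * x[i] + b);  // residual from the fit
        ssRes += r * r;
        ssTot += (y[i] - mean) * (y[i] - mean);
    }
    return 1.0 - ssRes / ssTot;  // R^2: 1.0 == perfect linear fit
}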
Due to these observations, it can be noted that there is a high correlation between epoch
duration and BPs performed, and while measurements of epoch duration are nondetermin-
istic, a count of BPs performed in a given epoch is deterministic and therefore serves as a
better measurement of the effectiveness of the experiments described here.
Figure 4.3: Architecture of the CNN used to classify MNIST.
4.3.1 MNIST
The results discussed were obtained by training the network illustrated in Fig. 4.3.
Fig. 4.4 shows the results of training this neural network on MNIST with different selective BP threshold values. The learning rate used for these runs was 0.1, the batch size 10, and the momentum 0.9. Inputs were normalized and augmented as described in Ch. 3. Fig. 4.4 plots the testing accuracy from these runs as a function of epochs passed. The various curves show no significant variation between them: the baseline plateaus at 98.75%, the others land around it between 98.6% and 99.0% with oscillations of around 0.1%, and the curves overlap each other. This shows that ignoring certain inputs does not cause an overall decrease in performance. Additionally, none of the curves show signs of overlearning, which might have been a concern given the reduced exposure of the network to inputs; this may be attributed to the augmentation of inputs. If overlearning were occurring, the testing performance would eventually begin to decrease after reaching its peak, but instead it appears to plateau.
Figure 4.4: MNIST testing accuracy with different BP thresholds. When plotting accuracy as a function of epochs passed, the curves show no significant variation between them.
Fig. 4.5 shows the same runs as Fig. 4.4, plotted as a function of BPs performed rather than epochs. As discussed earlier, this is a good indication of training time. The graph
shows that when the selective BP threshold is set to 1.0, i.e. always backpropagate, which
is the baseline for these experiments, the curve takes longer to achieve the same testing
classification accuracy as when the BP threshold is reduced. After approximately 15M BPs,
this curve catches up to the others’ performance, but this is long after they plateau, and
the 1.0 BP threshold does not exceed the other curves’ performance. The initial part of
the graph is zoomed in on in Fig. 4.6, where it is clear that all the curves plateau after 1M
to 2.5M backpropagations, at a performance that the baseline begins to reach around 15M.
In these runs, reducing the BP threshold from 1.0 to a value between 0.4 and 0.9 reduces the BPs performed to 6.67%-16.67% of the baseline. In these figures, a data point was sampled at the end of every epoch. In runs where the BP threshold is not set to 1.0, epochs are composed of fewer BPs, so completing a given number of BPs takes more epochs and thus contains more sample points. This is why graphs with BP ≠ 1.0 appear thicker.
Figure 4.5: MNIST testing accuracy with different BP thresholds. BP 1.0, the baseline of always backpropagating, takes longer (more BPs) to achieve the same performance as the other curves on the graph. It catches up, but does not show better performance in the long run than runs with a lower BP threshold.
Figure 4.6: Zooming in on the initial relevant section of MNIST testing accuracy with different BP thresholds. BP 1.0, the baseline of always backpropagating, initially significantly underperforms runs that selectively backpropagate, and takes time to catch up to them after they plateau.
Figure 4.7: Plotting both the performance on the testing set and the number of BPs performed in each epoch shows the rapid decrease in BPs performed, dropping below 10,000 after only 7 epochs from the initial 54,000, which would remain constant for a baseline test. By that point, performance reaches 96.04% on the testing set. It goes on to pass 98.7%, with the majority of that time spent completing fewer than 5,000 BPs per epoch.
Fig. 4.7 shows the number of backpropagations computed when the threshold is set to 0.7,
along with the performance on the testing set, both as a function of epochs completed. This graph shows the immediate and rapid decrease in BPs performed during training. For a baseline experiment, the BPs curve would be constant at 54,000. Instead, the 0.7 BP threshold decreases in BPs immediately, dropping below 10,000 BPs/epoch after only 7 epochs, when accuracy is at 96.04%, and continuing to drop, ultimately reaching below 500 BPs/epoch. The accuracy achieved on the testing set continues to improve as fewer BPs are completed, reaching 99% for the first time by epoch 355, at which point only around 1,300 BPs are completed per epoch.
Figure 4.8: Testing accuracy over BPs on disproportional MNIST dataset.
Disproportional Dataset
One of the advantages of employing this technique is that it essentially closes the feedback loop on which vectors should be reinforced and which the network is already good at classifying. When an effort is put into making sure datasets represent classes equally, this is less necessary, but that can be hard to do under real-world conditions. For example, when trying to create a dataset to help predict rare events such as earthquakes, medical emergencies, or cyber attacks, data for when the event is occurring may be much more difficult to collect in large quantities than data for when the event is not happening. In these scenarios, imbalanced datasets may be generated. The most common methods to train on such imbalanced datasets include down-sizing, or under-sampling, the more common classes, or alternatively over-sampling the underrepresented classes [64, 65, 66, 67]. This behavior may also be valuable in applications where the network is being trained on live data as it is being collected, rather than on a dataset composed offline, since in such scenarios it is not possible to ensure equal representation of the different classes.
MNIST is composed of 60,000 training images, 6,000 for each class, and a testing set of 10,000 images, 1,000 for each label. To test less ideal conditions, a subset of the MNIST training set was created with an unequal distribution of labels: 100% of images with labels 0-8 were kept, but only 1% of images labeled 9 were randomly selected for the training set. The testing set was not altered, so as to measure the true performance of the network. The network structure discussed earlier was then trained with this new dataset. The learning curves with and without the selective BP algorithm are plotted in Fig. 4.8, which shows BP 0.7 first passing 96% accuracy on the unbiased testing set after 372k BPs, compared to BP 1.0, which takes 3.6M BPs, or 9.68 times as many (an 89.67% improvement), before they plateau to similar final results.
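The exact selection procedure is not listed in the thesis; a minimal sketch of constructing such a subset, keeping every image labeled 0-8 and a random 1% of the 9's (the names here are assumed), might be:

#include <random>
#include <vector>

struct Sample {
    std::vector<float> pixels;
    int label;
};

// Keep all samples except those of class `rareLabel`, of which only
// `keepFraction` (e.g., 0.01 for 1%) are randomly retained.
std::vector<Sample> makeImbalanced(const std::vector<Sample>& data,
                                   int rareLabel, double keepFraction,
                                   std::mt19937& rng) {
    std::bernoulli_distribution keep(keepFraction);
    std::vector<Sample> out;
    for (const auto& s : data)
        if (s.label != rareLabel || keep(rng))
            out.push_back(s);
    return out;
}

// Usage: makeImbalanced(mnistTrain, /*rareLabel=*/9, /*keepFraction=*/0.01, rng);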
The overall improvement in the time it takes to train the network does not paint the full picture, however. Since the training set was biased to include far fewer pictures of 9's than of any other class, the distribution of errors between the classes offers valuable insight. Fig. 4.9 shows the error matrix early in the training process for each of the two runs. The Y axis marks the true class, the X axis marks the neural network's classification, and the Z axis marks the number of vectors from the testing set that were classified incorrectly. On the diagonal, where the X value equals the Y value, the errors are always 0, of course, since there the NN classification is the true classification and therefore cannot be an error. The two subfigures show a similar distribution of errors, and it is clear that the overwhelming majority of errors in both tests are in class 9, which is to be expected as the network had far fewer examples of 9's to study. The most common error by far is classifying a picture of a 9 as a 4, an understandable mistake given their similar shape; the second most common is classifying a 9 as a 7, another error that makes sense when the 7 is written with a horizontal line crossing its center. Fig. 4.10 shows the sum of errors in each classification for the two runs, and clearly shows, similarly to Fig. 4.9, that BP 0.7 made fewer than half as many mistakes as BP 1.0. This agrees with Fig. 4.8 in showing that using the BP threshold results in training converging much faster than when backpropagating every input.
(a) Disproportional MNIST error matrix early in the training process, with 0.7 BP threshold.
(b) Disproportional MNIST error matrix early in the training process, with 1.0 BP threshold.
Figure 4.9: Error matrices early in the training process, training a neural network on a disproportional MNIST dataset.
Figure 4.10: Histogram of total misses per class early in the training process.
As Fig. 4.8 shows, however, if the network is given enough training time, the two methods converge to similar performance. Fig. 4.11 shows a similar error matrix evaluated after the results plateau, and shows that errors still came overwhelmingly from misclassified 9's. While BP 0.7 seems to have improved at differentiating 4's and 9's, reducing that type of error by more than half, it made little improvement in telling apart an 8 from a 9, and BP 1.0's primary mistake is still classifying 9's as 4's; overall, the two networks' performance is equal, as is also seen in Fig. 4.12.
(a) Disproportional MNIST error matrix at the end of the training process, with 0.7 BP threshold.
(b) Disproportional MNIST error matrix at the end of the training process, with 1.0 BP threshold.
Figure 4.11: Error matrices at the end of the training process, training a neural network on a disproportional MNIST dataset.
Figure 4.12: Histogram of total misses per class at the end of the training process.
While this summation does show BP 0.7
slightly outperforming BP 1.0 in classifying 9's, this difference is likely negligible and could probably be reversed, or at least further reduced, if training continued for another couple of epochs. The fact that the BP 1.0 network was still able to converge at over 96% accuracy, compared to around 98.8% achieved when training on the regular set, is a rather impressive feat that speaks to the ability of convolutional networks. It is possible that if this system were pushed further to its limits, with a smaller network with fewer weights and/or a more extreme dataset, not only would the difference in speed to reach peak performance widen between the two, but the non-1.0 BP threshold would potentially outperform the 1.0 BP threshold. Additionally, it is likely that with more variety in the augmentation, either as additional forms of augmentation or a larger range of parameters (e.g., larger rotations), the peak performance achieved could be even higher than that achieved here.
Figure 4.13: Architecture of the CNN used to classify CASIA.
Figure 4.14: CASIA testing accuracy with different BP thresholds. When plotting accuracy as a function of epochs passed, the curves show no significant variation between them.
4.3.2 CASIA
The same tests described above were performed on the CASIA dataset described in Section 2.2, using the network architecture shown in Fig. 4.13. Similarly to the MNIST performance, Fig. 4.14 shows that results plotted over epochs exhibit no significant variation in performance. However, epochs composed of fewer BPs complete faster. Fig. 4.15 shows the performance of these runs on the testing set, plotted as a function of BPs performed instead of epochs. This graph shows the 1.0 BP threshold curve lagging behind the other curves, with the gap appearing to grow as training continues. Fig. 4.16 offers a closer look at the relevant section of this graph. It can be seen that while the curves with BP thresholds below 1.0 cross 90% accuracy after 1M BPs, the baseline curve does so only after 1.64M BPs. Additionally, while the 1.0 BP threshold curve only reaches 92% accuracy for the first time after 2.88M BPs, the other curves reach that performance after 1.1M to 1.3M BPs, showing a 54.86%-61.81% decrease in BPs performed to achieve that level of accuracy on the testing set. Furthermore, the trend lines appear to have an expanding gap between them, indicating that this margin would likely continue to grow with further BPs. This data agrees with the overall performance seen on MNIST in Section 4.3.1, showing that any BP threshold below 1.0 outperforms the baseline in training time needed to achieve a given level of accuracy. While the extent of the time improvement is a function of the threshold and differs across the two datasets, this shows that the algorithm reduces training time on small problems such as MNIST, with 28x28 inputs and only 10 outputs, as well as on bigger networks such as the one used here for CASIA, with 48x48 inputs and 100 outputs.
Figure 4.15: CASIA testing accuracy with different BP thresholds. BP 1.0, the baseline of always backpropagating, lags behind all other curves, taking at least 2.2 times as long to reach 92% testing accuracy.
Figure 4.16: Zooming in on the relevant section of CASIA testing accuracy with different BP thresholds. BP 1.0, the baseline of always backpropagating, significantly underperforms runs that selectively backpropagate, and shows no sign of catching up to the rate at which they improve.
Chapter 5
Selective Forward Propagation
5.1 Motivation
Chapter 4 discusses backpropagating only when the network is wrong or has low confidence. Doing so avoids computation that would alter the network only minimally, and can reduce the computation on a given input by approximately one half. However, there is still potential to further reduce training time. An argument could be made that when an input vector is forward propagated but not backpropagated, the state of the network remains unchanged and therefore the forward propagation is redundant. If there were a way to predict when an input vector will be classified correctly and not backpropagated, then those forward propagations could be avoided. This could reduce training time by a much larger margin, and the training time saved on those forward propagations can be better used on the more challenging inputs. The reason this can potentially reduce training time is as follows: normally, for N vectors in a training set, an epoch would cost

E_c = N · FP_c + N · BP_c    (5.1)
where E_c, FP_c, and BP_c represent the computation required for a single epoch, a forward propagation, and a backpropagation, respectively. When the BP threshold was reduced from 1.0 in Chapter 4, this became

E_c = N · FP_c + a · N · BP_c    (5.2)

for some a ≤ 1.0, where a represents the fraction of forward-propagated vectors that were then also backpropagated. Introducing the filtered FP, however, gives

E_c = b · N · FP_c + a · b · N · BP_c    (5.3)

for some a, b ≤ 1.0, where b represents the fraction of the entire set that was forward propagated, and therefore considered for backpropagation. The reduction offered by introducing b thus lowers both the number of BPs and the number of FPs performed, unlike before, when a only affected the number of BPs. Consider, for instance, a massive training set where only a few vectors are still misclassified. If the network always forward propagates and only selectively backpropagates, E_c approaches N · FP_c. If, however, the FP estimation works well, the cost may be reduced even further by also skipping FPs, since forward propagation would by then take up the majority of training time.
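For intuition, consider assumed illustrative values of a = 0.1 (10% of forward-propagated vectors are also backpropagated) and b = 0.3 (30% of the set is forward propagated), and take FP_c ≈ BP_c ≈ C:

\[
E_c = b N \, FP_c + a b N \, BP_c = 0.3\,NC + 0.03\,NC = 0.33\,NC,
\]

compared to $2NC$ for the baseline of Eq. 5.1 and $1.1\,NC$ for selective BP alone in Eq. 5.2, roughly a 6x and 3.3x reduction respectively.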
It is important to note that when a backpropagation is skipped, there is definitive proof that the network is well trained on the given vector, hence reason to believe that backpropagating it would not be worth the expected return on investment of training time and resources compared to other potential vectors. In the case of skipping a forward propagation, however, the decision is based solely on a prediction. Additionally, in certain scenarios, skipping backpropagation may offer an improvement in absolute performance compared to always backpropagating, such as when many inputs of the same class move the network weights away from the training completed by a much less frequent class. Therefore, the method discussed in Chapter 4 may improve both training time and, potentially, final performance on a test set. Avoiding forward propagating certain input vectors based on predictions offers no further potential improvement of classification performance over employing the BP filter alone, but does offer such improvement over the baseline of always forward propagating and backpropagating. It also offers a potentially greater reduction of training time than always forward propagating but selectively backpropagating, which was already shown to reduce training time over the baseline. These scenarios are summarized in Table 5.1.
Table 5.1: Analysis of BP and FP combinations, showing potential benefits in performance and time reduction.
Always BP, Always FP: the standard method (baseline).
Always BP, Selective FP: better potential time improvement; minimal potential peak-performance improvement.
Selective BP, Always FP: some potential time improvement; better potential peak-performance improvement (discussed in Chapter 4).
Selective BP, Selective FP: best potential time improvement; peak performance same as Selective BP with Always FP.
5.2 Procedure
Similarly to Chapter 4, the confidence from forward propagating a given vector is used here to determine whether future computation can be reduced. The model implemented here determines, for each training input, how many epochs can likely be skipped before the given input vector needs to be re-examined. This requires storing N values, where N is the number of training vectors. Initially, D_n, the delay for vector n, measured in epochs, is set to 0, so all vectors forward propagate at least once. Then, based on their performance, a delay can be calculated. Two variables control this. First, a maximum FP delay (in epochs) is defined as the maximum number of epochs that may pass before a vector is re-examined. Additionally, an FP threshold fraction is defined, similarly to the BP threshold: if the confidence is below the threshold, the delay is set so the vector forward propagates again during the next epoch, since performance on it is currently unsatisfactory. The same happens if the classification is incorrect, regardless of the confidence. If the classification is correct and the confidence is above the threshold, then the delay is calculated as

D_n = D_max · (C_n − T_fp) / (1 − T_fp)    (5.4)
where D_n is the new delay for vector n, D_max is the maximum FP delay measured in epochs, C_n is the confidence on vector n, which was correctly classified, and T_fp is the FP threshold fraction. After the first epoch, in which all training vectors were propagated, some vectors may have delays greater than 1. When training, if the network encounters such a vector, its delay is decremented by one and it is skipped. When the network encounters a vector whose delay is 1, it forward propagates it again, calculates a new delay, and backpropagates it if the appropriate conditions are met.
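A minimal sketch of this delay update per Eq. 5.4 follows; the function name is assumed, and maxDelay and fpThreshold correspond to MAX_FP_DELAY and the FP threshold fraction:

#include <algorithm>
#include <cmath>

// Compute the FP delay (in epochs) for a vector after a forward pass, per
// Eq. 5.4: D_n = D_max * (C_n - T_fp) / (1 - T_fp). A wrong classification
// or a confidence below T_fp yields a delay of 1, i.e. the vector is
// re-examined in the very next epoch.
int fpDelay(bool correct, double confidence, int maxDelay, double fpThreshold) {
    if (!correct || confidence < fpThreshold)
        return 1;
    double d = maxDelay * (confidence - fpThreshold) / (1.0 - fpThreshold);
    return std::max(1, static_cast<int>(std::lround(d)));
}

For example, with D_max = 15 and T_fp = 0.5, a correct classification at confidence 0.9 yields a delay of 15 · 0.4 / 0.5 = 12 epochs before re-examination.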
A baseline criterion for this setup, therefore, is setting D_max to 1, meaning every vector is re-examined at the next epoch, so no vector is ever skipped. When this is the case, T_fp becomes irrelevant. By defining some finite D_max, however, we ensure that every vector will eventually be forward propagated again and is not taken out of the training pool after being classified correctly once, since the weights continue to change and in the future the vector may no longer be classified correctly, or may yield a lower confidence.
A potential issue with this algorithm is its clash with data augmentation. Data augmentation creates new training vectors from existing ones in order to expand the variety of inputs the network is exposed to during training. However, when a new delay is calculated, it is based on the particular input that was propagated, not on other augmentations of it that may be produced in the future and result in different classifications or levels of confidence. This means that estimating a delay based only on the confidence on a single vector may be less effective in training routines with augmentation.
Another point to consider is that delays calculated in epochs represent how many epochs may pass before a vector is forward propagated again. However, depending on the size of the training set, the batch size, and how many input vectors result in a forward propagation and/or backpropagation during those epochs, the same number of epochs may correspond to a different number of weight updates in different applications. This could be addressed by calculating the delay not in epochs but in the number of weight updates that may be skipped before re-examining the vector.
When a vector is skipped during training, a forward propagation is avoided as well as a backpropagation, hence the increased value in not forward propagating. If the BP threshold is not set to 1.0 while the FP selection algorithm is enabled, then some vectors may be forward propagated and backpropagated, some may be avoided entirely, and some may be forward propagated but not backpropagated. The baseline in these experiments is always forward propagating and always backpropagating. Chapter 4 introduced and examined the scenario where all vectors are forward propagated and only some backpropagated. This chapter studies the scenario of forward propagating only some vectors but backpropagating all of those, in order to measure this algorithm separately from the work discussed in Ch. 4. The two algorithms can, however, be used jointly, as the sketch below illustrates.
Figure 5.1: Number of Forward Propagations vs. Epoch Duration (s) at FP threshold 0.5 and max delay 15, showing a positive linear relationship between FPs and epoch duration.
5.3 Results
Similarly to Fig. 4.1, which showed the relationship between BPs and epoch duration, Fig. 5.1 shows that the duration in seconds of each epoch of a training routine corresponds linearly to the forward propagations performed in that epoch, with an R² value of 0.993. Following the same reasoning as before, data plotted in this chapter is also shown as a function of propagations completed, rather than time.
5.3.1 MNIST
Fig. 5.2 shows the testing accuracy curves for different FP max delays while training on the MNIST dataset. An FP max delay of 1, in this case, means every vector is always forward propagated, and as such serves as the baseline. It can be seen that the plots plateau around 98.6% accuracy. Zooming in on the initial section of this plot in Fig. 5.3, the FP 1 curve is seen lagging behind all other curves. While those pass 98% testing accuracy after 451k-775k FPs and BPs, the FP 1 curve does so only after 1.664M FPs and BPs. The FP plots shown display a reduction of FPs (and therefore also BPs) of as much as 73%.
Figure 5.2: Testing accuracy over FPs on the standard MNIST dataset with different FP max delays. In all runs, the FP threshold is set to 0.5. Over time, all runs converge to approximately the same level of performance.
Figure 5.3: Zooming in on the initial relevant section of MNIST testing accuracy with different FP max delays, with the FP threshold set to 0.5. FP 1, the baseline of always forward propagating and always backpropagating, lags behind the other curves.
Figure 5.4: Training on different subsets of MNIST shows that the selective BP and FP algorithms result in increased generalization variability over the baseline.
Generalization
Fig. 5.4 shows the variation in generalization when using the selective BP and FP algorithms. The MNIST training set of 60,000 images was split into 10 subsets of 6,000 images each. Of those, 5,400 were used for training and 600 to objectively determine when to stop training. The network was then used to measure performance on the independent set of 10,000 testing images. This way, 10 data points were generated for each threshold value. The plot shows that the baseline, where the BP threshold is 1.0 and the FP D_max is 1 (i.e., always forward propagating and always backpropagating), has less variability than the other configurations, with a standard deviation of 0.184, while the others range from 0.198 (FP D_max 5) to 0.344 (FP D_max 3).
5.3.2 CASIA
The same experiment of varying the maximum FP delay was conducted on the CASIA subset described in Section 2.2. The results are plotted in Fig. 5.5, which shows the plots all plateauing around 92.5% accuracy. The key specifies both the FP threshold and the FP max delay, so FP0.8,5, for example, means the FP threshold is set to 0.8 and the FP max delay to 5. Fig. 5.6 shows a closeup of the critical part of this graph, where the baseline, FP 0.8,1, is seen lagging behind the other curves before they all plateau. Specifically, this curve passes 91% testing accuracy after 2.47M FPs, while the other curves do so between 1.47M and 2.0M FPs, showing a reduction of 19.0%-40.5% in FPs and BPs performed.
Figure 5.5: Testing accuracy over FPs on the CASIA subset dataset with various FP max delays, with the FP threshold set to 0.5. Over time, runs converge to approximately the same performance on the testing set.
Figure 5.6: Zooming in on the initial relevant section of Fig. 5.5. FP max delay of 1, the baseline, is seen lagging behind the other curves before they all plateau.
Chapter 6
Future Work
6.1 Future Work
The work discussed shows promise, but warrants further study of possible improvements, and there is a multitude of scenarios still to be examined.
The selective backpropagation and forward propagation algorithms are designed to accelerate the rate at which the network trains, and were tested on CNNs trained from random initial weights. These algorithms, however, can also be applied to other architectures that use backpropagation and forward propagation in their training, such as recurrent neural networks. They may also be beneficial for transfer learning. These scenarios have not been tested and warrant future work. The algorithms can also be tested on additional datasets, including problems outside image classification.
In the tests performed, the selective thresholds for FP and BP were kept constant. However, many elements and variables of neural network training are modified during the training process, including pruning and growing networks [60, 68, 69], altering the learning rate [70, 71], and using dropout [29] and weight decay [58, 72, 73]. Similarly, it may be possible to gain a higher level of testing accuracy by altering the BP and FP thresholds during training, for example by using the algorithms for the first 50 epochs, or until the learning curve appears to plateau, and then changing both thresholds to 1.0, thus reverting to 'normal' training.
More work can be done to further examine the improvement these algorithms offer. This includes further fine-tuning of adjustable variables (e.g. momentum, learning rate, weight decay), as well as applying these methods to additional datasets. Additionally, testing different-sized networks with these algorithms could yield interesting results. Bigger (and deeper) networks will likely achieve better results, but employing these algorithms may offer a way to train smaller networks to the same performance by essentially stress-testing how many weights are needed to reach a certain level of performance. This may be valuable, for example, in embedded applications with limited memory and computational resources. The work in this thesis tested each algorithm separately, so as to independently assess the improvement offered by each; however, they can also be combined for a further reduction of training time.
Other algorithms could be designed based on the principles discussed in this thesis. While the work here relied largely on the confidence of a classification to determine whether to backpropagate, or when to re-examine a given vector, other methods of making these decisions could be implemented, including gathering statistics on how well a given category is being classified, how long the network has been training, and other information available during the training routine. Creating a more robust estimation system using more parameters may also improve performance on augmented vectors, as discussed in Section 5.2. As explained there, measuring the FP delay in terms of weight updates rather than epochs may also improve the FP estimation, regardless of whether the data is augmented. In fact, the method that would yield the best results across a broad range of applications may be not to hand-design these BP and FP filters at all, but rather to train an ML algorithm, such as a small neural network, to look at all these parameters and decide whether to backpropagate, or how long to wait before forward propagating again.
Finally, an additional undertaking would be to implement these algorithms in a framework capable of running on a GPU. As discussed in Section 3.1, this was the initial intention for this work but proved more challenging than anticipated, so the algorithms were first implemented in a serial batch fashion to study their potential impact. Now that this work has been done, and it has been observed that these selective algorithms reduce the computation required in training without impacting peak performance levels, further work can go into integrating them into a GPU-based framework.
Chapter 7
Conclusions
7.1 Conclusions
Neural network training can take days or even weeks. This thesis discusses altering the training routine in order to reduce the time to train a NN, with some impact to the generalization variability. The modifications proposed are twofold. The first, discussed in Ch. 4, is to backpropagate on a given vector only if it was classified incorrectly or if the confidence was below a certain threshold, as opposed to the common method of backpropagating every vector. The reasoning behind this change is that the network already performs well on such a vector and would see a greater improvement from spending that training time backpropagating other vectors on which it does not perform as well. This offers a closed feedback loop that may especially facilitate training on datasets that do not have an equal representation of all classes, or where some classes prove harder to classify than others.
Table 7.1: Summary of time improvements achieved with selective BP and selective FP, including on the modified imbalanced MNIST dataset.
Test              | % Accuracy at comparison | Baseline propagations | Modified propagations | % Reduction in propagations
BP MNIST          | 98.8% | 15M    | 1-2.5M    | 83.3-93.3%
BP Modified MNIST | 96.0% | 3.6M   | 372k      | 89.67%
BP CASIA          | 92.0% | 2.88M  | 1.1-1.3M  | 54.86-61.81%
FP MNIST          | 98.0% | 1.664M | 451-775k  | 53.43-72.90%
FP CASIA          | 91%   | 2.47M  | 1.47-2.0M | 19.03-40.49%
The second algorithm
proposed is discussed in Ch. 5, and involves predicting when a certain vector should be forward propagated again. This idea stems from the fact that with the selective BP algorithm, when a vector skips a BP, it does not change the state of the network, and hence the FP that takes place in order to decide whether or not to BP spends training time without improving the NN. By predicting how many epochs will likely pass before the specific vector would be backpropagated again, the network can skip its forward propagations until that time, avoiding both the FP and BP time, which can be better spent on other inputs. Both algorithms were tested on the MNIST and CASIA datasets. The BP algorithm was also tested on a modified MNIST dataset in which not all classes are equally represented in the training set, and the CASIA dataset was used to create a subset of 100 labels so as to reduce data collection time. Results from all of these tests are analyzed in the appropriate chapters and collectively summarized in Table 7.1. The BP algorithm showed an 83.3-93.3% reduction in backpropagations completed to achieve a given level of accuracy on the classic MNIST dataset, an 89.67% reduction on the modified MNIST dataset, and a 54.86-61.81% reduction on the CASIA subset. The selective FP algorithm, in which every avoided FP also avoids a BP, showed a reduction in propagations completed of 53.43-72.90% on the MNIST dataset and 19.03-40.49% on CASIA.
Bibliography
[1] D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber, “Mitosis Detection
in Breast Cancer Histology Images using Deep Neural Networks,” Proc Medical Image
Computing and Computer-Assisted Intervention (MICCAI), pp. 411–418, 2013.
[2] A. Esteva, B. Kuprel, R. A. Novoa, J. Ko, S. M. Swetter, H. M. Blau, and S. Thrun,
“Dermatologist-level classification of skin cancer with deep neural networks,” Nature,
vol. 542, no. 7639, pp. 115–118, 2017.
[3] Waymo, “On the road to Fully Self-driving,” Waymo Safety Report, p. 43, 2017.
[4] “Tesla Autopilot.” https://www.tesla.com/autopilot, 2016.
[5] Y. Tian, K. Pei, S. Jana, and B. Ray, “DeepTest: Automated Testing of Deep-Neural-
Network-driven Autonomous Cars,” 2017.
[6] D. C. Ciresan, U. Meier, and J. Schmidhuber, "Transfer Learning for Latin and Chinese
Characters with Deep Neural Networks," in The 2012 International Joint Conference on Neural Networks (IJCNN), pp. 1–6, 2012.
[7] I. J. Goodfellow, Y. Bulatov, J. Ibarz, S. Arnoud, and V. Shet, “Multi-digit Number
Recognition from Street View Imagery using Deep Convolutional Neural Networks,”
pp. 1–13, 2013.
[8] A. Buczak and E. Guven, “A survey of data mining and machine learning methods
for cyber security intrusion detection,” IEEE Communications Surveys & Tutorials,
vol. PP, no. 99, p. 1, 2015.
[9] M. H. Bhuyan, D. K. Bhattacharyya, and J. K. Kalita, “Network Anomaly Detection:
Methods, Systems and Tools,” Communications Surveys & Tutorials, IEEE, vol. 16,
no. 1, pp. 303–336, 2014.
[10] Y. Taigman, M. Yang, and M. Ranzato, "Deepface: Closing the gap to human-level
performance in face verification," CVPR IEEE Conference, pp. 1701–1708, 2014.
[11] C. Clancy, J. Hecker, E. Stuntebeck, and T. O’Shea, “Applications of Machine Learning
to Cognitive Radio Networks,” IEEE Wireless Communications, vol. 14, no. 4, pp. 47–
52, 2007.
[12] T. Yucek and H. Arslan, "A Survey of Spectrum Sensing Algorithms for Cognitive
Radio Applications," Proceedings of the IEEE, vol. 97, no. 5, pp. 805–823, 2009.
[13] M. Bkassiny, Y. Li, and S. K. Jayaweera, “A survey on machine-learning techniques in
cognitive radios,” IEEE Communications Surveys and Tutorials, vol. 15, no. 3, pp. 1136–
1159, 2013.
[14] D. Ciresan, U. Meier, and J. Schmidhuber, “Multi-column deep neural networks for im-
age classification,” in Computer Vision and Pattern Recognition (CVPR), no. February,
pp. 3642–3649, 2012.
[15] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-
level performance on imagenet classification,” Proceedings of the IEEE International
Conference on Computer Vision, vol. 2015 Inter, pp. 1026–1034, 2015.
[16] D. C. Ciresan, U. Meier, L. M. Gambardella, and J. Schmidhuber, “Deep Big Simple
Neural Nets Excel on Handwritten Digit Recognition,” pp. 1–14, 2010.
[17] K. Ovtcharov, O. Ruwase, J.-y. Kim, J. Fowers, K. Strauss, and E. S. Chung, “Accel-
erating Deep Convolutional Neural Networks Using Specialized Hardware,” Microsoft
Research Whitepaper, pp. 3–6, 2015.
[18] A. Ling, D. Capalija, and G. Chiu, “Accelerating Deep Learning with the OpenCL
Platform and Intel Stratix 10 FPGAs,” tech. rep., Intel, 2015.
[19] Jon Peddie Research, "GPU Developments 2017," tech. rep., 2018.
[20] N. P. Jouppi, A. Borchers, R. Boyle, P.-l. Cantin, C. Chao, C. Clark, J. Coriell, M. Daley,
M. Dau, J. Dean, B. Gelb, C. Young, T. V. Ghaemmaghami, R. Gottipati, W. Gulland,
R. Hagmann, C. R. Ho, D. Hogberg, J. Hu, R. Hundt, D. Hurt, J. Ibarz, N. Patil, A. Jaf-
fey, A. Jaworski, A. Kaplan, H. Khaitan, D. Killebrew, A. Koch, N. Kumar, S. Lacy,
J. Laudon, J. Law, D. Patterson, D. Le, C. Leary, Z. Liu, K. Lucke, A. Lundin, G. MacK-
ean, A. Maggiore, M. Mahony, K. Miller, R. Nagarajan, G. Agrawal, R. Narayanaswami,
R. Ni, K. Nix, T. Norrie, M. Omernick, N. Penukonda, A. Phelps, J. Ross, M. Ross,
A. Salek, R. Bajwa, E. Samadiani, C. Severn, G. Sizikov, M. Snelham, J. Souter,
D. Steinberg, A. Swing, M. Tan, G. Thorson, B. Tian, S. Bates, H. Toma, E. Tuttle,
V. Vasudevan, R. Walter, W. Wang, E. Wilcox, D. H. Yoon, S. Bhatia, and N. Boden,
“In-Datacenter Performance Analysis of a Tensor Processing Unit,” ACM SIGARCH
Computer Architecture News, vol. 45, no. 2, pp. 1–12, 2017.
[21] A. P. Engelbrecht, “Sensitivity analysis for selective learning by feedforward neural
networks,” Fundamenta Informaticae, vol. 46, no. 3, pp. 219–252, 2001.
[22] M. T. Vakil-Baghmisheh and N. Pavesic, “Training RBF networks with selective back-
propagation,” Neurocomputing, vol. 62, no. 1-4, pp. 39–64, 2004.
[23] M. P. Craven, “A Faster Learning Neural Network Classifier Using Selective Back-
propagation,” Proceedings of the Fourth IEEE International Conference on Electronics,
Circuits and Systems, vol. 1, pp. 254–258, 1997.
[24] T. Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollar, "Focal Loss for Dense Object
Detection," Proceedings of the IEEE International Conference on Computer Vision,
vol. 2017-October, pp. 2999–3007, 2017.
[25] A. Shrivastava, A. Gupta, and R. Girshick, “Training Region-based Object Detectors
with Online Hard Example Mining,” 2016.
[26] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to
document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2323, 1998.
[27] Y. LeCun and C. Cortes, “MNIST handwritten digit database,” 2010.
[28] I. Witten, E. Frank, M. Hall, and C. Pal, “Data mining: Practical machine learning
tools and techniques,” 2016.
[29] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout:
A Simple Way to Prevent Neural Networks from Overfitting,” Journal of Machine
Learning Research, vol. 15, pp. 1929–1958, 2014.
[30] Q. V. Le, N. Jaitly, and G. E. Hinton, “A Simple Way to Initialize Recurrent Networks
of Rectified Linear Units,” pp. 1–9, 2015.
[31] C. L. Liu, F. Yin, D. H. Wang, and Q. F. Wang, “CASIA online and offline Chi-
nese handwriting databases,” Proceedings of the International Conference on Document
Analysis and Recognition, ICDAR, pp. 37–41, 2011.
[32] Dalbir and S. K. Singh, “Review of Online & Offline Character Recognition,” Interna-
tional Journal Of Engineering And Computer Science, vol. 4, no. 5, pp. 11729–11732,
2015.
[33] C. L. Liu, F. Yin, D. H. Wang, and Q. F. Wang, “Chinese handwriting recognition con-
test 2010,” 2010 Chinese Conference on Pattern Recognition, CCPR 2010 - Proceedings,
no. November, pp. 1100–1104, 2010.
[34] D. Ciresan and J. Schmidhuber, "Multi-Column Deep Neural Networks for Offline Hand-
written Chinese Character Classification," 2013.
[35] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado,
A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Is-
ard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mane, R. Monga,
S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Tal-
war, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viegas, O. Vinyals, P. Warden, M. Wat-
tenberg, M. Wicke, Y. Yu, and X. Zheng, “TensorFlow: Large-scale machine learning
on heterogeneous systems,” 2015.
[36] F. Chollet and others, “Keras.” https://keras.io, 2015.
[37] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama,
and T. Darrell, “Caffe: Convolutional Architecture for Fast Feature Embedding,” arXiv
preprint arXiv:1408.5093, 2014.
[38] The Theano Development Team, R. Al-Rfou, G. Alain, A. Almahairi, C. Angermueller,
D. Bahdanau, N. Ballas, F. Bastien, J. Bayer, A. Belikov, A. Belopolsky, Y. Ben-
gio, A. Bergeron, J. Bergstra, V. Bisson, J. B. Snyder, N. Bouchard, N. Boulanger-
Lewandowski, X. Bouthillier, A. de Brebisson, O. Breuleux, P.-L. Carrier, K. Cho,
J. Chorowski, P. Christiano, T. Cooijmans, M.-A. Cote, M. Cote, A. Courville, Y. N.
Dauphin, O. Delalleau, J. Demouth, G. Desjardins, S. Dieleman, L. Dinh, M. Ducoffe,
V. Dumoulin, S. E. Kahou, D. Erhan, Z. Fan, O. Firat, M. Germain, X. Glorot,
I. Goodfellow, M. Graham, C. Gulcehre, P. Hamel, I. Harlouchet, J.-P. Heng, B. Hidasi,
S. Honari, A. Jain, S. Jean, K. Jia, M. Korobov, V. Kulkarni, A. Lamb, P. Lamblin,
E. Larsen, C. Laurent, S. Lee, S. Lefrancois, S. Lemieux, N. Leonard, Z. Lin, J. A.
Livezey, C. Lorenz, J. Lowin, Q. Ma, P.-A. Manzagol, O. Mastropietro, R. T. McGib-
bon, R. Memisevic, B. van Merrienboer, V. Michalski, M. Mirza, A. Orlandi, C. Pal,
R. Pascanu, M. Pezeshki, C. Raffel, D. Renshaw, M. Rocklin, A. Romero, M. Roth,
P. Sadowski, J. Salvatier, F. Savard, J. Schluter, J. Schulman, G. Schwartz, I. V.
Serban, D. Serdyuk, S. Shabanian, . Simon, S. Spieckermann, S. R. Subramanyam,
J. Sygnowski, J. Tanguay, G. van Tulder, J. Turian, S. Urban, P. Vincent, F. Visin,
H. de Vries, D. Warde-Farley, D. J. Webb, M. Willson, K. Xu, L. Xue, L. Yao, S. Zhang,
and Y. Zhang, “Theano: A Python framework for fast computation of mathematical
expressions,” pp. 1–19, 2016.
[39] A. Paszke, G. Chanan, Z. Lin, S. Gross, E. Yang, L. Antiga, and Z. Devito, “Automatic
differentiation in PyTorch,” Advances in Neural Information Processing Systems 30,
no. Nips, pp. 1–4, 2017.
[40] J. Zacharias, M. Barz, and D. Sonntag, “A Survey on Deep Learning Toolkits and
Libraries for Intelligent User Interfaces,” 2018.
[41] T. Nomi, “tiny-dnn.” https://github.com/tiny-dnn/tiny-dnn, 2017.
[42] NVIDIA, "NVIDIA CUDA C Programming Guide, PG-02829-001 v9.1," 2018.
[43] F. Kintz, "GPU Performance Enhancement." https://wiki.tum.de/display/lfdv/GPU+Performance+Enhancement/, 2017.
[44] H. Chauhan, "Nvidia Is Running Away With the GPU Market." https://www.fool.com/investing/2017/12/06/nvidia-is-running-away-with-the-gpu-market.aspx, 2017.
[45] NVIDIA, "GEFORCE GTX 1080 Ti." https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/, 2017.
[46] NVIDIA, "NVIDIA's Next Generation CUDA Compute Architecture: Fermi," white
paper, pp. 1–22, 2009.
[47] D. Kirk, “NVIDIA cuda software and gpu parallel computing architecture,” Proceedings
of the 6th international symposium on Memory management - ISMM ’07, pp. 103–104,
2007.
[48] P. Warden, “Why GEMM is at the Heart of Deep Learning.”
https://petewarden.com/2015/04/20/why-gemm-is-at-the-heart-of-deep-learning/,
2015.
[49] X. Li, G. Zhang, H. H. Huang, Z. Wang, and W. Zheng, “Performance Analysis of
GPU-Based Convolutional Neural Networks,” 2016 45th International Conference on
Parallel Processing (ICPP), pp. 67–76, 2016.
[50] S. Hadjis, F. Abuzaid, C. Zhang, and C. Re, “Caffe con Troll,” in Proceedings of the
Fourth Workshop on Data analytics in the Cloud - DanaC’15, pp. 1–4, 2015.
[51] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shel-
hamer, “cuDNN: Efficient Primitives for Deep Learning,” arXiv preprint arXiv:1410.0759, pp. 1–9, 2014.
[52] F. Abuzaid, S. Hadjis, C. Zhang, and C. Re, “Caffe con Troll: Shallow Ideas to Speed
Up Deep Learning,” 2015.
[53] J. Keuper and F.-J. Pfreundt, “Distributed training of deep neural networks: Theoret-
ical and practical limits of parallel scalability,” Proceedings of MLHPC 2016: Machine
Learning in HPC Environments - Held in conjunction with SC 2016: The Interna-
tional Conference for High Performance Computing, Networking, Storage and Analysis,
pp. 19–26, 2017.
[54] D. C. Ciresan, “Simple C/C++ code for training and testing MLPs and CNNs.”
http://people.idsia.ch/~ciresan/data/net.zip.
[55] M. Tavallaee, E. Bagheri, W. Lu, and A. A. Ghorbani, “A detailed analysis of the KDD
CUP 99 data set,” in Proceedings of the IEEE Symposium on Computational Intelligence
for Security and Defense Applications (CISDA), pp. 1–6, 2009.
[56] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning Internal Representations
by Error Propagation,” 1986.
[57] D. Ciresan, U. Meier, J. Masci, L. M. Gambardella, and J. Schmidhuber, “Flexible, High
Performance Convolutional Neural Networks for Image Classification,” International
Joint Conference on Artificial Intelligence (IJCAI) 2011, pp. 1237–1242, 2011.
[58] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep
Convolutional Neural Networks,” Advances in Neural Information Processing Systems
25 (NIPS 2012), pp. 1097–1105, 2012.
[59] Y. L. Boureau, J. Ponce, and Y. LeCun, “A theoretical analysis of feature pooling in
visual recognition,” 27th International Conference on Machine Learning, 2010.
[60] J. Principe, N. Euliano, and W. Lefebvre, Neural and Adaptive Systems: Fundamentals
Through Simulation: Multilayer Perceptrons. 1997.
[61] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and Accurate Deep Network
Learning by Exponential Linear Units (ELUs),” in ICLR, 2016.
[62] N. S. Keskar, D. Mudigere, J. Nocedal, M. Smelyanskiy, and P. T. P. Tang, “On Large-
Batch Training for Deep Learning: Generalization Gap and Sharp Minima,” pp. 1–16,
2016.
[63] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Neural Networks: Tricks of the
Trade,” Springer Lecture Notes in Computer Science, 1998.
[64] M. M. Rahman and D. N. Davis, “Addressing the Class Imbalance Problem in Medical
Datasets,” International Journal of Machine Learning and Computing, pp. 224–228,
2013.
[65] F. Provost, “Machine learning from imbalanced data sets 101,” in Proceedings of the
AAAI’2000 Workshop on Imbalanced Data Sets, 2000.
[66] N. Japkowicz, “Learning from Imbalanced Data Sets: A Comparison of Various Strate-
gies,” AAAI workshop on learning from imbalanced data sets, vol. 68, pp. 10–15, 2000.
[67] Z. H. Zhou and X. Y. Liu, “Training cost-sensitive neural networks with methods ad-
dressing the class imbalance problem,” IEEE Transactions on Knowledge and Data
Engineering, vol. 18, no. 1, pp. 63–77, 2006.
[68] R. Reed, “Pruning Algorithms - A Survey,” IEEE Transactions on Neural Networks,
vol. 4, no. 5, pp. 740–747, 1993.
[69] B. Fritzke, “Growing cell structures-A self-organizing network for unsupervised and
supervised learning,” Neural Networks, vol. 7, no. 9, pp. 1441–1460, 1994.
[70] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov,
“Improving neural networks by preventing co-adaptation of feature detectors,” pp. 1–
18, 2012.
[71] R. A. Jacobs, “Increased rates of convergence through learning rate adaptation,” Neural
Networks, vol. 1, no. 4, pp. 295–307, 1988.
[72] A. Krogh and J. A. Hertz, “A Simple Weight Decay Can Improve Generalization,”
Advances in Neural Information Processing Systems, vol. 4, pp. 950–957, 1992.
[73] G. E. Hinton, “Learning translation invariant recognition in a massively parallel
network,” in Proceedings of PARLE Conference on Parallel Architectures and Languages
Europe, pp. 1–13, 1987.
Appendix A
Mathematical Derivations
A.1 Forward and Backpropagation
The math for the training works as follows: the loss function, or overall error, is defined
as the average of the errors for the individual patterns, or input vectors (A.1). The error
of each pattern is defined as the sum of squares of the differences between the measured
and target outputs for each output of the network; in the case of digit recognition, this is
the sum of 10 squares (A.2). The sum is normalized by the number of outputs so that the
error does not depend on how many outputs the network has, and the factor of two is
included simply so that it cancels when the derivative is computed, saving multiplications
later. Note that in classification problems such as those discussed in this thesis, the target
output is 1 for the correct label and 0 for every other label.
\[ E := \frac{1}{N_p} \sum_p E_p \tag{A.1} \]

\[ E_p := \frac{1}{2N_o} \sum_m (O_m - T_m)^2 \tag{A.2} \]
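To make (A.1) and (A.2) concrete, here is a minimal NumPy sketch (illustrative only, not the thesis's implementation; the function and array names are chosen for this example) that computes the per-pattern error and the overall error exactly as defined above:

```python
import numpy as np

def pattern_error(outputs, targets):
    """E_p per (A.2): sum of squared output errors, normalized by 2 * N_o."""
    n_outputs = outputs.shape[-1]
    return np.sum((outputs - targets) ** 2, axis=-1) / (2 * n_outputs)

def overall_error(all_outputs, all_targets):
    """E per (A.1): average of E_p over all N_p patterns."""
    return np.mean(pattern_error(all_outputs, all_targets))
```

For a one-hot target vector in the digit-recognition case described above, `pattern_error` reduces to the sum of 10 squared differences divided by 20.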
A given neuron’s output, as discussed earlier, is based on the activation function, and the
net is the weighted sum of the inputs plus the bias.
\[ O_j := \phi(net_j) \tag{A.3} \]

\[ net_j := \sum_i w_{ij} O_i + b_j \tag{A.4} \]
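As a concrete rendering of (A.3) and (A.4), the sketch below forward propagates one fully connected layer. It assumes a sigmoid activation for φ purely for illustration; the derivation holds for any differentiable activation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_forward(inputs, weights, biases, phi=sigmoid):
    """One layer: net_j = sum_i w_ij * O_i + b_j (A.4), then O_j = phi(net_j) (A.3).
    weights has shape (n_inputs, n_neurons); inputs holds the previous layer's outputs O_i."""
    net = inputs @ weights + biases  # (A.4)
    return phi(net)                  # (A.3)
```

Stacking calls to `layer_forward`, feeding each layer's outputs into the next, yields the full forward pass of an MLP.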
The last definition needed is the update of the weights, called gradient descent, in which
each weight is moved in the direction opposite to its effect on the computed error. The
update has a direction, given by the partial derivative of the error with respect to the
weight, and a magnitude, scaled by a constant referred to as the learning rate η.
\[ w_{ij}(k+1) := w_{ij}(k) - \eta \frac{\partial E}{\partial w_{ij}} \tag{A.5} \]
If an adaptive learning rate with decay factor 0 < a ≤ 1 is used, then after each epoch the
learning rate is updated:

\[ \eta(k+1) := a\,\eta(k) \tag{A.6} \]

To disable the adaptive learning rate, a is set to 1, keeping η constant.
If momentum α is used, then the ∂E/∂w_{ij} term in (A.5) is replaced by (∂E/∂w_{ij})*, a
weighted average of the gradient with its history:

\[ \left( \frac{\partial E}{\partial w_{ij}} \right)^{*} := \alpha \left( \frac{\partial E}{\partial w_{ij}} \right)^{*} + (1 - \alpha) \frac{\partial E}{\partial w_{ij}} \tag{A.7} \]
To disable momentum, α is simply set to 0.
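A sketch of the update rules (A.5) through (A.7) follows (hypothetical code, with illustrative names). Note that (A.7) defines momentum as an exponentially weighted average of the gradient with its own history, rather than the accumulated-velocity form used in many libraries; the sketch follows (A.7) literally:

```python
import numpy as np

class GradientDescent:
    """Implements (A.5) with the adaptive learning rate (A.6) and momentum (A.7)."""

    def __init__(self, lr=0.1, decay=1.0, momentum=0.0):
        self.lr = lr              # eta in (A.5)
        self.decay = decay        # a in (A.6); a = 1 keeps eta constant
        self.momentum = momentum  # alpha in (A.7); alpha = 0 disables momentum
        self.grad_avg = None      # the starred, history-averaged gradient

    def step(self, weights, grad):
        if self.grad_avg is None:
            self.grad_avg = np.zeros_like(grad)
        # (A.7): weighted average of the current gradient with its history
        self.grad_avg = self.momentum * self.grad_avg + (1 - self.momentum) * grad
        # (A.5): step against the (averaged) gradient
        return weights - self.lr * self.grad_avg

    def end_epoch(self):
        # (A.6): decay the learning rate once per epoch
        self.lr *= self.decay
```

With `decay=1.0` and `momentum=0.0` this reduces to plain gradient descent, matching the disabling conditions stated above.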
Using (A.1), ∂E/∂w_{ij} can be solved for:

\[ \frac{\partial E}{\partial w_{ij}} = \frac{1}{N_p} \sum_p \frac{\partial E_p}{\partial w_{ij}} \tag{A.8} \]
And the partial derivative ∂E_p/∂w_{ij} can be solved for using the chain rule:

\[ \frac{\partial E_p}{\partial w_{ij}} = \frac{\partial E_p}{\partial O_j} \cdot \frac{\partial O_j}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ij}} \tag{A.9} \]
The last two components are easily solved for.
\[
\begin{aligned}
\frac{\partial net_j}{\partial w_{ij}} &= \frac{\partial}{\partial w_{ij}} \left( \sum_m w_{mj} O_m + b_j \right) \\
&= \frac{\partial}{\partial w_{ij}} \left( w_{0j} O_0 + w_{1j} O_1 + \dots + w_{ij} O_i + \dots + w_{mj} O_m + b_j \right) \\
&= O_i
\end{aligned} \tag{A.10}
\]
\[ \frac{\partial O_j}{\partial net_j} = \frac{\partial}{\partial net_j} \phi(net_j) \tag{A.11} \]
Note that this requires the activation function to be differentiable. Solving for the final
component, ∂E_p/∂O_j, is more complex. If O_j is in the output layer, then its effect on E_p
is easy to calculate using (A.2):
\[
\begin{aligned}
\frac{\partial E_p}{\partial O_j} &= \frac{\partial}{\partial O_j} \left( \frac{1}{2N_o} \sum_m (O_m - T_m)^2 \right) \\
&= \frac{1}{2N_o} \frac{\partial}{\partial O_j} \left( (O_0 - T_0)^2 + (O_1 - T_1)^2 + \dots + (O_j - T_j)^2 + \dots + (O_m - T_m)^2 \right) \\
&= \frac{1}{2N_o} \cdot 2(O_j - T_j) \\
&= \frac{1}{N_o} (O_j - T_j)
\end{aligned} \tag{A.12}
\]
However, if O_j is not in the output layer, then its contribution to the error is the sum of
its contributions through each of the neurons in the layer above that use its output as one
of their inputs, which in the case of an MLP means all the neurons in the layer above:
\[ \frac{\partial E_p}{\partial O_j} = \sum_m \frac{\partial E_p}{\partial O_m} \cdot \frac{\partial O_m}{\partial O_j} \tag{A.13} \]
where m iterates through all the neurons of the layer above. Solving for these components,
the chain rule can again be used to obtain
\[ \frac{\partial O_m}{\partial O_j} = \frac{\partial O_m}{\partial net_m} \cdot \frac{\partial net_m}{\partial O_j} \tag{A.14} \]
where ∂O_m/∂net_m can be solved using (A.11). ∂net_m/∂O_j was defined with the output O_j of
neuron j as an input to neuron m (as set up in (A.13)), so it can easily be solved for by
expanding the sum:
\[
\begin{aligned}
\frac{\partial net_m}{\partial O_j} &= \frac{\partial}{\partial O_j} \left( \sum_n w_{nm} O_n + b_m \right) \\
&= \frac{\partial}{\partial O_j} \left( w_{0m} O_0 + w_{1m} O_1 + \dots + w_{jm} O_j + \dots + w_{nm} O_n + b_m \right) \\
&= w_{jm}
\end{aligned} \tag{A.15}
\]
where n iterates through all the inputs to neuron m, effectively iterating through all the
neurons in the layer of neuron j.
Combining (A.13), (A.14), (A.11), and (A.15):
\[ \frac{\partial E_p}{\partial O_j} = \sum_m \frac{\partial E_p}{\partial O_m} \cdot \frac{\partial \phi(net_m)}{\partial net_m} \cdot w_{jm} \tag{A.16} \]
Combining the equations above, we get the following:
\[ \frac{\partial E_p}{\partial w_{ij}} = \frac{\partial E_p}{\partial O_j} \cdot \frac{\partial O_j}{\partial net_j} \cdot \frac{\partial net_j}{\partial w_{ij}} \tag{A.9} \]

\[ \frac{\partial E_p}{\partial O_j} = \begin{cases} \dfrac{1}{N_o}(O_j - T_j) & \text{if } O_j \text{ is in the output layer} \\[2ex] \displaystyle\sum_m \frac{\partial E_p}{\partial O_m} \cdot \frac{\partial O_m}{\partial net_m} \cdot w_{jm} & \text{otherwise} \end{cases} \tag{A.17} \]

\[ \frac{\partial O_j}{\partial net_j} = \frac{\partial}{\partial net_j} \phi(net_j) \tag{A.11} \]

\[ \frac{\partial net_j}{\partial w_{ij}} = O_i \tag{A.10} \]
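This summary translates almost line for line into code. The following sketch (again illustrative, not the thesis's implementation) computes the weight gradients of a two-layer MLP for a single pattern, assuming sigmoid activations so that ∂O/∂net = O(1 − O) per (A.11):

```python
import numpy as np

def backprop_two_layer(x, o_hidden, o_out, target, w_out):
    """Per-pattern weight gradients via (A.17), (A.11), and (A.10).
    x: input vector; o_hidden, o_out: layer outputs saved from the forward pass;
    w_out: hidden-to-output weights of shape (n_hidden, n_out).
    Bias gradients are omitted for brevity."""
    n_out = o_out.shape[0]

    # Output layer: top case of (A.17), times the sigmoid derivative (A.11)
    dE_dO_out = (o_out - target) / n_out
    delta_out = dE_dO_out * o_out * (1 - o_out)

    # Hidden layer: bottom case of (A.17), summing over the layer above
    dE_dO_hid = w_out @ delta_out
    delta_hid = dE_dO_hid * o_hidden * (1 - o_hidden)

    # (A.10): dnet_j/dw_ij = O_i, so each gradient is an outer product
    grad_w_out = np.outer(o_hidden, delta_out)
    grad_w_hid = np.outer(x, delta_hid)
    return grad_w_hid, grad_w_out
```

Averaging these per-pattern gradients over a batch and passing the result to the update rule (A.5) completes one training step.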
A.2 Softmax and Cross Entropy
When using softmax as the final activation function, a different error function from the one
defined in (A.2) is used. With softmax, each output is defined as:

\[ O_i := \frac{e^{net_i}}{\sum_j e^{net_j}} \tag{A.18} \]
with j iterating through all the neurons in the output layer. The error function replacing
(A.2) is the cross-entropy:

\[ E_p := -\sum_j t_j \log O_j \tag{A.19} \]
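A short sketch of (A.18) and (A.19) follows (hypothetical code; the max-subtraction inside `softmax` is a standard numerical-stability trick that is not part of the derivation, and it leaves (A.18) unchanged because it scales numerator and denominator by the same constant):

```python
import numpy as np

def softmax(net):
    """(A.18): O_i = exp(net_i) / sum_j exp(net_j), shifted by max(net) for stability."""
    e = np.exp(net - np.max(net))
    return e / np.sum(e)

def cross_entropy(outputs, targets, eps=1e-12):
    """(A.19): E_p = -sum_j t_j * log(O_j); eps guards against log(0)."""
    return -np.sum(targets * np.log(outputs + eps))
```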
Of the error as derived in (A.9), only the ∂E_p/∂O_j · ∂O_j/∂net_j part changes in this case,
so we will derive that product again with the new error function:
\[ \frac{\partial E_p}{\partial net_i} = \sum_j \frac{\partial E_p}{\partial O_j} \cdot \frac{\partial O_j}{\partial net_i} \tag{A.20} \]
where j iterates through all the neurons in the output layer. Solving for those two parts:
\[
\frac{\partial E_p}{\partial O_j} = \frac{\partial}{\partial O_j} \left( -\sum_k t_k \log O_k \right)
= -t_j \cdot \frac{\partial \log O_j}{\partial O_j}
= \frac{-t_j}{O_j}
\tag{A.21}
\]
where O_j is a specific output and k iterates through all the neurons in the output layer.
Using the quotient rule:
\[
\begin{aligned}
\frac{\partial O_j}{\partial net_i} &= \frac{\partial}{\partial net_i} \left( \frac{e^{net_j}}{\sum_k e^{net_k}} \right) \\
&= \frac{\left( \frac{\partial e^{net_j}}{\partial net_i} \right) \left( \sum_k e^{net_k} \right) - e^{net_j} \left( \frac{\partial \sum_k e^{net_k}}{\partial net_i} \right)}{\left( \sum_k e^{net_k} \right)^2} \\
&= \frac{\left( \frac{\partial e^{net_j}}{\partial net_i} \right) \left( \sum_k e^{net_k} \right) - e^{net_j} e^{net_i}}{\left( \sum_k e^{net_k} \right)^2} \\
&= \begin{cases}
\dfrac{e^{net_i} \left( \sum_k e^{net_k} \right) - \left( e^{net_i} \right)^2}{\left( \sum_k e^{net_k} \right)^2} = O_i - O_i^2 & i = j \\[2.5ex]
\dfrac{0 \cdot \left( \sum_k e^{net_k} \right) - e^{net_j} e^{net_i}}{\left( \sum_k e^{net_k} \right)^2} = -O_j O_i & i \neq j
\end{cases}
\end{aligned} \tag{A.22}
\]
where the simplification in each case uses the definition of the outputs in (A.18).
Next, plugging (A.21) and (A.22) into (A.20) gives:
\[
\begin{aligned}
\frac{\partial E_p}{\partial net_i} &= \sum_j \left( \frac{-t_j}{O_j} \right) \frac{\partial O_j}{\partial net_i} \\
&= \overbrace{\left[ \sum_j \left( \frac{-t_j}{O_j} \right) (-O_j O_i) \right]}^{\text{assumes } i \neq j \text{ for all terms}} - \overbrace{\left( \frac{-t_i}{O_i} \right) (-O_i O_i)}^{\text{subtract wrong } i = j \text{ term}} + \overbrace{\left( \frac{-t_i}{O_i} \right) (O_i - O_i^2)}^{\text{add correct } i = j \text{ term}} \\
&= \left[ \sum_j t_j O_i \right] - t_i O_i - t_i (1 - O_i) \\
&= O_i \underbrace{\sum_j t_j}_{\text{sum of targets is } 1} - \; t_i O_i - t_i + t_i O_i \\
&= O_i - t_i
\end{aligned} \tag{A.23}
\]
So when using softmax as the activation function on the output layer, the loss function
changes from (A.2) to (A.19), and when backpropagating, ∂E_p/∂net_i for a neuron i in the
output layer changes from the product of (A.12) and (A.11) to the result in (A.23). This
concludes the derivation of the math required to train a DNN.
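The clean form of (A.23) is easy to verify numerically. The following self-contained sketch (illustrative code, not from the thesis, redefining the softmax and cross-entropy helpers so the snippet stands alone) compares the analytic gradient O_i − t_i against a central finite-difference estimate of ∂E_p/∂net_i; it uses a one-hot target, consistent with the assumption in (A.23) that the targets sum to 1:

```python
import numpy as np

def softmax(net):
    e = np.exp(net - np.max(net))
    return e / np.sum(e)

def cross_entropy(outputs, targets):
    return -np.sum(targets * np.log(outputs))

def check_gradient(net, targets, h=1e-6):
    """Largest discrepancy between the analytic gradient (A.23) and finite differences."""
    analytic = softmax(net) - targets  # (A.23)
    numeric = np.zeros_like(net)
    for i in range(net.size):
        up, down = net.copy(), net.copy()
        up[i] += h
        down[i] -= h
        numeric[i] = (cross_entropy(softmax(up), targets)
                      - cross_entropy(softmax(down), targets)) / (2 * h)
    return np.max(np.abs(analytic - numeric))

print(check_gradient(np.array([1.0, -0.5, 2.0]), np.array([0.0, 0.0, 1.0])))
# prints a tiny value (on the order of 1e-10), consistent with (A.23)
```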