Generic Object Tracking with NVIDIA Jetson Nano Using Siamese Convolutional Neural Networks
Generisk objektsföljning med NVIDIA Jetson Nano med hjälp av siamesiska faltande neurala nätverk
Alexander Selberg
Abstract
In this thesis, a generic object tracker was constructed and applied
both to a commonly used tracking dataset, using a regular computer,
and to a robot powered by a small NVIDIA computer. The architecture
of the tracker consisted of two parallel convolutional neural
networks merging into a single output. The input consisted of two
cropped images, one fed into each network. The images depicted an
object from an image sequence at times t and t + 1, both centered on
the object’s position at time t. The purpose of the network is to
compare the two images and output coordinates for the object’s
position at time t + 1.
The tracker was successful in following several objects from a
commonly used visual object tracking dataset but performed
inconsistently across scenarios depending on its training time. The
size of the tracker became a problem when applying it to the robot,
requiring a significant size reduction. This had a negative effect
on the tracker’s performance. The tracker managed to track at up to
60 fps on the computer but only around 10 fps on the robot. It is
likely that the tracking performance and speed on the robot can be
improved significantly by optimizing the tracker’s neural network
structure and adjusting its training duration.
Acknowledgements
I wish to show my gratitude to all the people at SAAB Dynamics AB
who have been a vital part of shaping this thesis. First I’d like to
thank Mattias Helsing for providing the opportunity for me to
conduct my thesis at SAAB. I’m also very grateful to my supervisor
at SAAB, Bjorn Johansson, for providing me with valuable discussions
and assistance during my time here. I’d also like to offer special
thanks to Gabriel Khajo and Richard Barkman at SAAB for their
interest in my work and invaluable help.
I’d also like to thank my supervisor at Lund University of
Technology, Anders Heyden, for providing feedback and assistance
during my thesis. Finally, I’d like to thank my examiner, Kalle
Astrom, for showing genuine interest in my work.
Contents
1 Introduction
  1.1 Background
  1.2 Purpose and goal
  1.3 Equipment
2 Theory
  2.1 Machine Learning
    2.1.1 Polynomial Curve Fitting
    2.1.2 Regularization
    2.1.3 Gradient Descent
  2.2 Neural Networks
    2.2.1 Activation Functions
    2.2.2 Backpropagation
    2.2.3 Dropout
    2.2.4 Batch Normalization
  2.3 Convolutional Neural Networks
    2.3.1 Image Classification
    2.3.2 Convolutional Layers
    2.3.3 Pooling Layers
    2.3.4 Fully Connected Layers
  2.4 Generic Object Tracking
    2.4.1 Transfer Learning
    2.4.2 Data Augmentation
    2.4.3 Region Overlap Score
  2.5 Network Architecture
    2.5.1 Depthwise Separable Convolutions
    2.5.2 Inverted Residuals and Linear Bottlenecks
3 Methodology
  3.1 Dataset
    3.1.1 Data Preprocessing
    3.1.2 Data Augmentation
  3.2 Network Input / Output
  3.3 Training
    3.3.1 Training Data Preparation
    3.3.2 Network Training
  3.4 Tracking
  3.5 Tracking Using the Jetbot
4 Results & Discussion
  4.1 Tracking Scenarios
  4.2 Dataset
  4.3 Tracking Speed
  4.4 Tracking With the Jetbot
5 Conclusion & Further Work
Bibliography
1.1 Background
The importance of the computer to the advancements made in modern
society cannot be overstated. With it came humanity’s ability to
perform calculations and solve problems that would otherwise be
considered impossible. However, despite this enormous capacity for
problem-solving, computers still have a hard time with several
problems that we as humans consider trivial, for example speech and
object recognition [43]. A person would have no problem identifying
a cat in a picture, even if the cat were partly occluded or of a
different color or breed than any the person had seen before. The
brain is remarkably good at processing visual information and using
previous knowledge to quickly identify new objects. To a computer,
an image is simply a set of numbers in a specific order, as can be
seen in Figure 1.1. To make sense of an image, the computer has to
be able to recognize patterns in these numbers, and this is where
Artificial Intelligence comes in.
Figure 1.1: (a) The number 3 represented as a matrix (MNIST
dataset). (b) One color channel of an image of a truck represented
as a matrix (CIFAR10 dataset).
Artificial Intelligence, or AI for short, has been around as a
concept for a very long time. The thought of constructing machines
or robots that act and behave like intelligent beings can be found
as early as in ancient Greece [28] but
it wasn’t until the invention of the modern computer that this
dream suddenly appeared within reach. While some probably associate
AI with sentient robots and artificial humans the field is a bit
broader than that. AI can be defined as the development and study
of so-called intelligent agents [25]. An agent in this case can be
several things, for example a thermometer, a dog or a human. The
only requirement is that it is something that acts in an
environment. For an agent to be considered intelligent it has to be
able to do more than just act: it has to be able to adapt to new
environments, come up with solutions and learn from past mistakes.
By constantly receiving new feedback the agent is supposed to
continuously improve its performance. For the agent to learn and
adapt sufficiently, it has to gather the knowledge gained from its
experience and organize it into a hierarchy of concepts. By learning
a large number of simpler concepts and constructing more complex
concepts from them, it can gain a better understanding of its
environment. If all these concepts were visualized in a graph, built
on top of each other in layers, the graph would be deep, and this
approach to AI is what is called Deep Learning [11], a subset of
Machine Learning.
Machine Learning can be defined as a learning process where the
computer through repeated exposure and gained experience learns to
recognize patterns and important features for different kinds of
problems. This can be anything from learning to differentiate
between spam and non-spam emails, creating an automatic customer
system or identifying cats in images [1]. The last-mentioned
problem is called an image classification problem and is one of the
more difficult problems for a computer to solve simply because of
the huge amount of difference there can be between two pictures
depicting the same thing. By allowing the computer to train on a
large dataset consisting of images with specific objects and
constantly providing feedback to the computer on its current
performance, the computer will eventually learn which features are
important for different objects and hopefully when presented with a
picture of a cat the computer will then correctly classify it as a
cat. For this object classification to be successful a huge dataset
is often needed for the computer to train on and that in turn
requires a large computational capacity. Recent years have seen
massive progress in this area, largely thanks to the increase in
the computational capacity of modern computers. For these kinds of
problems, Deep Learning can be an effective approach.
Deep Learning in image classification has had a resurgence since
2012, when a deep neural network called AlexNet outperformed the
then state-of-the-art image classifiers [24]. The objective was to
classify 1.2 million high-resolution images into 1000 different
classes. The key to their success, the authors argue, is the access
to very large datasets such as ImageNet, consisting of over 15
million high-resolution images in roughly 22,000 categories [8], and
their ability to construct such a large network thanks to GPUs.
Graphics Processing Units (GPUs) are widely used in video games, and
the huge market and competition have led to continuous performance
improvements and driven down prices significantly. It turns out that
their ability to quickly perform vector and matrix multiplications
in parallel is beneficial for training neural networks and superior
to that of the previously used Central Processing Unit (CPU) [34].
1.2 Purpose and goal
In this project the objective is to construct a generic object
tracker and implement it on a SparkFun JetBot AI Kit powered by an
NVIDIA Jetson Nano Developer Kit computer [37] [18]. Often when
constructing object trackers, the object to be tracked is known from
the start. It might be a tracker whose purpose is to follow football
players around the field or a tracker that follows the cars during a
NASCAR race. For these trackers it is sufficient to train their
models on one specific type of object. The purpose of a generic
object tracker is to be able to track any object without prior
training on that specific object. Tracking an object means knowing
its location at every time step of an image sequence. To do that, a
generic object tracker must either accurately predict the object’s
next location or detect the object anew in each frame.
There are two common approaches to constructing object trackers:
online and offline tracking. An online object tracker keeps learning
while it runs, continuously learning new features and adjusting
existing ones. This can result in a very accurate tracker that has
little trouble with changes in an object’s translation and scale,
but the obvious downside is the computational effort it takes to
constantly recalculate the tracking parameters for every image
frame [13].
An offline tracker is the complete opposite: instead of learning new
features as it runs, it relies completely on its pretrained model.
This requires far less computational effort and can therefore prove
beneficial when using the NVIDIA device or other embedded devices.
It can still achieve high accuracy and its evaluation at test time
is very fast, but it is unable to adapt to new situations, and its
performance decreases significantly in more difficult tracking
scenarios such as large occlusions or significant appearance
changes [13].
The goals of this thesis can be divided into three major parts:
• Creating a generic object tracker that can learn to track objects
through image sequences found in commonly used datasets or
personally created image sequences.
• Applying the tracker algorithm to the Jetbot and attempting to
track real-life objects using its built-in camera.
• Investigating and demonstrating the capabilities and limitations
of the NVIDIA device.
1.3 Equipment
The equipment used for this thesis consists of a computer with an
NVIDIA Quadro P4000 graphics card and a SparkFun Jetbot AI Kit
powered by an NVIDIA Jetson Nano Developer Kit. The technical
specifications for the NVIDIA card can be found in [18].
2.1 Machine Learning
In machine learning, the goal is to allow a computer system to
improve its performance on a task through training. A commonly used
definition of machine learning is: “Improving some measure of
performance P when executing some task T, through some type of
training experience E” [29]. This section aims to provide an
understanding of how a computer system can be constructed that
automatically improves through experience. A simple regression
problem, polynomial curve fitting, can be used to introduce several
of the key concepts in machine learning.
2.1.1 Polynomial Curve Fitting
Polynomial curve fitting is a regression problem first encountered
in statistics. The concept is simple: based on a set of observations
x ≡ (x_1, ..., x_N), construct a polynomial that can accurately
predict the corresponding observations of the value
t ≡ (t_1, ..., t_N) [4]. The polynomial takes the form

$$ y(x, \mathbf{w}) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j, \tag{2.1} $$

where M is the order of the polynomial. The regression coefficients
w_j are usually referred to as weights in ML. The weights are
determined by fitting the polynomial to the observed set, which is
normally done by minimizing a so-called loss function. The loss in
this case is defined as the difference between the predicted value
y(x_n, w) and t_n and is visualized as the green bars in Figure 2.1.
The goal is to minimize the loss for all the points in the set,
which is done by combining the individual losses into a loss
function. Several loss functions can be used. One of the more common
is the Mean Square Error (MSE), often referred to as L2 loss, which
is defined as

$$ L_2(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \big( y(x_n, \mathbf{w}) - t_n \big)^2. \tag{2.2} $$
Figure 2.1: Visualization of the loss, figure taken from [4]
L2 loss is a quadratic function of the coefficients w, so its
derivatives with respect to w are linear, which implies that there
is a unique solution that minimizes the loss function [4]. In fact,
for a set containing N points a perfect solution, resulting in zero
loss, can always be found using a polynomial of order M = N − 1,
since the polynomial then has N degrees of freedom corresponding to
the weight coefficients (w_0, ..., w_M)^T [4]. At first glance this
might seem like the optimal solution. However, while the immediate
goal was to minimize the loss, the main objective is to predict the
hidden pattern or function that the observations stem from.
Naturally, if the observations were taken exactly from a specific
function at different values, the perfect solution, with zero loss,
would be that function itself, not necessarily a polynomial of order
M = N − 1. However, almost all observations include some noise that
affects their values.
The hidden function is not necessarily a polynomial, and the range
might be limited. The 10 observations in Figure 2.2 come from the
function sin(2πx), spaced uniformly in the range [0, 1], with a
small amount of Gaussian distributed noise added [4].
Figure 2.2 shows the observed dataset as blue dots, the hidden
function sin(2πx) as the green curve and the fitted polynomial as
the red curve for polynomials of order M = 0, 1, 3 and 9. M = 3 best
fits the green curve, while M = 0 and M = 1 are both poor
approximations of the hidden function. M = 9 produces zero loss but
is also a poor approximation: many other observations within the
range would produce a massive loss. This is a common problem in ML
called overfitting, where the function or model is fitted too
closely to the training data. It achieves great results for the loss
function with regard to the training data but performs worse on any
other data taken from the hidden function. This highlights the
importance of dividing the data into different parts.
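The zero-loss case M = N − 1 can be reproduced with a short sketch. It is only an illustration under assumptions: the noise level (0.2), the random seed and the helper names are made up, and Lagrange interpolation is used in place of least squares because, for M = N − 1, both give the same polynomial through all N points.

```python
import math
import random


def lagrange_eval(xs, ts, x):
    """Evaluate the unique polynomial of order len(xs) - 1 through (xs, ts) at x."""
    total = 0.0
    for n, (xn, tn) in enumerate(zip(xs, ts)):
        basis = 1.0
        for m, xm in enumerate(xs):
            if m != n:
                basis *= (x - xm) / (xn - xm)
        total += tn * basis
    return total


def mse(predictions, targets):
    """Mean squared error between two equally long sequences."""
    return sum((y - t) ** 2 for y, t in zip(predictions, targets)) / len(targets)


rng = random.Random(0)  # fixed seed, assumption for reproducibility
N = 10
xs = [n / (N - 1) for n in range(N)]  # 10 points spaced uniformly in [0, 1]
ts = [math.sin(2 * math.pi * x) + rng.gauss(0.0, 0.2) for x in xs]

# Loss on the points the polynomial was fitted to.
train_mse = mse([lagrange_eval(xs, ts, x) for x in xs], ts)

# Loss against the noise-free hidden function at points between the observations.
x_new = [(n + 0.5) / (N - 1) for n in range(N - 1)]
t_new = [math.sin(2 * math.pi * x) for x in x_new]
new_mse = mse([lagrange_eval(xs, ts, x) for x in x_new], t_new)

print("training loss:", train_mse)
print("loss on new points:", new_mse)
```

The training loss is zero because the polynomial passes through every observation, while the loss on points in between is not, which is the overfitting behaviour described above.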
Another commonly used loss function is the Mean Absolute Error
(MAE), also referred to as L1 loss. It is defined similarly to the
L2 loss but uses the absolute value of the loss:

$$ L_1(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \left| y(x_n, \mathbf{w}) - t_n \right|. \tag{2.3} $$
Figure 2.2: Curve fitting for different orders of polynomials,
figure taken from [4]
While L2 loss punishes outliers more heavily, it barely penalizes
small errors, which can sometimes lead to small errors not being
penalized enough. L1 loss penalizes small errors relatively more,
which can sometimes be beneficial [15].
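The difference between the two losses can be seen on a small made-up example. The numbers below are hypothetical and chosen only so that one prediction set contains four small errors and the other a single outlier.

```python
def mse(predictions, targets):
    """L2 loss: mean of the squared errors."""
    return sum((y - t) ** 2 for y, t in zip(predictions, targets)) / len(targets)


def mae(predictions, targets):
    """L1 loss: mean of the absolute errors."""
    return sum(abs(y - t) for y, t in zip(predictions, targets)) / len(targets)


targets = [1.0, 2.0, 3.0, 4.0]
small_errors = [1.1, 1.9, 3.1, 3.9]  # every prediction is off by 0.1
one_outlier = [1.0, 2.0, 3.0, 6.0]   # a single prediction is off by 2.0

for name, predictions in [("small errors", small_errors), ("one outlier", one_outlier)]:
    print(name, "L2:", mse(predictions, targets), "L1:", mae(predictions, targets))
```

Squaring shrinks errors below 1 and inflates errors above 1, so the L2 loss is smaller than the L1 loss for the small errors and larger for the outlier.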
Training, validation and test dataset
The dataset, previously referred to as the observed dataset, is
often divided into three parts:
• Training dataset: The data used in the loss function, usually
around 80% of the data.
• Validation dataset: Data not used in the loss function but
monitored continuously during training, usually around 10% of the
data.
• Test dataset: Previously unseen data, evaluated at the end of the
training, usually around 10-20% of the data.
A machine learning model’s strength lies in its ability to
generalize and work on previously unseen data for the same task.
Outstanding performance on the training set does not necessarily
result in great performance otherwise, as previously seen in
Figure 2.2. The goal when training machine learning models therefore
becomes to perform well on the test dataset, which is only evaluated
at the end. The validation dataset can provide a hint for when it is
time to stop training the model.
Figure 2.3: Illustration of overfitting
Figure 2.3 shows the loss for the training and validation sets as a
function of the training time. Initially, both losses decrease
steadily, but after a while the validation loss starts to increase
while the training loss keeps getting smaller. This can indicate
that the model is overfitting to the training dataset, and it is a
reasonable time to stop training.
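The split and the stopping rule can be sketched as follows. This is a minimal illustration, not the procedure used in this thesis: the 80/10/10 fractions follow the list above, while the function names, the `patience` parameter and the example loss values are assumptions.

```python
import random


def split_dataset(samples, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle the samples and split them into training, validation and test parts."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test


def early_stopping_epoch(val_losses, patience=2):
    """Return the epoch with the lowest validation loss seen before the loss
    has failed to improve for `patience` consecutive epochs."""
    best_epoch = 0
    for epoch, loss in enumerate(val_losses):
        if loss < val_losses[best_epoch]:
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch


train, val, test = split_dataset(range(100))
print(len(train), len(val), len(test))

# Hypothetical validation losses: decreasing at first, then increasing.
val_losses = [1.0, 0.7, 0.5, 0.45, 0.5, 0.6, 0.8]
stop_epoch = early_stopping_epoch(val_losses)
print("keep the model from epoch", stop_epoch)
```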
2.1.2 Regularization
The problem of overfitting shows that it is necessary to take more
than just the loss into consideration. One way to reduce overfitting
is to minimize the complexity of the model together with the loss
function. This is called regularization and is done by adding a
penalty term to the loss function (2.2), so that the magnitude of
the weights contributes to the new modified function

$$ L(\mathbf{w}) = \frac{1}{N} \sum_{n=1}^{N} \big( y(x_n, \mathbf{w}) - t_n \big)^2 + \lambda \lVert \mathbf{w} \rVert^2, \tag{2.4} $$

where ||w||^2 = w^T w = w_0^2 + w_1^2 + ... + w_M^2 [4]. This is
called L2 regularization; other forms of regularization exist as
well [12]. λ is a value that determines how strongly complexity is
penalized: a higher λ strengthens the regularization effect and
encourages smaller weights, while a small λ allows more complexity
and larger weights.
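A minimal sketch of a regularized loss, assuming the penalty is λ times the squared norm of the weights added to the mean squared error of a polynomial model. The exact scaling of the two terms, the function names and the example data are assumptions.

```python
def poly(x, w):
    """Polynomial y(x, w) = w[0] + w[1] x + ... + w[M] x^M."""
    return sum(wj * x ** j for j, wj in enumerate(w))


def regularized_loss(w, xs, ts, lam):
    """Mean squared error plus an L2 penalty on the weights (assumed scaling)."""
    data_term = sum((poly(x, w) - t) ** 2 for x, t in zip(xs, ts)) / len(xs)
    penalty = lam * sum(wj * wj for wj in w)
    return data_term + penalty


xs = [0.0, 1.0]
ts = [0.0, 1.0]
w_small = [0.0, 1.0]         # fits both points exactly with small weights
w_large = [0.0, 11.0, -10.0]  # also fits both points, but with large weights

for lam in (0.0, 0.1):
    print(lam, regularized_loss(w_small, xs, ts, lam), regularized_loss(w_large, xs, ts, lam))
```

Both weight vectors give zero data loss on these two points, so with λ = 0 they are indistinguishable, whereas any λ > 0 favours the vector with the smaller weights.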
2.1.3 Gradient Descent
When trying to reduce the loss of a function, an iterative approach
is mostly used in practical applications. Perhaps the simplest and
most common method is called gradient descent. For a curve like the
one in Figure 2.4, moving from the starting point to the next point
in the plot is one step of the iterative process. The starting point
can be the result of choosing a weight value at random. The gradient
of the loss curve is then calculated and, since the idea is to
reduce the loss, a step is taken in the direction of the negative
gradient and the weight value is updated accordingly. This is
repeated until a satisfactory loss value is reached, ideally close
to the bottom of the curve. The size of the step is determined by
both the magnitude of the gradient and a chosen step size, or
learning rate [12]. The learning rate is a so-called hyperparameter.
Hyperparameters are values that are set before the iteration begins
and usually need to be thoroughly tuned before the model can achieve
good results. In this example a learning rate that is too large
risks stepping over the entire bottom section, returning an even
higher loss and never converging towards the desired loss, while a
learning rate that is too small would theoretically reach the
minimum loss eventually but can in practice take a very long
time [12].
Figure 2.4: Illustration of a convex loss curve, figure taken from
[12]
The loss curve in Figure 2.4 is convex, meaning that there exists
only one minimum value for the loss function. Usually this is not
the case, and there can be many local minima that gradient descent
risks getting stuck in.
When using gradient descent the weights are updated as

$$ \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla L(\mathbf{w}^{(\tau)}), \tag{2.5} $$

where η > 0 is the learning rate [4]. The loss function here is
defined for the entire training set, or batch, a method known as
batch gradient descent [4]. This can prove troublesome when working
with very large training datasets, since the loss function is
calculated individually for all training examples as

$$ L(\mathbf{w}) = \sum_{n=1}^{N} L_n(\mathbf{w}), \tag{2.6} $$

where N is the number of training examples. One way to reduce the
computational burden is to update the weights based on just one
training example, a form of gradient descent called stochastic
gradient descent, or SGD. The term stochastic comes from the example
being chosen at random [12]. The computation becomes much quicker
but also more noisy and irregular due to its stochastic nature [14].
The descent will jump around more and change direction, but the idea
is that on average it will work its way down the loss curve. Its
irregular behaviour also has the added benefit of sometimes escaping
local minima, preventing it from converging towards higher losses.
The weights are now updated as

$$ \mathbf{w}^{(\tau+1)} = \mathbf{w}^{(\tau)} - \eta \nabla L_n(\mathbf{w}^{(\tau)}). \tag{2.7} $$
The two forms of gradient descent examined so far can be seen as
extremes, since the number of examples used is either all of them or
just one. A natural compromise is to use a smaller batch of
examples, which still reduces the computational burden and results
in quicker calculations and convergence than batch gradient descent.
This method is known as mini-batch gradient descent [14]. It is
usually preferred to SGD because it is slightly less erratic and
irregular, and because it is more computationally efficient to
calculate the gradient once for 100 examples than 100 times for one
example [7]. The number of examples used in mini-batch gradient
descent is known as the batch size and is also a hyperparameter,
like the previously mentioned learning rate. Choosing a suitable
batch size is usually done simply by trying out different values and
observing the model’s performance. It is common to choose batch
sizes that are powers of 2, such as 32, 64 or 128, because in
practice many vectorized operation implementations work faster when
the sizes of their inputs are powers of 2 [7].
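The three variants differ only in how many examples contribute to each update, which the following sketch makes explicit. It fits a single weight in the toy model t ≈ w·x; the model, the data, the learning rate and the number of epochs are assumptions, and the gradient is averaged over the batch rather than summed.

```python
import random


def minibatch_gradient_descent(xs, ts, batch_size, learning_rate=0.05, epochs=50, seed=0):
    """Fit t ≈ w * x by minimizing the squared error, one mini-batch at a time.

    batch_size = 1 corresponds to SGD and batch_size = len(xs) to batch gradient descent.
    """
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # examples are visited in random order
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Gradient of the squared error, averaged over the examples in the batch.
            grad = sum(2.0 * (w * xs[i] - ts[i]) * xs[i] for i in batch) / len(batch)
            w -= learning_rate * grad
    return w


xs = [i / 10 for i in range(1, 33)]  # 32 toy inputs
ts = [2.0 * x for x in xs]           # noise-free targets, true weight is 2.0

for batch_size in (1, 8, 32):
    print(batch_size, minibatch_gradient_descent(xs, ts, batch_size))
```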
2.2 Neural Networks
Neural networks as a concept have existed since the 1950s [5].
Although they are both inspired by and often likened to information
processing in biological systems, the similarities are usually
exaggerated. In ML, a neural network consists of three types of
layers: the input layer, which handles the input to the network; one
or several hidden layers, each of which sums the inputs from the
previous layer, individually multiplied by their associated weights,
and passes the sum through an activation function; and finally an
output layer, which produces the network’s output. Figure 2.5 shows
a simple neural network with two inputs, three neurons in the single
hidden layer and two outputs, as well as two bias terms x_0 and h_0.
The grey lines symbolize weight multiplications and the arrows
denote the order in which the calculations are executed. In this
figure all the grey lines point to the right. Such neural networks,
where all operations occur in the same direction from the input to
the output layer, are called Feed-Forward networks [4].
Figure 2.5: Neural network containing one hidden layer
The inputs are summed in each hidden neuron together with their
respective weights as

$$ a_j = \sum_{i=0}^{D} w^{(1)}_{ji} x_i, \tag{2.8} $$

where j = 1, ..., M, x_0 = 1, D is the number of inputs and M is the
number of hidden neurons in the layer excluding the bias. The
superscript (1) refers to the first hidden layer of the network. The
quantities a_j are referred to as activations [4]. In the provided
example there is only one hidden layer, but there can also be
several. Regardless of how many layers are used, their purpose is
not yet clear. So far, the output of the network is a linear
function of the inputs: the hidden layer has just provided a broader
range of linear combinations of the inputs, and the network is
unable to handle nonlinear problems. The solution to this lies in
the use of activation functions.
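The linearity can be made explicit with a short derivation. It assumes that the hidden activations are passed on unchanged, and it leaves out the bias terms for brevity since they only add a constant. The symbols \tilde{w}_{ki} are introduced here for illustration.

```latex
y_k = \sum_{j=1}^{M} w^{(2)}_{kj} a_j
    = \sum_{j=1}^{M} w^{(2)}_{kj} \sum_{i=1}^{D} w^{(1)}_{ji} x_i
    = \sum_{i=1}^{D} \Big( \sum_{j=1}^{M} w^{(2)}_{kj} w^{(1)}_{ji} \Big) x_i
    = \sum_{i=1}^{D} \tilde{w}_{ki} x_i
```

Without a nonlinearity, the two layers therefore collapse into a single linear layer with weights \tilde{w}_{ki}, no matter how many hidden neurons are used.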
2.2.1 Activation Functions
Activation functions are applied after each hidden layer and
transform the activation in (2.8) with a nonlinear function. The
purpose of activation functions is to introduce nonlinearity into
the network and allow it to model more complex, nonlinear problems.
The output of each hidden node then becomes

$$ z_j = \sigma(a_j) = \sigma\left( \sum_{i=0}^{D} w^{(1)}_{ji} x_i \right), \tag{2.9} $$

where σ is the activation function. Typical activation functions
include the rectified linear unit (ReLU) and the Sigmoid function.
Figure 2.6: Two common activation functions, made in Python using
matplotlib. (a) ReLU function: F(x) = max(0, x). (b) Sigmoid
function: F(x) = 1 / (1 + e^(−x)).
Figure 2.6 shows the two most common activation functions. ReLU
retains only positive inputs and discards all negative inputs by
setting them to zero. Despite its simplicity it has often proven to
provide the best results while still being very cheap to
compute [24]. The Sigmoid function instead converts all activations
to values between 0 and 1. Which activation function to use is more
often decided based on what works best than by some fixed set of
rules [12].
Combining the outputs of the hidden layer from (2.9) with the last
set of weights, the output of the model becomes

$$ y_k(\mathbf{x}, \mathbf{w}) = \sigma\left( \sum_{j=0}^{M} w^{(2)}_{kj} z_j \right) \tag{2.10} $$

where z_0 = 1, so that w^(2)_k0 is the bias term for j = 0.
The goal is to find suitable values for the weights that produce
good output values. Good here means predicting values as close to
the “true” values as possible, that is, minimizing the loss
function. In neural networks this is done through backpropagation.
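The forward computation of a network like the one in Figure 2.5 can be sketched as follows. The weight values are arbitrary, and using the Sigmoid function in both the hidden and the output layer is an assumption made for this illustration.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def layer(inputs, weights, activation):
    """Compute one layer. Each row of `weights` is [bias weight, w_1, ..., w_D];
    a constant 1 is prepended to the inputs to play the role of the bias unit."""
    extended = [1.0] + list(inputs)
    return [activation(sum(w * x for w, x in zip(row, extended))) for row in weights]


def forward(x, w1, w2):
    """Forward pass through one hidden layer and one output layer."""
    z = layer(x, w1, sigmoid)  # hidden outputs
    y = layer(z, w2, sigmoid)  # network outputs
    return z, y


# Arbitrary example weights: 3 hidden neurons x (bias + 2 inputs),
# 2 outputs x (bias + 3 hidden neurons).
w1 = [[0.0, 1.0, -1.0], [0.5, 0.5, 0.5], [-0.5, 2.0, 0.0]]
w2 = [[0.0, 1.0, 1.0, 1.0], [0.1, -1.0, 0.0, 1.0]]

z, y = forward([1.0, 1.0], w1, w2)
print("hidden outputs:", z)
print("network outputs:", y)
```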
2.2.2 Backpropagation
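The important contribution of the backpropagation technique is its
ability to calculate the gradient ∂L/∂w_ij for all the weights in a
computationally efficient manner [4]. For simplicity’s sake, the
loss function for the derivation of backpropagation is chosen to be
L = (1/2) Σ_k (y_k(x, w) − t_k)^2. This results in
∂L/∂y_k = y_k − t_k. The chain rule can now be used to calculate the
gradient of the loss with respect to the weights connected to the
output layer as

$$ \frac{\partial L}{\partial w_{kj}} = \frac{\partial L}{\partial y_k} \frac{\partial y_k}{\partial a_k} \frac{\partial a_k}{\partial w_{kj}} = (y_k - t_k)\, \sigma'(a_k)\, z_j, \tag{2.11} $$

where w_kj are the output layer weights from (2.10) and
a_k = Σ_j w_kj z_j is the weighted sum entering output k.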
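The same type of calculation can then be used for all the weights in
the network before updating them and running another forward pass.
Repeating this will hopefully result in a more accurate network.
This also illustrates another benefit of the previously mentioned
activation functions ReLU and Sigmoid: their derivatives are simple
to compute,

$$ \frac{\partial\, \mathrm{ReLU}(x)}{\partial x} = \begin{cases} 0, & x < 0, \\ 1, & x > 0, \end{cases} \tag{2.12} $$

$$ \frac{\partial S(x)}{\partial x} = S(x)\big(1 - S(x)\big), \tag{2.13} $$

where S(x) denotes the Sigmoid function.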
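The chain rule for the output layer weights can be checked numerically. The sketch assumes a Sigmoid output activation and the loss L = (1/2) Σ_k (y_k − t_k)^2, so that the gradient is (y_k − t_k) y_k (1 − y_k) z_j. All numerical values are made up, and the finite-difference check is added here for verification only.

```python
import math


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def outputs(z, w2):
    """Network outputs y_k = sigmoid(sum_j w_kj z_j) for fixed hidden outputs z."""
    return [sigmoid(sum(w * zj for w, zj in zip(row, z))) for row in w2]


def loss(z, w2, t):
    """Squared error loss L = 1/2 sum_k (y_k - t_k)^2."""
    return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(outputs(z, w2), t))


def output_weight_gradients(z, w2, t):
    """Analytic gradient dL/dw_kj = (y_k - t_k) * y_k * (1 - y_k) * z_j."""
    y = outputs(z, w2)
    return [[(yk - tk) * yk * (1.0 - yk) * zj for zj in z] for yk, tk in zip(y, t)]


z = [1.0, 0.2, 0.7, 0.5]  # z[0] = 1 is the bias unit
w2 = [[0.1, -0.3, 0.8, 0.5], [0.0, 0.4, -0.6, 0.2]]
t = [1.0, 0.0]

analytic = output_weight_gradients(z, w2, t)


def numeric_gradient(k, j, eps=1e-6):
    """Central finite difference of the loss with respect to w2[k][j]."""
    plus = [row[:] for row in w2]
    minus = [row[:] for row in w2]
    plus[k][j] += eps
    minus[k][j] -= eps
    return (loss(z, plus, t) - loss(z, minus, t)) / (2.0 * eps)


max_diff = max(
    abs(analytic[k][j] - numeric_gradient(k, j))
    for k in range(len(w2))
    for j in range(len(z))
)
print("largest difference between analytic and numeric gradient:", max_diff)
```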
2.2.3 Dropout
Dropout is an extremely effective and simple regularization
technique often used in neural networks [7]. It works by assigning
each node in the hidden layers of the network a probability p of
staying active. A p-value of 0.5 means that each node has a
probability of 50% of staying active and a probability
(1 − p) = 50% of being set to zero. The process is illustrated in
Figure 2.7.
Figure 2.7: Dropout being applied to a neural network, figure taken
from [38]
It might appear counterintuitive to simply remove certain nodes
from the network based on some probability but it has been shown to
be very effective at preventing overfitting and also speeding up
training notably [38]. The idea is to force the network not to rely
too heavily on certain nodes that can have a strong adaptation to
the training set. Dropout is usually only applied to the network
during training time and is not used during testing [7].
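A minimal sketch of dropout applied to the outputs of one hidden layer, assuming each node is kept independently with probability p. Practical implementations usually also rescale the remaining outputs so that the expected sum stays the same. That step is left out here because it is not described above, and the function and parameter names are made up.

```python
import random


def dropout(activations, p_keep, training, rng):
    """Keep each activation with probability p_keep during training, else set it to zero.
    At test time the activations are passed through unchanged."""
    if not training:
        return list(activations)
    return [a if rng.random() < p_keep else 0.0 for a in activations]


rng = random.Random(0)  # fixed seed, assumption for reproducibility
hidden = [1.0] * 1000   # made-up outputs of a hidden layer

train_out = dropout(hidden, p_keep=0.5, training=True, rng=rng)
test_out = dropout(hidden, p_keep=0.5, training=False, rng=rng)

print("active nodes during training:", sum(1 for a in train_out if a != 0.0))
print("active nodes during testing:", sum(1 for a in test_out if a != 0.0))
```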
2.2.4 Batch Normalization
Batch normalization is a commonly used technique that can be used
to improve the speed, performance and stability of the network
[17]. Its purpose is to stabilize the distribution of layer inputs.
This is achieved by introducing new network layers that control the
mean and variance of these distributions. The widespread
understanding of its success comes from its assumed reduction of
the internal covariate shift (ICS) [17]. ICS refers to the change
in distribution of some layer input caused by updates to the
preceding layers [33]. More recent evidence points to other
underlying reasons for the success of batch normalization, such as a
smoothing of the optimization landscape, which leads to more
predictable and stable behaviour of the gradients and allows for
faster and more stable training [33].
Regardless of the exact reason behind the success of batch
normalization, the method is used with great success in a large
variety of networks [33], and the process is rather simple. For each
activation x^(k) in the network, the so-called Batch Normalization
Transform (BN) can be applied to mini-batches of the training set,
similar to the mini-batch gradient descent in Section 2.1.3. The
input consists of the values of x for a single activation over a
mini-batch, B = {x_1, ..., x_m}, together with two parameters γ and
β to be learned. The transform is

$$ \mu_B \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i, \tag{2.14} $$

$$ \sigma_B^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2, \tag{2.15} $$

$$ \hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \tag{2.16} $$

$$ y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \mathrm{BN}_{\gamma,\beta}(x_i), \tag{2.17} $$

which produces the output {y_i = BN_γ,β(x_i)}, where ε is a constant
added to the mini-batch variance for numerical stability, µ_B is the
mini-batch mean and σ_B^2 is the mini-batch variance [17].
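The transform can be written out for a single activation over one mini-batch. In the sketch below γ and β are fixed numbers, whereas in a network they are learned during training; the value of ε and the example batch are assumptions.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Batch Normalization Transform for one activation over a mini-batch."""
    m = len(batch)
    mean = sum(batch) / m                               # mini-batch mean
    variance = sum((x - mean) ** 2 for x in batch) / m  # mini-batch variance
    normalized = [(x - mean) / math.sqrt(variance + eps) for x in batch]
    return [gamma * x_hat + beta for x_hat in normalized]  # scale and shift


batch = [2.0, 4.0, 6.0, 8.0]
normalized = batch_norm(batch)
shifted = batch_norm(batch, gamma=2.0, beta=3.0)

print("normalized:", normalized)
print("scaled and shifted:", shifted)
```

With γ = 1 and β = 0 the output has zero mean and a variance very close to one, while other values of γ and β let the network choose a different scale and offset.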
2.3 Convolutional Neural Networks
A convolutional neural network, or CNN, is a class of deep neural
networks commonly used for visual tasks such as image
classification, object tracking or semantic segmentation [42]. Deep
learning resurfaced in 2012 when a CNN entered in the ImageNet Large
Scale Visual Recognition Challenge (ILSVRC) [24] [31] greatly
outperformed all other contestants. The purpose of the challenge was
to classify the object in each image of a set of roughly 1.2 million
high-resolution images covering 1000 different classes. One
evaluation was to compare the 5 most likely classes according to the
model and see if any prediction matched the ground truth. The
network used in [24] was called AlexNet and achieved a top-5 test
error rate of 15.3%, compared to 26.2% for the second-best entry.
Following this remarkable improvement, numerous other CNNs have
appeared, further improving results and breaking new records for
image-related problems.
2.3.1 Image Classification
Image classification is the task of assigning an image input a
label from a fixed set of categories [7]. It’s one of the core
problems in Computer Vision and there exist several different
approaches for how to solve them [30]. An image usually consists of
a three dimensional array (w, h, d) = (w, h, 3) where w is the
width of the image in pixels, h the height and d the depth which
are the three red, green and blue (RGB) color channels. The pixels
(elements) of the array are integers ranging from 0 (black) to 255
(white) [7]. This means that a 256x256 pixel image contains 256 ∗
256 ∗ 3 = 196608 integers. Using large datasets thus results in
massive amounts of data that need to be processed, but large
datasets are crucial for the performance of deep neural networks.
The process of achieving image classification using CNNs can be
described in a very basic way using a few steps. First the
architecture of the network must be constructed with the correct
input and output specifications. The possible outputs are all the
specified classes, and the output can be either a simple 1 for the
predicted class and 0 for the rest, or a probability for each
class, with the probabilities adding to 1. The network is then
trained by feeding it the entire training set of images together
with the correct outputs. The layers in the network discover and
store features from all the images, with the idea that some
combination of features strongly correlates with a specific class,
so that when a new image is fed into the trained network it is able
to predict the correct label. This process is visualized in Figures
2.8 and 2.9.
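The two output formats mentioned above, per-class probabilities adding to 1 and a 1/0 indicator for the predicted class, can be illustrated with a short sketch. The softmax function is one common way to obtain such probabilities; the raw scores and class names below are hypothetical and not taken from the thesis.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)                              # subtract max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def one_hot(probs):
    """Return 1 for the most likely class and 0 for the rest."""
    best = probs.index(max(probs))
    return [1 if i == best else 0 for i in range(len(probs))]

classes = ["cat", "dog"]         # hypothetical class names
probs = softmax([2.0, 0.5])      # hypothetical raw network outputs
label = classes[probs.index(max(probs))]
```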
Figure 2.8: Convolutional neural network being trained with images
of cats and dogs with ground truth labels
While Figure 2.8 implies that only 12 images were used for
training, training datasets are normally much larger. The dataset
Dogs vs. Cats [9] consists of 25,000 training images of dogs and
cats.
Figure 2.9: Test image fed into the network with correct
prediction
2.3.2 Convolutional Layers
Convolutional layers are the main building blocks in CNNs. They
consist of a number of two-dimensional feature maps whose size
depends on several factors: the input size, the filter size,
whether padding is enabled and the size of the stride. The input is
the output of the previous layer, for example an image of size
256x256x3. The filter is a small three-dimensional matrix, usually
of size 3x3x3 or 5x5x3, where the depth index determines which
slice of the filter is applied to which color channel. Starting
from one corner of the input, each filter slice is multiplied
element-wise with the underlying patch of its channel and summed (a
dot product), and the filter is then slid, or convolved, across the
input. At every position the dot products for all depths are added
together to produce one element of the feature map [7]. The number
of filters used determines the number of feature maps produced. The
first step of the process can be seen in Figure 2.10.
The figure shows an image of size 5x5x3 with zero-padding enabled.
Zero-padding means adding an outer border consisting solely of
zeros to the input matrices. Each channel of the input is dot
multiplied with the corresponding depth slice of each filter, and a
bias term is added to the output. The filter is moved 2 steps at a
time, i.e. the stride is 2. More specifically, it’s the first step
that’s highlighted: the three 3x3 patches x of the zero-padded
input are multiplied element-wise with the three 3x3 slices w of
the filter, the products are summed and the bias 1 is added, which
gives
$$\sum_{d=1}^{3} \sum_{i=1}^{3} \sum_{j=1}^{3} x_{ijd} \, w_{ijd} + 1 = -2.$$
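The sliding dot product described for Figure 2.10 can be sketched in plain Python for one filter. The input and filter values below are hypothetical (all ones) and are not the values from the figure; the helper names are chosen for this illustration only.

```python
def conv_output_size(n, f, pad, stride):
    """Spatial output size of a convolution along one dimension."""
    return (n + 2 * pad - f) // stride + 1

def conv2d(image, kernel, bias=0, pad=1, stride=2):
    """Convolve one filter over a multi-channel square image.

    image:  list of channels, each an n x n list of lists
    kernel: list of channels, each an f x f list of lists
    """
    n, f = len(image[0]), len(kernel[0])
    size = n + 2 * pad
    # Zero-pad every channel with a border of width `pad`
    padded = [[[0] * size for _ in range(size)] for _ in image]
    for c, channel in enumerate(image):
        for r in range(n):
            for col in range(n):
                padded[c][r + pad][col + pad] = channel[r][col]
    out = conv_output_size(n, f, pad, stride)
    fmap = [[bias] * out for _ in range(out)]
    for r in range(out):
        for col in range(out):
            # Sum the dot products of all depth slices at this position
            for c in range(len(kernel)):
                for i in range(f):
                    for j in range(f):
                        fmap[r][col] += (padded[c][r * stride + i][col * stride + j]
                                         * kernel[c][i][j])
    return fmap

# Hypothetical 5x5x3 input of ones and a 3x3x3 filter of ones, bias 1
image = [[[1] * 5 for _ in range(5)] for _ in range(3)]
kernel = [[[1] * 3 for _ in range(3)] for _ in range(3)]
feature_map = conv2d(image, kernel, bias=1)
```

With a 5x5 input, a 3x3 filter, zero-padding of 1 and a stride of 2, the feature map is 3x3, matching the output size stated in the caption of Figure 2.10.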
One of the main advantages of this method compared to the
previously mentioned calculations in regular neural networks is the
considerably smaller number of computations required. In regular
neural networks all neurons in a layer are connected with every
neuron in the adjacent layers. An image input of size 256x256x3,
where every pixel value represents a node, would result in
256 ∗ 256 ∗ 3 = 196608 weights for every single node in the next
layer. Usually many more neurons are necessary, which would result
in a massive number of weights needing to be continuously
calculated and updated. In convolutional layers each element of
each filter has its own weight, meaning that the number of weights
only depends on the size and number of the filters. A convolutional
layer with 64 filters of size 3x3x3, i.e. 64 feature maps, instead
only has (3 ∗ 3 ∗ 3) ∗ 64 = 1728 weights attached to it.
Figure 2.10: Convolutional layer with two filters W0 and W1,
zero-padded with a stride of 2 applied to an image of size 5x5x3
producing two feature maps of size 3x3, figure taken from [7]
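The weight counts in this comparison can be reproduced with a few lines of arithmetic. The helper names are hypothetical, bias terms are ignored, and the dense layer with 64 nodes is an assumed example chosen to mirror the 64 filters of the convolutional layer.

```python
def dense_weights(n_inputs, n_nodes):
    """Weights in a fully connected layer (bias terms ignored)."""
    return n_inputs * n_nodes

def conv_weights(filter_h, filter_w, in_channels, n_filters):
    """Weights in a convolutional layer (bias terms ignored)."""
    return filter_h * filter_w * in_channels * n_filters

per_node = dense_weights(256 * 256 * 3, 1)      # weights per node in the next layer
dense_total = dense_weights(256 * 256 * 3, 64)  # assumed layer with 64 nodes
conv_total = conv_weights(3, 3, 3, 64)          # 64 filters of size 3x3x3
```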
2.3.3 Pooling Layers
In CNNs it’s common to periodically insert a pooling layer
in-between successive convolutional layers. The idea is to
progressively reduce the spatial size of the representation in
order to reduce the number of parameters and the computational
burden as well as to prevent overfitting [7]. A common choice is a
2x2 filter with stride 2, which discards 75% of all activations, as
can be seen in Figure 2.11. The principle is similar to the sliding
filters of the convolutional layers, but instead of performing a
dot product the pooling filter selects a single value based on the
type of pooling layer used. The most commonly used type is
maxpooling, which picks the largest value within the filter and
discards the rest.
Figure 2.11: Maxpooling with a 2x2 pool filter and stride 2
This approach might appear unintuitive at first, discarding a large
number of potentially important activations based on a seemingly
arbitrary rule. However, when dealing with images, adjacent pixels
usually show a strong correlation, so it’s possible to reduce the
resolution without losing the distinguishing features and
patterns.
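A small sketch of maxpooling with a 2x2 filter and stride 2 follows. The 4x4 feature map is a hypothetical example and `max_pool` is a name chosen for this illustration.

```python
def max_pool(fmap, size=2, stride=2):
    """Keep the largest value in every size x size window of a 2D map."""
    rows = (len(fmap) - size) // stride + 1
    cols = (len(fmap[0]) - size) // stride + 1
    return [[max(fmap[r * stride + i][c * stride + j]
                 for i in range(size) for j in range(size))
             for c in range(cols)]
            for r in range(rows)]

# Hypothetical 4x4 feature map
fmap = [[1, 3, 2, 1],
        [4, 6, 5, 7],
        [8, 2, 0, 1],
        [3, 4, 9, 2]]
pooled = max_pool(fmap)
```

Here 16 activations are reduced to 4, which is the 75% reduction mentioned for this filter size and stride.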
2.3.4 Fully Connected Layers
The last layers of CNNs usually consist of one or several fully
connected layers (FCL). These layers work the same way as the
layers in regular neural networks, where every node in the layer is
connected to every node in the adjacent layers. Large FCLs often
hold almost all parameters of the entire network and are therefore
responsible for fitting complex nonlinear discriminant functions in
the feature space into which the input data elements are mapped
[2]. For example, the previously mentioned AlexNet has 60 million
parameters, with 58 million belonging to the last three FCLs [24].
Figure 2.12: The network architecture of AlexNet, taken from
[24]
Figure 2.12 shows the network architecture of the original AlexNet.
The reason why it had two parallel networks has to do with the
limitations in GPU memories back in 2012 when the paper was
released [24] and therefore two GPUs were used. A more commonly
used architecture these days is called CaffeNet and is the
concatenated version seen in Figure 2.13.
Figure 2.13: The network architecture of CaffeNet, taken from
[26]
This network consists of an input image of size 224x224x3, 5
convolutional layers and 3 fully connected layers, with the size of
the last layer matching the number of classification categories in
the ImageNet Large Scale Visual Recognition Challenge [31]. It’s
still widely used as part of the network in several recent state of
the art applications [13] [15] [39].
2.4 Generic Object Tracking
The goal of generic object tracking is rather easy to formulate.
Based solely on an initial set of coordinates for a bounding box
encompassing the desired, arbitrary object in the initial image
frame, predict the object’s location in all future frames [23].
Current generic object trackers predominantly learn their tracking
online, meaning they run and update in real-time, detecting the
object in each frame and updating with regard to possible
appearance changes [13]. This can produce very accurate long-term
trackers that are robust to occlusions and to appearance and
lighting changes. The downside is that only simpler models with
slower run-time are feasible, due to the computational effort it
takes to constantly update the tracker in real-time [3]. An
alternative, offline approach is to pre-train the entire model on a
large dataset and lock all the parameters at run-time to allow for
very fast tracking [15]. This method might also be beneficial for
mobile or embedded devices, which have tighter computational
constraints.
2.4.1 Transfer Learning
Transfer learning is the idea of using knowledge obtained from a
previous problem and applying it to a new problem [41]. For object
tracking it’s possible to begin with a model trained on image
classification before training it for tracking. The idea is that
the pretrained network will provide the new model with some
underlying understanding of the appearance of different objects. If
the dataset used for pretraining is large and diverse enough, the
feature maps learned can act as a generic model of the visual world
with potential applications for a wide range of computer vision
related problems [5].
2.4.2 Data Augmentation
One fundamental characteristic of deep learning is the need for
large datasets. Since it’s the responsibility of the network to
find the important features, instead of relying on manual feature
engineering, a large training dataset is needed, especially for
high-dimensional inputs such as images [5]. Datasets used for
training a network can occasionally be limited in size, resulting
in insufficient data for training a higher-capacity network [35].
This further risks overfitting the network to the training data.
One effective way around this is augmentation of the data. This can
easily be done by translating, scaling, rotating or otherwise
transforming an image from the training set. An example of
translation with scaling and horizontal mirroring can be seen in
Figure 2.14.
Figure 2.14: Data augmentation of an image of a cat
By transforming the image, the network is led to believe it’s
exposed to new images, which helps prevent overfitting and can
greatly increase the size of the dataset. When training AlexNet,
the researchers increased their training set by a factor of 2048
through data augmentation [24].
Image transformations can also be used to create artificial motion.
Assuming that subsequent frames in an image sequence are taken
within a small time interval, the object has most likely moved very
little relative to its previous position. One way to imitate this
movement is to apply a translation based on the width and height of
the object multiplied by a Laplace distributed variable, as in
[15],
$$c'_x = c_x + w \cdot \Delta x, \qquad c'_y = c_y + h \cdot \Delta y,$$
where (c_x, c_y) is the center of the object, w and h are the width
and height of a cropped image containing the object, and Δx and Δy
are modeled with a Laplace distribution with mean 0. To account for
potential size changes, the width and height can be transformed in
a similar manner,
$$w' = w \cdot \gamma_w, \qquad h' = h \cdot \gamma_h,$$
where γ_w and γ_h are Laplace distributed variables with mean 1.
This means that the width and height most likely remain the same,
while size changes are still possible.
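A minimal sketch of this motion model, using only the Python standard library, is given below. A Laplace sample is drawn as the scaled difference of two exponential samples, which is a standard construction. The scale parameters `b_shift` and `b_scale` are assumed values for illustration; the text above does not state which scales were used.

```python
import random

def sample_laplace(mu, b, rng):
    """Laplace(mu, b) sample as the scaled difference of two Exp(1) draws."""
    return mu + b * (rng.expovariate(1.0) - rng.expovariate(1.0))

def augment_box(cx, cy, w, h, rng, b_shift=0.2, b_scale=1 / 15):
    """Randomly translate and rescale a box given by center, width, height.

    b_shift and b_scale are assumed Laplace scale parameters.
    """
    dx = sample_laplace(0.0, b_shift, rng)   # relative shift, mean 0
    dy = sample_laplace(0.0, b_shift, rng)
    gw = sample_laplace(1.0, b_scale, rng)   # relative size change, mean 1
    gh = sample_laplace(1.0, b_scale, rng)
    return (cx + w * dx, cy + h * dy, w * gw, h * gh)

rng = random.Random(0)                       # fixed seed for repeatability
new_box = augment_box(100.0, 80.0, 40.0, 20.0, rng)
```

Because the distributions peak at 0 and 1 respectively, most augmented boxes stay close to the original, while occasional larger shifts and size changes still occur.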
A variable is said to have a Laplace distribution if its
probability density function is [22]
$$f(x; \mu, b) = \frac{1}{2b} \exp\left(-\frac{|x - \mu|}{b}\right),$$
where µ ∈ (−∞,∞) and b > 0 are location and scale parameters,
respectively [22]. The probability density function is visualized
in Figure 2.15.
Figure 2.15: Probability density functions for different parameter
values, figure taken from [40]
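The Laplace density with location µ and scale b, f(x; µ, b) = exp(−|x − µ| / b) / (2b), can be evaluated directly. The sketch below uses a hypothetical function name and checks numerically that the density integrates to roughly 1 over a wide interval.

```python
import math

def laplace_pdf(x, mu=0.0, b=1.0):
    """Probability density of the Laplace distribution at x."""
    return math.exp(-abs(x - mu) / b) / (2.0 * b)

# Crude Riemann sum over [-20, 20] as a sanity check of the normalization
step = 0.001
area = sum(laplace_pdf(-20.0 + k * step) * step for k in range(40000))
```

The peak value at x = µ is 1/(2b), so a smaller scale b gives a sharper peak, which corresponds to smaller typical motion in the augmentation above.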
2.4.3 Region Overlap Score
One way to measure the performance of the tracker, other than
visually evaluating its tracking, is to calculate its region
overlap, often referred to as accuracy.
Figure 2.16: Illustration of the idea behind region overlap
measurements. The red box depicts the predicted location of an
object while the green box is the object’s actual location
Figure 2.16 shows a simple illustration of a predicted bounding box
in red, overlapping with the ground truth bounding box in green. FP
stands for false positive, which is the area that the tracker
believes to be the object while it’s not. TP stands for true
positive and is the area where the tracker predicts the object’s
location correctly. FN stands for false negative and is the part of
the object that the tracker fails to predict.
An evaluation of the tracker’s performance can then be computed as
$$\frac{TP}{TP + FP + FN}. \qquad (2.19)$$
A perfect performance, where the predicted bounding box is in the
exact same position as the ground truth bounding box, would result
in FP = FN = 0 and (2.19) evaluating to 1. In the same way, the
performance equals 0 if the area of the true positive is 0.
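The region overlap score TP / (TP + FP + FN) can be sketched for axis-aligned boxes as follows. The box format (x1, y1, x2, y2), with the upper left and lower right corners, is an assumption chosen to match the corner coordinates used elsewhere in the text, and both boxes are assumed to have a nonzero area.

```python
def region_overlap(pred, truth):
    """Region overlap TP / (TP + FP + FN) for boxes (x1, y1, x2, y2)."""
    # Intersection rectangle; its area is the true positive region
    ix1, iy1 = max(pred[0], truth[0]), max(pred[1], truth[1])
    ix2, iy2 = min(pred[2], truth[2]), min(pred[3], truth[3])
    tp = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_pred = (pred[2] - pred[0]) * (pred[3] - pred[1])
    area_truth = (truth[2] - truth[0]) * (truth[3] - truth[1])
    fp = area_pred - tp      # predicted area that is not the object
    fn = area_truth - tp     # object area that the prediction misses
    return tp / (tp + fp + fn)

score = region_overlap((0, 0, 2, 2), (1, 1, 3, 3))
```

In this hypothetical example the boxes share one unit of area out of seven covered in total, so the score is 1/7.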
2.5 Network Architecture
The network is constructed from two pre-trained parallel
MobileNetV2 networks [32] ending in two FCLs, as can be seen in
Figure 2.17. This network architecture, with two parallel,
identical networks sharing the same weights, can be referred to as
a siamese neural network.
Figure 2.17: Network architecture with two cropped images as input
and bounding box coordinates as output
The input to each network is a cropped section of an image taken at
time t and t + 1 centered at the object of interest at time t. The
assumption is that objects move smoothly through space and the
previous position should then be a reasonable place to look for the
object’s current position. The network should also develop an
understanding of typical movement without including too much of the
background. The output of the network will be the coordinates for
the upper left and lower right corner of the bounding box capturing
the object. Figure 2.17 shows a cross-country skier moving forward
at a pace almost surpassing the cropped region. The position of the
skier at time t+1 is then predicted and given as output. It’s
important that the subsequent frames are close enough in time and
that the cropped regions are large enough so that the object has
not moved outside the cropped region at time t+1.
The choice of the MobileNetV2 network is based on it being adapted
to mobile or embedded devices through using less memory and
computations than more conventional networks, mostly thanks to its
implementation of depthwise separable convolutions.
2.5.1 Depthwise Separable Convolutions
MobileNet as well as MobileNetV2 are models based on depthwise
separable convolutions, a form of convolution factorized into two
separate parts: the depthwise convolution and the pointwise
convolution [16]. The reason why such a factorization is desirable
is its low computational cost compared to the standard
convolutional layers in 2.3.2. The structure of the factorized
parts is illustrated in Figure 2.18.
Figure 2.18: Standard convolutional layers being factorized into
depthwise and pointwise convolution, figure taken from [16]
Assume a zero-padded input layer of size D_I x D_I x M processed
with a stride of one, where D_I is the input width and height and M
is the number of input channels (the usual three color channels in
the case of images). With N filters of size D_K x D_K x M, a
standard convolutional layer will perform a total of
$$D_I \cdot D_I \cdot M \cdot N \cdot D_K \cdot D_K \qquad (2.20)$$
computations. With depthwise separable convolutions, depthwise
convolutional filters of size D_K × D_K × 1 are used first. There
is only one filter for each input channel, which results in a total
of D_I · D_I · M · D_K · D_K computations in the first step. This
step filters each input channel into its own output feature map.
The second step then creates linear combinations of these channels
to produce new features through pointwise convolution.
Convolutional filters of size 1 x 1 x M are applied to the output
feature maps from step one to create a final output of size
D_I × D_I × N, where N is the number of pointwise convolutional
filters used. The number of computations in the second step amounts
to M · N · D_I · D_I, leading to a total of
$$D_I \cdot D_I \cdot M \cdot D_K \cdot D_K + M \cdot N \cdot D_I \cdot D_I \qquad (2.21)$$
computations. This often results in significantly fewer
computations than for the standard convolution in (2.20). The ratio
between (2.21) and (2.20) becomes
$$\frac{D_I \cdot D_I \cdot M \cdot D_K \cdot D_K + M \cdot N \cdot D_I \cdot D_I}{D_I \cdot D_I \cdot M \cdot N \cdot D_K \cdot D_K} = \frac{1}{N} + \frac{1}{D_K^2}, \qquad (2.22)$$
which results in 8 to 9 times fewer computations for the depthwise
separable convolutions of size 3×3 used in MobileNet than for the
equivalent standard convolutional layer, at only a small loss in
accuracy [16].
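The computation counts for a standard convolution and a depthwise separable convolution, and their ratio 1/N + 1/D_K², can be checked numerically. The layer dimensions below are hypothetical and only serve as an example.

```python
def standard_cost(d_i, m, n, d_k):
    """Computations of a standard convolution: D_I^2 * M * N * D_K^2."""
    return d_i * d_i * m * n * d_k * d_k

def separable_cost(d_i, m, n, d_k):
    """Depthwise step plus pointwise step."""
    depthwise = d_i * d_i * m * d_k * d_k
    pointwise = m * n * d_i * d_i
    return depthwise + pointwise

# Hypothetical layer: 112x112 input, 128 input channels, 256 filters, 3x3 kernels
d_i, m, n, d_k = 112, 128, 256, 3
ratio = separable_cost(d_i, m, n, d_k) / standard_cost(d_i, m, n, d_k)
speedup = 1 / ratio
```

For this example the ratio equals 1/256 + 1/9, so the separable version needs between 8 and 9 times fewer computations. The saving approaches a factor of 9 as the number of filters N grows.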
2.5.2 Inverted Residuals and Linear Bottlenecks
MobileNetV2 predominantly consists of inverted residual blocks with
linear bottlenecks, more simply referred to as bottlenecks [32].
The structure can be seen in Figure 2.19. First a 1×1 pointwise
convolution is applied to the input with k channels, transforming
the low-dimensional tensor into a higher-dimensional space. Then a
ReLU6 activation is applied before performing depthwise convolution
using 3×3 filters, followed by another ReLU6 activation. ReLU6 is
similar to the previously discussed ReLU but has an upper limit of
6, ReLU6(x) = max(0, min(x, 6)) instead of the usual
ReLU(x) = max(0, x). It’s used due to its robustness when dealing
with low-precision computations [16]. Finally, another 1×1
pointwise convolution is performed, projecting the feature map back
to a lower-dimensional tensor. Due to the inevitable information
loss occurring in the last projection, empirical studies have shown
that it’s important to use a linear activation for the last layer
to prevent destroying too much information [32]. A skip connection,
or shortcut, between the bottlenecks is also implemented to allow
the gradient to propagate through multiple layers.
Figure 2.19: Inverted residual block with linear bottleneck, figure
taken from [32]
The idea behind this network architecture is built on the
presumption that the information in a set of layer activations
actually lies in some manifold, which in turn is embeddable into a
low-dimensional subspace [32]. The thin layers in Figure 2.19
represent those subspaces, allowing for fewer computations and less
memory intensive networks than more conventional architectures,
while retaining the information the network needs to achieve
comparable results.
The architecture of the MobileNetV2 network can be seen in Figure
2.20, where bottleneck refers to the block in Figure 2.19. The
expansion rate is given by t, the factor by which the number of
channels is increased in the first step of the bottleneck. The
number of output channels is denoted by c, n shows how many times
the same layer is repeated in sequence and s denotes the stride.
Figure 2.20: Network architecture of MobileNetV2, figure taken from
[32]
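Two details of the bottleneck block can be illustrated with a short sketch: the ReLU6 activation, and how the channel count changes through the block for an expansion rate t and c output channels. The helper names and the example values k = 16, t = 6 and c = 24 are assumptions for illustration only.

```python
def relu(x):
    """Standard rectified linear unit."""
    return max(0.0, x)

def relu6(x):
    """ReLU with an upper limit of 6."""
    return max(0.0, min(x, 6.0))

def bottleneck_channels(k, t, c):
    """Channel counts after the expansion, depthwise and projection steps."""
    expanded = k * t                  # 1x1 pointwise expansion by factor t
    return [expanded, expanded, c]    # depthwise keeps the count, projection gives c

activations = [relu6(x) for x in (-1.0, 3.0, 10.0)]
channels = bottleneck_channels(16, 6, 24)   # assumed example values
```

The block is wide in the middle and thin at both ends, which is why it is described as an inverted residual block.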
Methodology
The tracker was built in Python using Keras [6], an open-source
neural network library built on top of TensorFlow [27], an
open-source library for machine learning applications. Keras
provides a framework for constructing the necessary components for
using neural networks. An NVIDIA Quadro P4000 GPU was used to train
the tracker.
3.1 Dataset
The image sequences and bounding box annotations used for training
this network come from the Amsterdam Library of Ordinary Videos
300++ dataset (ALOV) [36]. The dataset consists of approximately
90000 frames from 314 video sequences ordered in 14 different
categories aimed to cover a diverse set of circumstances such as
illumination, transparency, specularity, confusion with similar
objects, clutter, occlusion, zoom, severe shape changes, motion
patterns, low contrast images and more [36]. The image sequences
come from short videos with an average length of 9.2 seconds,
ranging up to a few minutes at most. Every fifth video frame is
annotated with a ground truth rectangular bounding box.
The part of the network excluding the fully connected layers is
pre-trained on ImageNet [31], and the associated weights are then
frozen. This means that only the weights connected to the fully
connected layers are adjusted during training.
3.1.1 Data Preprocessing
The mean value for each color channel in the ImageNet training set
is first subtracted from all the image training data before
normalizing the data by dividing all pixels by 255, the largest
pixel value.
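The preprocessing step can be sketched as follows. The per-channel mean values are the commonly quoted ImageNet RGB means and are an assumption here, since the text does not list the values that were used; the function names are hypothetical.

```python
# Assumed, commonly quoted ImageNet channel means (R, G, B); not given in the text
IMAGENET_MEAN_RGB = (123.68, 116.779, 103.939)

def preprocess_pixel(rgb, mean=IMAGENET_MEAN_RGB):
    """Subtract the channel means, then divide by 255."""
    return tuple((value - m) / 255.0 for value, m in zip(rgb, mean))

def preprocess_image(image, mean=IMAGENET_MEAN_RGB):
    """Apply the pixel preprocessing to an image given as rows of RGB tuples."""
    return [[preprocess_pixel(pixel, mean) for pixel in row] for row in image]

# Hypothetical 1x2 image: one black and one white pixel
processed = preprocess_image([[(0, 0, 0), (255, 255, 255)]])
```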
3.1.2 Data Augmentation
The training dataset is increased by a factor of 10 using data
augmentation. Each image crop is transformed using the method from
section 2.4.2 to produce the second image crop used as input. If
the augmented image had a width or height more than 40% larger or
smaller than the original image, it was discarded. The same was
done for augmented images where the object’s center had moved more
than half the original image’s width or height. This was done to
prevent the network from being exposed to unrealistic scenarios or
excessive movements.
3.2 Network Input / Output
The network architecture consists of two parallel MobileNetV2
networks [32] concatenated into two fully connected layers with
2048 nodes each. The input to each network is one image crop of
size 224 x 224 x 3, with the two crops taken from two subsequent
timesteps, similar to [15], [13], [20], [10]. The crops are taken
from an image sequence at time t and t + 1. Both crops are centered
at the object’s position at time t, with a width and height twice
the width and height of the bounding box. The output of the network
is then the bounding box coordinates for the object at time t + 1.
It’s important that the time between two subsequent image frames is
small enough that the object has not become too occluded or moved
far enough to be outside the crop at time t + 1.
3.3 Training
3.3.1 Training Data Preparation
The ALOV dataset used for training consists of large images of
different sizes with corresponding bounding box coordinates for a
portion of the images. The images annotated with ground truth
coordinates were read and the objects within the images were
cropped with twice the width and height of the ground truth
bounding box, centered at the same position. Data augmentation is
first used to increase the size of the training dataset by a factor
of 10 by transforming the crop, using the method from section
2.4.2. For every image followed by a subsequent image from the same
image sequence, another crop was produced from the second image,
centered at the object from the first image and with the same width
and height. The crops are all resized to 224 x 224 x 3, which is
the desired input size for the network.
3.3.2 Network Training
The idea behind the architecture is to teach the network to find
the similarities between the two crops and to learn typical
movement patterns. During training the network was fed with the
image crop pairs created in the training data preparation. The
network then measures the loss between the predicted value and the
ground truth value and updates its weight parameters accordingly.
The network was trained with the Adam optimizer [21], using mean
absolute error (L1-loss) as the loss function, a learning rate of
10^-3 and a batch size of 50. The tracker was trained multiple
times with varying training times. The training duration was
measured in epochs, where one epoch is a full run-through of the
entire training dataset.
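The mean absolute error between predicted and ground truth bounding box coordinates can be sketched in a few lines. The coordinate values are hypothetical, and averaging over the batch is an assumption about how the per-sample losses are combined.

```python
def l1_loss(predicted, target):
    """Mean absolute error between two coordinate lists of equal length."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

def batch_l1_loss(predictions, targets):
    """Average of the per-sample L1 losses over a batch."""
    losses = [l1_loss(p, t) for p, t in zip(predictions, targets)]
    return sum(losses) / len(losses)

# Hypothetical predicted and true boxes (x1, y1, x2, y2)
loss = l1_loss([10, 20, 50, 60], [12, 18, 50, 64])
```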
3.4 Tracking
When initializing the tracker, the only information given is the
coordinates of the initial bounding box’s upper left and lower
right corners, (x1,y1) and (x2,y2). The tracker then crops the
first two images of the image sequence, centered at the initial
bounding box but with twice the width and height, such that
Crop width = 2 ∗ (x2 − x1), Crop height = 2 ∗ (y2 − y1). (3.1)
The output of the network is the predicted coordinates for the
object in the second image. The tracker then crops the second and
third images using the same procedure as in (3.1) but based on the
new predicted coordinates, and continues doing this for the entire
image sequence. This also highlights the importance of tracking
robustness, considering that all the training data is based on the
object being close to perfectly centered in the initial crop. If
the tracker’s output does not accurately match the true object, the
object will no longer be centered in the next crop pair, which
further increases the risk of losing track of the object. This is
perhaps the greatest challenge with offline trackers compared to
their online counterparts. The tracking procedure is illustrated in
the flowchart in Figure 3.1.
Figure 3.1: Flowchart of tracking procedure, figure created in
app.diagrams.net
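The tracking procedure can be sketched as a loop in which every prediction becomes the basis for the next crop. The function `predict` is a hypothetical stand-in for the trained network, and the dummy predictor `shift_right` only exists to make the example runnable; neither is part of the actual tracker.

```python
def crop_region(box):
    """Crop (left, top, right, bottom) centered on box (x1, y1, x2, y2),
    with twice the width and height of the box, as in (3.1)."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    crop_w, crop_h = 2 * (x2 - x1), 2 * (y2 - y1)
    return (cx - crop_w / 2, cy - crop_h / 2,
            cx + crop_w / 2, cy + crop_h / 2)

def track(frames, initial_box, predict):
    """Track through `frames`, feeding each prediction into the next step.

    `predict(prev_frame, next_frame, region)` is a hypothetical stand-in
    for the network and returns the new box in image coordinates.
    """
    boxes = [initial_box]
    for prev_frame, next_frame in zip(frames, frames[1:]):
        region = crop_region(boxes[-1])   # same region for both frames
        boxes.append(predict(prev_frame, next_frame, region))
    return boxes

def shift_right(prev_frame, next_frame, region):
    """Dummy predictor: the object moves one pixel to the right."""
    left, top, right, bottom = region
    cx, cy = (left + right) / 2, (top + bottom) / 2
    w, h = (right - left) / 2, (bottom - top) / 2
    return (cx - w / 2 + 1, cy - h / 2, cx + w / 2 + 1, cy + h / 2)

boxes = track(["frame0", "frame1", "frame2"], (10, 20, 30, 60), shift_right)
```

Since only the initial box comes from outside, any error in a predicted box shifts the next crop as well, which is how small errors can accumulate over a long sequence.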
The test/tracking dataset consists of videos from the Visual Object
Tracking (VOT2014) benchmark test dataset. Some of these videos
also belong to the ALOV dataset and were therefore discarded to
prevent testing on the training data.
3.5 Tracking Using the Jetbot
The Jetbot was connected to a computer through a shared network
using a WiFi adapter, which allowed the user to control the Jetbot
using Jupyter Notebook on the computer [19]. The Jetbot included a
64 GB pre-flashed MicroSD card containing library packages designed
for using the NVIDIA Jetson Nano Developer Kit together with the
Jetbot robot [37]. The camera provided was a Leopard Imaging camera
with a 145 degree field of view that could easily be accessed
through the simple commands provided. A screenshot of a typical
camera view can be seen in Figure 3.2.
Figure 3.2: The view of the Jetbot
Initially, the idea was to run the tracker in real-time with the
Jetbot, but there was a small delay between the image being
recorded and the image being displayed on the computer, making it
difficult to apply the tracker. Instead, an image sequence was
recorded from the Jetbot and the tracker was applied afterwards.
The tracker first had to be reduced significantly in size before
being applied to the robot due to memory limitations. The initial
size of the model was slightly larger than 3 GB, mostly attributed
to the large number of weights between the first fully connected
layer and the preceding layer of the two MobileNetV2 networks. The
last layer of the MobileNetV2 networks had accidentally been
discarded, which connected the first fully connected layer to the
second to last layer, a layer with a significantly higher number of
nodes. Reintroducing the final layer of the siamese MobileNetV2
networks led to far fewer weights and a much smaller model,
allowing it to be applied to the Jetbot.
4.1 Tracking Scenarios
Figure 4.1: Images taken from a video following a ball, for the
trackers trained for 5, 10 and 20 epochs (panels: 5 epoch tracker,
10 epoch tracker, 20 epoch tracker). The object’s predicted
position according to the tracker is depicted by the green bounding
box
Figure 4.2: Region overlap score for the ball sequence, with the
tracker trained for 5, 10 and 20 epochs, red line showing the
chosen threshold at 0.5
Figure 4.1 depicts a red ball being kicked back and forth between
two people. The images for the 5 and 10 epoch trackers are taken at
the same points in time, ordered chronologically from left to right
and covering the entire sequence, while the images for the 20 epoch
tracker are taken from a shorter sequence because it loses track
earlier. The major difficulties in this tracking situation lie in
the rapid movement and change of direction of the ball when being
kicked, and also in the rotation of the ball, which is likely
largely mitigated by the distinct, separate background. The clip is
rather long at around 20 seconds, which highlights one of the major
difficulties with offline trackers. The tracker is only provided
with the initial bounding box coordinates and has to rely on its
own predictions as input for future frames. Consecutive small
errors thereby accumulate into larger deviations from the ground
truth, which in turn makes the tracker unstable. As can be observed
in the figure, the tracker with the shortest training performed
considerably better than the two others, managing to keep track of
the ball for the entire duration.
The tracker that had been trained for 10 epochs managed to follow
the ball for some time before the predicted bounding box started to
get stretched out, as can be seen in the fourth picture of the
sequence. The stretched bounding box results in a considerably
larger search area, due to the cropping procedure where a crop with
twice the width and height of the bounding box is used. This leads
to a large part of the background being included in the search
area, which in this case includes one of the people. This is seen
in the last frame, where the tracker suddenly starts to wander off,
completely losing track of the ball.
The tracker that had been trained the longest performed the worst,
with its bounding box continuously expanding before losing track
entirely. This could perhaps be the result of overfitting, where
the tracker relies too heavily on the training set. It’s worth
noting that the occurrence of sphere shaped objects is rather
limited in the training set, which can increase the difficulty. The
fact that the tracker trained for 5 epochs performed significantly
better than the others could indicate that the longer-trained
trackers overfit, but it was also the only scenario where this
tracker performed best. A reason for this might be the nature of
the scenario, where an object not frequently seen during training,
with a distinct separate background, can more easily be followed by
a tracker not yet adapted to more complex situations.
Figure 4.2 shows the region overlap between the predicted bounding
box and the ground truth. The 5 epoch tracker had a dip at the
beginning of the sequence but managed to recover, maintaining a
high overlap but with notable oscillations. The 10 epoch tracker
lost track completely at around the 400th frame but managed to
recover before losing track at the end. It’s worth noting that
while the 5 epoch tracker managed to stay above 0.7 for most of the
time, the 10 epoch tracker was often around the threshold. The 20
epoch tracker lost track quickly.
Figure 4.3: Images taken from a video following a basketball
player, for the trackers trained for 5, 10 and 20 epochs (panels: 5
epoch tracker, 10 epoch tracker, 20 epoch tracker). The object’s
predicted position according to the tracker is depicted by the
green bounding box
Figure 4.4: Region overlap score for the basketball sequence, with
the tracker trained for 5, 10 and 20 epochs, red line showing the
chosen threshold at 0.5
The scenario shown in Figure 4.3 proved to be one of the more
difficult tracking scenarios, involving occlusions by other
players, a highly detailed and dynamic background and a long video
duration. None of the trackers performed particularly well, with
the 20 epoch tracker being the worst, immediately drifting off.
Therefore only a short sequence is shown for the 20 epoch tracker.
The other two trackers performed similarly. After initially
following the intended player in green, the tracker switches to the
player in white after the players’ close encounter and continues to
follow this player for a while before eventually losing track.
Since the tracker has no memory during tracking, it’s easy to
understand how it can lose track of the green player when the white
player enters the search region. So while it did manage to follow
one of the players, it was ultimately the wrong player, and this
can be considered a tracking failure. Interestingly, the tracker
also loses track of the wrong player at the end, seemingly being
drawn to the distinct red circle in the middle.
Figure 4.4 shows no particularly interesting behaviour. The 10
epoch tracker managed to recover for a short duration after
completely losing track, but not for long.
Figure 4.5: Images taken from a video following a bicycle, for the
trackers trained for 5, 10 and 20 epochs (panels: 5 epoch tracker,
10 epoch tracker, 20 epoch tracker). The object’s predicted
position according to the tracker is depicted by the green bounding
box
Figure 4.6: Region overlap score for the bicycle sequence, with the
tracker trained for 5, 10 and 20 epochs, red line showing the
chosen threshold at 0.5
In Figure 4.5 a woman is shown riding her bicycle down the road.
Once again, the tracker that had been trained the longest performed
the worst, managing to keep track for a short while before its
bounding box became stretched out and wandered off, the same
negative behaviour displayed in the two previous scenarios. The
tracker trained for 5 epochs kept good track of the woman at the
beginning but lost track after a group of people passed by in the
background and never regained it. The tracker trained for 10 epochs
performed the best, following the woman accurately for the entire
duration despite the rather challenging task. The main difficulties
with the scenario were the shaky camera, the size change as the
woman first gets closer to the camera before biking away, and the
detailed background, with a group of people in the middle of the
video.
Figure 4.6 shows the three trackers' performance on the bicycle
sequence. Both the 5 and 20 epoch trackers had decent performance
before frame 150, presumably when the crowd of people appeared in
the background. The 5 epoch tracker managed to recover at the end
after having completely lost track for a significant amount of
time, but that could probably be attributed to luck. The 10 epoch
tracker performed well for the entire sequence, although with
steadily decreasing performance, making it unclear whether it would
continue to perform well on a similar but longer video.
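The region overlap score used in these plots can be sketched as
follows. This is a minimal illustration that assumes the score is
the intersection over union of the predicted and annotated bounding
boxes, with boxes given as (x_min, y_min, x_max, y_max); the
function names are hypothetical and not taken from the thesis code.

```python
def region_overlap(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping rectangle (zero if disjoint).
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = max(0.0, ax2 - ax1) * max(0.0, ay2 - ay1)
    area_b = max(0.0, bx2 - bx1) * max(0.0, by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def fraction_above_threshold(predicted, ground_truth, threshold=0.5):
    """Share of frames whose overlap reaches the chosen threshold."""
    scores = [region_overlap(p, g) for p, g in zip(predicted, ground_truth)]
    if not scores:
        return 0.0
    return sum(score >= threshold for score in scores) / len(scores)
```

Plotting `region_overlap` per frame gives a curve of the kind shown
in the overlap figures, and `fraction_above_threshold` condenses a
sequence into a single number relative to the 0.5 threshold.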
Figure 4.7: Images taken from a video following a car, with the
tracker trained for 5, 10 and 20 epochs (panels: 5 epoch tracker,
10 epoch tracker, 20 epoch tracker). The object's predicted
position according to the tracker is depicted by the green bounding
box.
Figure 4.8: Region overlap score for the car sequence, with the
tracker trained for 5, 10 and 20 epochs. The red line shows the
chosen threshold of 0.5.
Figure 4.7 is notably the only video in the test data where the
longest trained tracker performed the best. Apart from a small
obstacle, with trees slightly obscuring the car in the middle of
the sequence, the major difficulty with this task is the
significant size change of the car as it drives down the road.
While the 5 and 10 epoch trackers both managed to follow the car
for the entire duration, they both had problems adapting to the
size change, predicting a bounding box slightly bigger than the
initial one but nowhere near the size of the car at the end.
Although the 20 epoch tracker still had some problems keeping the
car within the bounding box for the entire duration, as can be seen
in the third and fourth frames, the tracker is still deemed
successful, keeping an accurate track of the car for the full
length of the video.
Figure 4.8 shows how the region overlap score continues to decrease
for the 5 and 10 epoch trackers, both unable to account for the
significant size change in the sequence. The 20 epoch tracker
also has a decreasing tendency
Figure 4.9: Images taken from a video following a jogger, with the
tracker trained for 5, 10 and 20 epochs (panels: 5 epoch tracker,
10 epoch tracker, 20 epoch tracker). The object's predicted
position according to the tracker is depicted by the green bounding
box.
Figure 4.10: Region overlap score for the jogging sequence, with
the tracker trained for 5, 10 and 20 epochs. The red line shows the
chosen threshold of 0.5.
In Figure 4.9 two women are seen jogging alongside each other. Once
again the longest trained tracker starts by expanding its search
area, quickly losing track and drifting away. Both the other
trackers perform decently, keeping track of the intended jogger
even through a brief occlusion by a traffic light pole. The
shortest trained tracker gets slightly confused right at the end,
still tracking the jogger but at risk of losing track. When the
background changes from the distinct green grass to the lighter,
grey concrete, the 10 epoch tracker instead switches to the other
jogger.
Figure 4.10 shows how the 5 epoch tracker manages to stay above the
threshold for the entire duration, with significant oscillations
but no sign of decreasing performance. The 10 epoch tracker, on the
other hand, keeps good track for the most part but has a short,
significant dip at the beginning before recovering. Its performance
also decreases steadily, and it eventually loses track completely
at the end. The 20 epoch tracker lost track multiple times and
never managed to keep a high overlap score.
Figure 4.11: Average region overlap score for the tracker trained
for 5, 10 and 20 epochs for all five scenarios, in the same order
as they are presented in the Results section. The threshold value
is shown as a red dotted line and the green line represents the
average over all scenarios.
Unsurprisingly, after having reviewed the scenarios, the longest
trained tracker at 20 epochs performed the worst, with an average
region overlap slightly above 0.3. Its presence in the report is
mostly justified by its unique performance on the car sequence;
otherwise it showed unremarkable results for a tracker. The 5 and
10 epoch trackers achieved almost the same averages, slightly below
the chosen threshold of 0.5. While the 5 epoch tracker performed
very well on the ball and jogging sequences, the 10 epoch tracker
outperformed it on both the bicycle and car sequences. Both
averages are heavily influenced by the poor performance on the
basketball sequence. The results for the 5 epoch tracker were
rather unpredictable, ranging from excellent to rather poor
(excluding the basketball sequence). The 10 epoch tracker did not
achieve the same long-duration robustness and high region overlap
as the 5 epoch tracker but was more predictable in its performance,
hovering slightly above the threshold and often providing decent
tracking for a larger variety of scenarios, although with
decreasing performance over time. For a tracker that is supposed to
be as generic as possible, working across many different scenarios,
the 10 epoch tracker might be the preferred option.
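The per-scenario averages and the overall average discussed above
could be computed along the following lines. This is a sketch under
the assumption that the overall value is the unweighted mean of the
per-sequence means; the function name and the numbers in the
example are made up for illustration and are not results from the
thesis.

```python
from statistics import mean


def summarise_overlap(scores_by_sequence, threshold=0.5):
    """Average per-frame overlap per sequence, then across sequences."""
    per_sequence = {
        name: mean(scores) for name, scores in scores_by_sequence.items()
    }
    # Unweighted mean: every sequence counts equally, whatever its length.
    overall = mean(list(per_sequence.values()))
    above = [name for name, value in per_sequence.items() if value >= threshold]
    return per_sequence, overall, above


# Made-up per-frame scores, purely to show the call.
example = {"sequence_a": [0.8, 0.7, 0.6], "sequence_b": [0.6, 0.3, 0.0]}
per_sequence, overall, above = summarise_overlap(example)
print(per_sequence, overall, above)
```

Averaging per sequence first keeps a single long video from
dominating the overall score, which matters when the sequences
differ in length.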
Figure 4.12: The model loss and accuracy while being trained for 50
epochs
Figure 4.12 shows the training accuracy as well as the training
and validation loss of the model when trained for 50 epochs. The
training loss continues to decrease slightly for the entire
duration, while the validation loss also decreases but at a
significantly slower rate. This might be a sign that the model is
becoming overfitted to the training data. However, it was difficult
to pinpoint exactly where this overfitting started to occur. Other
than the trackers trained for 5, 10 and 20 epochs evaluated in this
thesis, trackers trained for 3, 7, 15 and 50 epochs were also
evaluated, but without noteworthy results. Worth noting is that the
50 epoch tracker performed terribly on the entire test dataset,
showing typical overfitting behaviour. It is likely that the
model's performance can be improved with a larger dataset, allowing
for more training before showing signs of overfitting.
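One standard way to choose a stopping point from a validation loss
curve is patience-based early stopping. The sketch below is an
illustration of that heuristic, not something the thesis reports
having used; the function name, the `patience` value and the
example losses are assumptions.

```python
def early_stopping_epoch(val_losses, patience=3, min_delta=0.0):
    """Return (best_epoch, stop_epoch), both 1-based.

    best_epoch is the epoch with the lowest validation loss seen so
    far; stop_epoch is where training would halt because no
    improvement larger than min_delta occurred for `patience` epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    stop_epoch = len(val_losses)
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best_loss - min_delta:
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            stop_epoch = epoch
            break
    return best_epoch, stop_epoch


# Made-up validation losses, purely for illustration.
print(early_stopping_epoch([1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.9]))
```

Keras provides a comparable `EarlyStopping` callback. Because the
validation loss here kept decreasing slowly while tracking quality
varied between epochs, a criterion based on the region overlap of
held-out sequences may be a more informative signal than the loss
alone.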
4.2 Dataset
The dataset consisted of approximately 16000 annotated frames from
314 video sequences. Increasing the amount of data by a factor of
10 using data augmentation results in a training dataset with
around 176000 annotated frames. While this is a rather large
dataset that provided adequate training for the tracker to achieve
satisfactory results, it is possible that even better results could
be attained if the training data were larger still, including far
more videos or still pictures that could be used for data
augmentation, especially since more training did not necessarily
lead to better results. There is always an upper limit to how much
training is beneficial before the model becomes overfitted. The
fact that the tracker trained for only 5 epochs performed notably
better in some situations than the tracker trained for 10 epochs,
which achieved a slightly higher average, suggests that the
training might be improved by using a larger and more diverse
dataset.
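The frame counts above are consistent if the ten augmented copies
of each frame are added on top of the original frames rather than
replacing them. That reading is an assumption, spelled out in the
short calculation below with hypothetical variable names.

```python
original_frames = 16_000          # approximate figure quoted in the text
augmented_copies_per_frame = 10   # assumption: ten augmented copies per frame

augmented_frames = original_frames * augmented_copies_per_frame
total_frames = original_frames + augmented_frames  # originals are kept

print(total_frames)
```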
4.3 Tracking Speed
The main motivation for using an offline tracker is the speed at
which it can process images. The tracker used in this thesis
achieved a speed of 50 frames per second (fps) on the computer
specified in the report. This speed could likely be increased
further by applying optimization methods for neural networks, or
by using a computer with a faster GPU.
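A tracking speed of this kind can be measured with a simple timing
loop. The sketch below assumes a hypothetical callable
`track_frame` that runs the tracker on one frame; it is not the
measurement code used in the thesis. A few warm-up calls are
excluded from the timing, since the first inference calls on a GPU
are often slower than the rest.

```python
import time


def measure_fps(track_frame, frames, warmup=5):
    """Average frames per second of track_frame over frames."""
    frames = list(frames)
    # Warm-up calls are run but not timed.
    for frame in frames[:warmup]:
        track_frame(frame)
    timed = frames[warmup:]
    if not timed:
        raise ValueError("need more frames than warm-up iterations")
    start = time.perf_counter()
    for frame in timed:
        track_frame(frame)
    elapsed = time.perf_counter() - start
    return len(timed) / elapsed if elapsed > 0 else float("inf")
```

Measuring the reduced model on the Jetbot and the original model on
the desktop computer with the same loop would make the two frame
rates directly comparable.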
4.4 Tracking With the Jetbot
After reducing the size of the model, the tracker was applied to
the Jetbot. An image sequence of a cup moving along a desk was
first recorded, and the tracker was then applied to it. The
tracker's performance dropped significantly after the size
reduction, as can be seen in Figure 4.13.
Figure 4.13: The reduced tracker being applied to a short image
sequence following a cup
The reduced tracker quickly loses track and the bounding box
becomes stretched out. Interestingly, when downloading the images
and running the original tracker on the sequence, the results were
also rather poor. While the performance was better than for the
reduced tracker, it still had trouble following the cup accurately.
Figure 4.14 shows the original tracker being applied to the same
image sequence, following the cup for a short while before its
bounding box also becomes stretched out. This might be an
indication that the images from the Jetbot differ from those in the
training and test datasets. The Jetbot camera has a field of view
of 135 degrees, which might affect the tracker's predictions.
Figure 4.14: The original tracker being applied to a short image
sequence following a cup
The original model of the tracker had a size larger than 3 GB,
consisting mostly of millions of weights. The vast majority of
these weights belonged to the first fully connected layer and the
last layers of the MobileNet networks. This was initially a
mistake: the last layer of the parallel MobileNet networks from
Figure 2.20 had been removed, connecting all the nodes of the first
fully connected layer to the previous layer in the MobileNet
networks. This resulted in considerably more weights than intended
but could also have proved crucial for the original tracker's
performance.
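The effect of such a change on the number of weights can be
illustrated with a rough parameter count. All dimensions below are
hypothetical: the sketch assumes that the removed layer was a
pooling layer, that each MobileNet branch ends in a 7 x 7 x 1280
feature map, and that the first fully connected layer has 4096
units. None of these values are confirmed by the thesis.

```python
def dense_parameters(inputs, units):
    """Weights plus biases of a fully connected layer."""
    return inputs * units + units


def size_in_gb(parameters, bytes_per_parameter=4):
    """Storage for 32-bit floating point parameters, in gigabytes."""
    return parameters * bytes_per_parameter / 1e9


units = 4096                        # hypothetical width of the first dense layer
channels = 1280                     # hypothetical channels per branch
flattened = 2 * 7 * 7 * channels    # two branches, feature maps flattened
pooled = 2 * channels               # two branches, pooled to one value per channel

for label, inputs in (("flattened", flattened), ("pooled", pooled)):
    params = dense_parameters(inputs, units)
    print(label, params, round(size_in_gb(params), 3), "GB")
```

Under these assumptions the flattened variant needs about 49 times
as many weights in the first fully connected layer as the pooled
one, which shows how a single removed layer can move a model from
tens of megabytes to gigabytes.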
In order to be applied to the Jetbot, the size of the tracker had
to be reduced significantly. The size reduction led to a model with
worse performance. Time constraints became an issue due to the long
training times of the models, ranging from 7 hours to 4 days each
time. The training and validation loss curves can indicate when
the model is starting to overfit to the training data but do not
show when the best model has been achieved, as illustrated by the
different results of the 5, 10 and 20 epoch trackers. The reduced
tracker could possibly also be trained to perform well by trying
different hyperparameters, introducing or removing layers of the
network, and finding a good training time. If the number of
parameters of the model is crucial for its performance, the size of
the reduced model could be increased slightly, or perhaps greatly
if another, more optimized model is used.
Conclusion & Further Work
While the tracker did show satisfactory results in many
situations, it still had difficulties with some harder scenarios.
One of the most important aspects of training neural networks is
access to good data. This usually translates to a large and diverse
dataset that can further improve training. Currently, the number of
video sequences with annotated bounding boxes is rather limited,
which could limit the potential for robust, accurate trackers. In
the meantime, the dataset could be expanded using data augmentation
on still images with annotated bounding boxes by creating the
illusion of movement.
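One way to create such an illusion of movement from a single
annotated still image is to cut two crops from it: one centred on
the object and one centred on a randomly shifted and rescaled
location, so that the object appears displaced in the second crop.
The sketch below only computes the crop rectangles and the target
box; the function names, the uniform sampling and the parameter
values are assumptions, not the augmentation algorithm used in the
thesis.

```python
import random


def perturb_box(box, max_shift, max_scale, rng):
    """Randomly shift and rescale an (x_min, y_min, x_max, y_max) box."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx = x1 + w / 2 + rng.uniform(-max_shift, max_shift) * w
    cy = y1 + h / 2 + rng.uniform(-max_shift, max_shift) * h
    new_w = w * (1 + rng.uniform(-max_scale, max_scale))
    new_h = h * (1 + rng.uniform(-max_scale, max_scale))
    return (cx - new_w / 2, cy - new_h / 2, cx + new_w / 2, cy + new_h / 2)


def crop_region(box, context):
    """Rectangle centred on box and `context` times its size."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w, half_h = context * (x2 - x1) / 2, context * (y2 - y1) / 2
    return (cx - half_w, cy - half_h, cx + half_w, cy + half_h)


def to_crop_coordinates(box, crop):
    """Express box relative to crop, normalised to the range 0..1."""
    cx1, cy1, cx2, cy2 = crop
    cw, ch = cx2 - cx1, cy2 - cy1
    x1, y1, x2, y2 = box
    return ((x1 - cx1) / cw, (y1 - cy1) / ch, (x2 - cx1) / cw, (y2 - cy1) / ch)


def make_training_pair(box, max_shift=0.2, max_scale=0.15, context=2.0,
                       rng=random):
    """Return (previous_crop, search_crop, target) for one still image."""
    previous_crop = crop_region(box, context)
    search_crop = crop_region(perturb_box(box, max_shift, max_scale, rng),
                              context)
    target = to_crop_coordinates(box, search_crop)
    return previous_crop, search_crop, target
```

The crop rectangles may extend beyond the image border, so an
implementation that extracts the pixels would also need to pad or
clamp them.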
For any potential application, the training dataset as well as the
data augmentation algorithm could be adjusted to account for a more
specific task, such as a specific set of target objects, slower or
faster movement patterns, or a specific background. While the
tracker can be considered very fast compared to trackers that
operate online at all times, a compromise could perhaps be achieved
where the tracker is still extensively trained offline but is
allowed to adjust some of its internal parameters during tracking,
giving a more robust long-duration tracker while maintaining a high
speed.
The tracker had to be reduced significantly in size before being
applied to the Jetbot. This was due to its original size of 3 GB,
while the Jetbot only had 4 GB of RAM. The tracker was reduced to
around 70 MB and then successfully applied to the Jetbot but
performed rather poorly on simple tracking scenarios. It is unclear
whether the reduced tracker can achieve performance similar to that
of the original tracker, but its performance can likely be improved
through further testing. This testing can include hyperparameter
tuning, increased training times, a larger training dataset and
model optimization.
Bibliography
[1] Taiwo Oladipupo Ayodele. “Machine learning overview”. In: New
Ad- vances in Machine Learning (2010).
[2] SH Shabbeer Basha et al. “Impact of fully connected layers on
performance of convolutional neural networks for image
classification”. In: Neurocom- puting 378 (2020), pp.
112–119.
[3] Luca Bertinetto et al. “Fully-convolutional siamese networks
for object tracking”. In: European conference on computer vision.
Springer. 2016, pp. 850–865.
[4] Christopher M. Bishop. Pattern Recognition and Machine Learning
(Infor- mation Science and Statistics). Berlin, Heidelberg:
Springer-Verlag, 2006. isbn: 0387310738.
[5] Francois Chollet. Deep Learning with Python. Manning, Nov.
2017. isbn: 9781617294433.
[6] Francois Chollet et al. Keras. https://keras.io. 2015.
[7] CS231n: Convolutional Neural Networks for Visual Recognition.
http:
//cs231n.github.io/classification/. Accessed: 2020-02-20.
[8] Jia Deng et al. “ImageNet: a Large-Scale Hierarchical Image
Database”. In: June 2009, pp. 248–255. doi:
10.1109/CVPR.2009.5206848.
[9] Dogs vs. Cats. https://www.kaggle.com/c/dogs-vs-cats. Accessed:
2020-03-20.
[10] Philipp Fischer et al. “Flownet: Learning optical flow with
convolutional networks”. In: arXiv preprint arXiv:1504.06852
(2015).
[11] Ian J. Goodfellow, Yoshua Bengio, and Aaron Courville. Deep
Learning. http://www.deeplearningbook.org. Cambridge, MA, USA: MIT
Press, 2016.
[12] Google Machine Learning Crash Course.
https://developers.google. com/machine-learning/crash-course.
Accessed: 2020-02-20.
[13] Daniel Gordon, Ali Farhadi, and Dieter Fox. “Re3 : Real-Time
Recur- rent Regression Networks for Object Tracking”. In: CoRR
abs/1705.06368 (2017). arXiv: 1705.06368. url:
http://arxiv.org/abs/1705.06368.
[14] Aurlien Gron. Hands-On Machine Learning with Scikit-Learn and
Tensor- Flow: Concepts, Tools, and Techniques to Build Intelligent
Systems. 1st. O’Reilly Media, Inc., 2017. isbn: 1491962291.
[17] Sergey Ioffe and Christian Szegedy. “Batch Normalization:
Accelerating Deep Network Training by Reducing Internal Covariate
Shift”. In: CoRR abs/1502.03167 (2015). arXiv: 1502.03167. url:
http://arxiv.org/ abs/1502.03167.
[18] Jetson Nano Developer Kit Technical Specifications.
https://developer. nvidia.com/embedded/jetson-nano-developer-kit.
Accessed: 2020- 03-06.
[19] Jupyter Notebook. https://jupyter.org/. Accessed:
2020-05-26.
[20] Andrej Karpathy et al. “Large-scale video classification with
convolutional neural networks”. In: Proceedings of the IEEE
conference on Computer Vision and Pattern Recognition. 2014, pp.
1725–1732.
[21] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic
op- timization”. In: arXiv preprint arXiv:1412.6980 (2014).
[22] Samuel Kotz, Tomasz Kozubowski, and Krzysztof Podgorski. The
Laplace Distribution and Generalizations. Jan. 2001, p. 19. isbn:
0-8176-4166-1. doi: 10.1007/978-1-4612-0173-1_5.
[23] Matej Kristan et al. “The visual object tracking vot2015
challenge re- sults”. In: Proceedings of the IEEE international
conference on computer vision workshops. 2015, pp. 1–23.
[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton.
“ImageNet Clas- sification with Deep Convolutional Neural
Networks”. In: Advances in Neural Information Processing Systems
25. Ed. by F. Pereira et al. Cur- ran Associates, Inc., 2012, pp.
1097–1105. url: http://papers.nips.cc/
paper/4824-imagenet-classification-with-deep-convolutional-
neural-networks.pdf.
[25] Poole David L. and Mackworth Alan K. Artificial Intelligence:
Founda- tions of Computational Agents. USA: Cambridge University
Press, 2010. isbn: 0521519004.
[26] Jang Lee and Kwanggi Kim. “Applying Deep Learning in Medical
Images: The Case of Bone Age Estimation”. In: Healthcare
Informatics Research 24 (Jan. 2018), p. 86. doi:
10.4258/hir.2018.24.1.86.
[27] Martn Abadi et al. TensorFlow: Large-Scale Machine Learning on
Het- erogeneous Systems. Software available from tensorflow.org.
2015. url: http://tensorflow.org/.
[28] Pamela McCorduck et al. “History of Artificial Intelligence.”
In: IJCAI. 1977, pp. 951–954.
[29] Tom M. Mitchell. Machine Learning. New York: McGraw-Hill,
1997. isbn: 978-0-07-042807-2.
[30] Siddhartha Sankar Nath et al. “A survey of image
classification methods and techniques”. In: 2014 International
Conference on Control, Instru- mentation, Communication and
Computational Technologies (ICCICCT). IEEE. 2014, pp.
554–557.
[31] Olga Russakovsky et al. “ImageNet Large Scale Visual
Recognition Chal- lenge”. In: International Journal of Computer
Vision (IJCV) 115.3 (2015), pp. 211–252. doi:
10.1007/s11263-015-0816-y.
[32] Mark Sandler et al. “Mobilenetv2: Inverted residuals and
linear bottle- necks”. In: Proceedings of the IEEE conference on
computer vision and pattern recognition. 2018, pp. 4510–4520.
[33] Shibani Santurkar et al. “How does batch normalization help
optimiza- tion?” In: Advances in Neural Information Processing
Systems. 2018, pp. 2483– 2493.
[34] Jurgen Schmidhuber. “Deep learning in neural networks: An
overview”. In: Neural networks 61 (2015), pp. 85–117.
[35] Patrice Y Simard, David Steinkraus, John C Platt, et al. “Best
practices for convolutional neural networks applied to visual
document analysis.” In: Icdar. Vol. 3. 2003. 2003.
[36] Arnold WM Smeulders et al. “Visual tracking: An experimental
survey”. In: IEEE transactions on pattern analysis and machine
intelligence 36.7 (2013), pp. 1442–1468.
[37] SparkFun JetBot AI Kit.
https://www.sparkfun.com/products/retired/ 15365. Accessed:
2020-03-31.
[38] Nitish Srivastava et al. “Dropout: a simple way to prevent
neural net- works from overfitting”. In: The journal of machine
learning research 15.1 (2014), pp. 1929–1958.
[39] Ran Tao, Efstratios Gavves, and Arnold WM Smeulders. “Siamese
in- stance search for tracking”. In: Proceedings of the IEEE
conference on computer vision and pattern recognition. 2016, pp.
1420–1429.
[40] The pdf of the Laplace distribution.
https://commons.wikimedia.org/ wiki/File:Laplace_pdf_mod.svg.
Accessed: 2020-04-03.
[41] Jason Yosinski et al. “How transferable are features in deep
neural net- works?” In: CoRR abs/1411.1792 (2014). arXiv:
1411.1792. url: http: //arxiv.org/abs/1411.1792.
[42] Aston Zhang et al. Dive into Deep Learning. https://d2l.ai.
2020.
[43] Xin Zhang et al. “Object Class Detection: A Survey”. In: ACM
Computing Surveys (CSUR) 46 (Oct. 2013). doi:
10.1145/2522968.2522978.
Lund University Box 118, SE-221 00 Lund, Sweden
http://www.maths.lth.se/
Contents
Introduction
Background
Methodology
Dataset