Generic Object Tracking with NVIDIA Jetson Nano Using Siamese Convolutional Neural Networks

Alexander Selberg

Lund Institute of Technology
Centre for Mathematical Sciences
Abstract
In this thesis, a generic object tracker was constructed and applied both to a commonly used tracking dataset using a regular computer and to a robot powered by a small NVIDIA computer. The architecture of the tracker consisted of two parallel convolutional neural networks whose outputs are merged into a single output. The input consisted of two separate cropped images that were fed into the networks separately. The images depicted an object from an image sequence at time t and t + 1, both centered at the object at time t. The purpose of the network is then to compare the two images and output coordinates for the object's position at time t + 1.
The tracker was successful in following several objects from a commonly used visual object tracking dataset but performed inconsistently for different scenarios depending on its training time. The size of the tracker became a problem when applying it to the robot, requiring significant size reduction. This had a negative effect on the tracker's performance. The tracker managed to track at up to 60 fps when used on the computer but only around 10 fps on the robot. It's likely that the tracking performance and speed on the robot can be improved significantly by optimizing the tracker's neural network structure as well as adjusting its training duration.
Acknowledgements
I wish to show my gratitude to all the people at SAAB Dynamics AB who have been a vital part of shaping this thesis. First I'd like to thank Mattias Helsing for providing the opportunity for me to conduct my thesis at SAAB. I'm also very grateful to my supervisor at SAAB, Bjorn Johansson, for providing me with valuable discussions and assistance during my time here. I'd also like to offer special thanks to Gabriel Khajo and Richard Barkman at SAAB for their interest in my work and invaluable help.
I'd also like to thank my supervisor at Lund Institute of Technology, Anders Heyden, for providing feedback and assistance during my thesis. Finally, I'd like to thank my examiner, Kalle Astrom, for showing genuine interest in my work.
Contents

1 Introduction
  1.1 Background
  1.2 Purpose and goal
  1.3 Equipment
2 Theory
  2.1 Machine Learning
    2.1.1 Polynomial Curve Fitting
    2.1.2 Regularization
    2.1.3 Gradient Descent
  2.2 Neural Networks
    2.2.1 Activation Functions
    2.2.2 Backpropagation
    2.2.3 Dropout
    2.2.4 Batch Normalization
  2.3 Convolutional Neural Networks
    2.3.1 Image Classification
    2.3.2 Convolutional Layers
    2.3.3 Pooling Layers
    2.3.4 Fully Connected Layers
  2.4 Generic Object Tracking
    2.4.1 Transfer Learning
    2.4.2 Data Augmentation
    2.4.3 Region Overlap Score
  2.5 Network Architecture
    2.5.1 Depthwise Separable Convolutions
    2.5.2 Inverted Residuals and Linear Bottlenecks
3 Methodology
  3.1 Dataset
    3.1.1 Data Preprocessing
    3.1.2 Data Augmentation
  3.2 Network Input / Output
  3.3 Training
    3.3.1 Training Data Preparation
    3.3.2 Network Training
  3.4 Tracking
  3.5 Tracking Using the Jetbot
4 Results & Discussion
  4.1 Tracking Scenarios
  4.2 Dataset
  4.3 Tracking Speed
  4.4 Tracking With the Jetbot
5 Conclusion & Further Work
Bibliography
1 Introduction

1.1 Background
The importance of the computer when it comes to the advancements made in modern society cannot be overstated. With it came humanity's ability to perform calculations and to solve problems that would otherwise be considered impossible. However, despite this enormous capacity for problem-solving, computers still have a hard time with several problems that we as humans consider trivial, for example speech and object recognition [43]. A person would have no problem identifying a cat seen in a picture, even if the cat was partly occluded or of a different color or breed than previously known to the person. The brain is amazing at processing visual information and using previous knowledge to quickly identify new objects. An image, according to a computer, simply consists of a bunch of numbers in a specific order, as can be seen in Figure 1.1. For a computer to make sense of this it has to be able to somehow interpret those numbers, and this is where Artificial Intelligence comes in.
Figure 1.1: (a) The digit 3 represented as a matrix (MNIST dataset). (b) One color channel of an image of a truck represented as a matrix (CIFAR-10 dataset).
Artificial Intelligence, or AI for short, has been around as a concept for a very long time. The thought of constructing machines or robots that act and behave like intelligent beings can be found as early as ancient Greece [28], but it wasn't until the invention of the modern computer that this dream suddenly appeared within reach. While some probably associate AI with sentient robots and artificial humans, the field is a bit broader than that. AI can be defined as the development and study of so-called intelligent agents [25]. An agent in this case can be several things, for example a thermometer, a dog or a human. Its only definition is that it is something that acts in an environment. For an agent to be considered intelligent it has to be able to do more than just act; it has to be able to adapt to new environments, come up with solutions and learn from past mistakes. By constantly receiving new feedback the agent is supposed to continuously improve its performance. For the agent to learn and adapt sufficiently, it has to gather all the knowledge gained from its experience and organize it as a hierarchy of concepts. By learning a large number of simpler concepts and from them constructing more complex concepts, it can gain a better understanding of its environment. If all these concepts are visualized in a graph, built on top of each other in layers, the graph would be considered deep, and this approach to AI is what is considered Deep Learning [11], a subset of Machine Learning.
Machine Learning can be defined as a learning process where the computer, through repeated exposure and gained experience, learns to recognize patterns and important features for different kinds of problems. This can be anything from learning to differentiate between spam and non-spam emails, creating an automated customer service system or identifying cats in images [1]. The last-mentioned problem is called an image classification problem and is one of the more difficult problems for a computer to solve, simply because of the huge amount of difference there can be between two pictures depicting the same thing. By allowing the computer to train on a large dataset consisting of images with specific objects and constantly providing feedback to the computer on its current performance, the computer will eventually learn which features are important for different objects, and hopefully, when presented with a picture of a cat, the computer will then correctly classify it as a cat. For this object classification to be successful a huge dataset is often needed for the computer to train on, which in turn requires a large computational capacity. Recent years have seen massive progress in this area, much thanks to the increase in computational capacity of modern computers. For these kinds of problems, Deep Learning can be an effective approach.
Deep Learning in image classification had a resurgence in 2012 when a deep learning model called AlexNet outperformed the then state-of-the-art image classifiers using a deep neural network [24]. The objective was to classify 1.2 million high-resolution images into a set of 1000 different classes. The key to their success, they argue, is the access to very large datasets such as ImageNet, consisting of over 15 million high-resolution images in roughly 22,000 categories [8], and their ability to construct such a large network thanks to GPUs. Graphics Processing Units (GPUs) are widely used in video games, and thanks to that huge market and competition, continuous performance improvements have been made and prices have been driven down significantly. It turns out that their ability to quickly calculate vector and matrix multiplications in parallel is beneficial for training neural networks and superior to the previously used Central Processing Unit (CPU) [34].
1.2 Purpose and goal
In this project the objective is to construct a generic object tracker and implement it on a SparkFun JetBot AI Kit powered by an NVIDIA Jetson Nano Developer Kit computer [37] [18]. Often when constructing object trackers, the object that is supposed to be tracked is known from the start. It might be a tracker whose purpose is to follow football players around the field, or a tracker that tracks the cars during a NASCAR race. For these trackers it is sufficient for their models to simply train on one specific object. The purpose of a generic object tracker is to be able to track any object without any prior training on that specific object. Tracking an object consists of knowing its current location at any time during an image sequence, and for a generic object tracker to be able to do that it must either be able to accurately predict the object's next location or continuously detect the object in each given frame.
There are two common approaches when constructing object trackers: online and offline trackers. An online object tracker operates online at all times and continuously learns new features and adjusts current features. This can result in a very accurate tracker that has little problem with tracking an object's translation and scaling differences, but the obvious downside is the computational effort it takes to constantly recalculate the tracking parameters for every image frame [13].
An offline tracker is the complete opposite; instead of learning new features as it runs, it completely relies on its pretrained model. This requires far less computational effort and can therefore prove beneficial when using the NVIDIA device or other embedded devices. It can still achieve a high accuracy and its evaluation at test time is very fast, but the issue lies in its inability to adapt to new situations, and its performance will decrease significantly in more difficult tracking scenarios such as large occlusions, significant appearance changes, etc. [13].
The goals of this thesis can be divided into three major sections:
• Creating a generic object tracker that can learn to track objects through image sequences found in commonly used datasets or personally created image sequences.
• Applying the tracker algorithm to the Jetbot and attempting to track real-life objects using its built-in camera.

• Investigating and demonstrating the capabilities and limitations of the NVIDIA device.
1.3 Equipment
The equipment used for this thesis is a computer with an NVIDIA Quadro P4000 graphics card and a SparkFun Jetbot AI Kit powered by an NVIDIA Jetson Nano Developer Kit. The technical specifications for the NVIDIA device can be found in [18].
2 Theory

2.1 Machine Learning
In machine learning, the goal is to allow a computer system to improve its performance on a task through training. A commonly used definition of machine learning is: "Improving some measure of performance P when executing some task T, through some type of training experience E" [29]. This section aims to provide an understanding of how a computer system can be constructed that automatically improves through experience. A simple regression problem, polynomial curve fitting, can be used to introduce several of the key concepts in machine learning.
2.1.1 Polynomial Curve Fitting
Polynomial curve fitting is a regression problem first encountered in statistics. The concept is simple: based on a set of observations x ≡ (x_1, ..., x_N), construct a polynomial that can accurately predict corresponding observations of the values t ≡ (t_1, ..., t_N) [4]. The polynomial takes the form

y(x, w) = w_0 + w_1 x + w_2 x^2 + \dots + w_M x^M = \sum_{j=0}^{M} w_j x^j, \qquad (2.1)
where M is the order of the polynomial. These regression coefficients w_j are usually referred to as weights in ML. The weights will be determined by trying to fit the polynomial to the observed set, which is normally done by minimizing a so-called loss function. Loss in this case is defined as the difference between the predicted value y(x_n, w) and t_n and can be visualized as the green bars in Figure 2.1.
The goal is to minimize the loss for all the points in the set, which is done by combining all the losses into a loss function. There are several loss functions that can be used; one of the more commonly used is the Mean Square Error (MSE), often referred to as L2 loss, defined as

L_2(w) = \frac{1}{N} \sum_{n=1}^{N} \left( y(x_n, w) - t_n \right)^2. \qquad (2.2)
Figure 2.1: Visualization of the loss, figure taken from [4]
L2 loss is a quadratic function of the coefficients w and therefore its derivatives will be linear, which means that there must be a unique solution that minimizes the loss function [4]. In fact, for a set containing N points a perfect solution, resulting in no loss, can always be found from a polynomial of order M = N − 1, since the polynomial will contain N degrees of freedom corresponding to the weight coefficients (w_0, ..., w_{N-1})^T [4]. At first glance this might seem like the optimal solution, and while the goal was to minimize loss, the main objective is to predict a hidden pattern or function that the observations x stem from. Naturally, if the observations x were indeed taken from a specific function at different values, the perfect solution, with zero loss, would be the specific function they were taken from, not necessarily of the order M = N − 1. However, almost all observations include some noise that will have an impact on their value.
The hidden function is not necessarily a polynomial function, and the range might be limited. The 10 observations in Figure 2.2 come from the function sin(2πx), spaced uniformly in the range [0, 1] with a small amount of Gaussian distributed noise [4].
Figure 2.2 shows the observed dataset x as the blue dots, the hidden function sin(2πx) as the green curve and the fitted line as the red curve for polynomials of order M = 0, 1, 3 and 9. M = 3 best fits the green curve, while M = 0 and 1 are both poor approximations of the hidden function, and while M = 9 produces zero loss it is also a poor approximation. Several other observations within the range would produce massive loss. This is a common problem within ML called overfitting, where the function or model is trained too heavily towards the training data, achieving great results for the loss function with regard to the training data but performing worse for any other data taken from the hidden function. This highlights the importance of dividing the data into different parts.
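To make the experiment concrete, the following is a minimal NumPy sketch that fits polynomials of different orders to noisy samples of sin(2πx) and reports the training loss; the sample count, noise level and random seed are illustrative assumptions, not values from the thesis.

```python
import numpy as np

rng = np.random.default_rng(0)

# 10 noisy observations of the hidden function sin(2*pi*x), as in Figure 2.2.
N = 10
x = np.linspace(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, N)

for M in (0, 1, 3, 9):
    # Fit a polynomial of order M by minimizing the squared error.
    w = np.polyfit(x, t, M)
    y = np.polyval(w, x)
    mse = np.mean((y - t) ** 2)          # L2 loss (2.2) on the training points
    print(f"M = {M}: training MSE = {mse:.4f}")
```

As the overfitting discussion above suggests, the training loss keeps shrinking as M grows, even though the M = 9 fit generalizes poorly to new points from the hidden function.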
Another loss function that is commonly used is the Mean Absolute Error (MAE), also referred to as L1 loss, which is defined similarly to the L2 loss but with the absolute value of the loss,

L_1(w) = \frac{1}{N} \sum_{n=1}^{N} \left| y(x_n, w) - t_n \right|. \qquad (2.3)
While L2 loss more heavily punishes outliers and greatly rewards small losses, it can sometimes lead to small errors not being penalized enough. L1 loss penalizes smaller errors further, which can sometimes be beneficial [15].

Figure 2.2: Curve fitting for different orders of polynomials, figure taken from [4]
Training, validation and test dataset
The dataset, previously referred to as the observed dataset, is often divided into three parts:
• Training dataset: The data used in the loss function, usually around 80% of the data.
• Validation dataset: Data not used in the loss function, continuously monitored, usually around 10% of the data.

• Test dataset: Previously unseen data, evaluated at the end of the training, usually around 10-20% of the data.
A machine learning model's strength lies in its ability to generalize and work on previously unseen data for the same task. Outstanding performance on the training set does not necessarily result in great performance otherwise, as previously seen in Figure 2.2. The goal when training machine learning models therefore becomes to perform well on the test dataset, which is only evaluated at the end. The validation dataset can provide a hint for when it is time to stop training the model.
Figure 2.3 shows the loss for the training and validation set as a function of the training time. Initially, both losses are steadily decreasing, but after a while the validation loss starts to increase while the training loss keeps getting smaller. This can indicate that the model is overfitting to the training dataset and that it is a reasonable time to stop training.

Figure 2.3: Illustration of overfitting
2.1.2 Regularization
The problem with overfitting shows that it's necessary to take into consideration more than just the loss, and one way to reduce overfitting is to incorporate a penalty on model complexity together with the loss function. This is called regularization and is done by adding a penalty term to the loss function (2.2), so that the magnitude of the weights contributes to the new modified function

L(w) = \frac{1}{N} \sum_{n=1}^{N} \left( y(x_n, w) - t_n \right)^2 + \lambda \|w\|^2, \qquad (2.4)

where \|w\|^2 = w^T w = w_0^2 + w_1^2 + \dots + w_M^2 [4]. This is called L2 regularization, and other forms exist as well [12]. λ is a value that determines how strongly simplicity is encouraged: a higher λ strengthens the regularization effect and encourages smaller weights, while a small λ allows more complexity and larger weights.
2.1.3 Gradient Descent
When trying to reduce the loss of a function, an iterative approach is mostly used in practical applications. Perhaps the simplest and most common method is called gradient descent. Observing a curve like the one in Figure 2.4, the process of moving from the starting point to the next point in the plot shows one step of the iterative process. The starting point can be the result of choosing a weight value at random. The gradient of the loss curve is then calculated, and since the idea is to reduce the loss, a step is taken in the direction of the negative gradient and the weight value is updated accordingly. This is done repeatedly until a satisfying loss value is reached, ideally close to the bottom of the curve. The size of the step is decided based on both the magnitude of the gradient as well as an arbitrary step size, or learning rate [12]. The learning rate is part of a set called hyperparameters. Hyperparameters are values that are determined before the iteration begins and usually need to be thoroughly tweaked and examined before the model can achieve good results. In this example a learning rate that is too large would risk stepping over the entire bottom section, returning an even higher loss and never converging towards a desired loss, while a learning rate that is too small would theoretically eventually reach the minimum loss, but in practice this can take a very long time [12].
Figure 2.4: Illustration of a convex loss curve, figure taken from [12]
The loss curve in Figure 2.4 is convex, meaning there exists only one minimum value for the loss function. Usually this is not the case, and there can be many local minima that gradient descent risks getting stuck in.
When using gradient descent the weights are updated as
w^{(\tau+1)} = w^{(\tau)} - \eta \nabla L(w^{(\tau)}), \qquad (2.5)
where η > 0 is the learning rate [4]. The loss function here is defined over the entire training set, or batch, a method known as batch gradient descent [4], which can prove troublesome when working with very large training datasets since the loss function is calculated individually for all training examples as

L(w) = \sum_{n=1}^{N} L_n(w), \qquad (2.6)
where N is the number of training examples. One method to reduce the computational burden is simply to update the weights based on just one training example; this is a form of gradient descent called stochastic gradient descent, or SGD. The term stochastic comes from the example being chosen at random [12]. Here the computation becomes much quicker but also more noisy and irregular due to its stochastic nature [14]. The new gradient descent will jump around more and change directions, but the idea is that on average it will work its way down the loss curve. Its irregular behaviour also has the added benefit of sometimes escaping local minima, preventing it from converging towards higher losses. The weights are now updated as
w^{(\tau+1)} = w^{(\tau)} - \eta \nabla L_n(w^{(\tau)}). \qquad (2.7)
Both examined examples of gradient descent can be seen as extreme cases, since the number of examples used is either all of them or just one. A natural compromise is to use a smaller batch of examples, still reducing the computational burden and resulting in quicker calculations and convergence than batch gradient descent. This method is known as mini-batch gradient descent [14], and the reason why it is usually preferred over SGD has to do with it being slightly less erratic and irregular, but also with the fact that it is more computationally efficient to calculate the gradient once for 100 examples than 100 times with one example [7]. The number of examples used in mini-batch gradient descent is known as the batch size and is also a hyperparameter, like the previously mentioned learning rate. Choosing a suitable batch size is usually done simply by trying out different values and observing the model's performance. It is common to choose batch sizes that are powers of 2, such as 32, 64 or 128, because in practice many vectorized operation implementations work faster when their inputs are powers of 2 [7].
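As an illustration of the procedure described above, the following is a minimal NumPy sketch of mini-batch gradient descent applied to the earlier polynomial model; the learning rate, batch size and number of epochs are assumed values chosen only for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: noisy samples of sin(2*pi*x), as in the curve-fitting example.
x = rng.uniform(0, 1, 200)
t = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, x.size)

M = 3                                     # polynomial order
X = np.vander(x, M + 1, increasing=True)  # design matrix: columns x^0 ... x^M
w = np.zeros(M + 1)                       # weights, initialized to zero

eta, batch_size, epochs = 0.1, 32, 500    # hyperparameters (assumed values)

for _ in range(epochs):
    idx = rng.permutation(x.size)
    for start in range(0, x.size, batch_size):
        b = idx[start:start + batch_size]
        y = X[b] @ w                                # predictions for the mini-batch
        grad = 2 * X[b].T @ (y - t[b]) / b.size     # gradient of the MSE loss
        w -= eta * grad                             # update rule (2.5)

print("learned weights:", w)
```

Setting `batch_size = 1` recovers SGD, while `batch_size = x.size` recovers batch gradient descent.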
2.2 Neural Networks
Neural networks as a concept have existed since the 1950s [5], and although they are both inspired by and often likened to information processing in biological systems, the similarities are usually exaggerated. In ML, a neural network consists of three types of layers: the input layer, which handles the input to the network; one or several hidden layers, which sum together the inputs from the previous layer, each individually multiplied with an associated weight, and pass the result through an activation function; and finally an output layer, which produces the network's output. Figure 2.5 shows a simple neural network with two inputs, three neurons in the single hidden layer and two outputs, as well as two bias terms x_0 and h_0. The grey lines symbolize a weight multiplication and the arrows denote in which order the calculations are executed. In this figure all the grey lines point in the right direction; such neural networks, where all operations occur in the same direction from the input to the output layer, are called feed-forward networks [4].

Figure 2.5: Neural network containing one hidden layer
The input is summed together in each hidden neuron together with its respective weights as

a_j = \sum_{i=0}^{D} w_{ji}^{(1)} x_i, \qquad (2.8)

where j = 1, ..., M, x_0 = 1, D is the number of inputs and M is the number of hidden neurons in the layer excluding the bias. The superscript (1) refers to the first hidden layer of the network. The quantities a_j are referred to as activations [4]. In the provided example there is only one hidden layer, but there can also be several. Regardless of how many layers are used, their intended purpose is not yet clear. So far, the output of the network assumes a linear correlation with the inputs; the hidden layer has just provided a broader range of linear combinations of the inputs, but the network is unable to handle nonlinear problems. The solution to this lies in the usage of activation functions.
2.2.1 Activation Functions
Activation functions are applied after each hidden layer and transform the activation in (2.8) into a nonlinear function. The purpose of activation functions is to introduce complexity into the network and allow it to model more complex and nonlinear problems. The output of each hidden node then becomes

z_j = \sigma(a_j) = \sigma\left( \sum_{i=0}^{D} w_{ji}^{(1)} x_i \right), \qquad (2.9)

where σ is the activation function. Typical activation functions include the rectified linear unit (ReLU) and the sigmoid function.
Figure 2.6 shows two of the most common activation functions. ReLU retains only positive inputs and discards all negative inputs by setting them to zero. Despite its simplicity it has proven to often provide the best results while still being very simple to compute [24]. The sigmoid function instead converts all activations to values between 0 and 1. Which activation function to use is more often decided based on what works best rather than some fixed set of rules [12].

Figure 2.6: Two common activation functions: (a) ReLU, F(x) = max(0, x); (b) sigmoid, F(x) = 1/(1 + e^{-x}). Made in Python using matplotlib.
Combining the outputs from the hidden layer in (2.9) with the last set of weights, the output of the model then becomes

y_k(x, w) = \sum_{j=0}^{M} w_{kj}^{(2)} z_j, \qquad (2.10)

where the bias term w_{k0}^{(2)} is the weight for j = 0 with z_0 = 1.
The goal is to find suitable values for the weights that will produce good output values. Good here means predicting values as close to the "true" values as possible, or minimizing the loss function. The way this is done in neural networks is through backpropagation.
2.2.2 Backpropagation
The important contribution of the backpropagation technique is its ability to calculate the gradients ∂L/∂w_{ij} for all the weights in a computationally efficient manner [4]. The loss function for the derivation of backpropagation is chosen to be L = \frac{1}{2} \sum_k \left( y_k(x, w) - t_k \right)^2 for simplicity's sake. This results in ∂L/∂y_k = y_k − t_k. The chain rule can now be used to calculate the derivative of the loss with respect to the weights connected to the output layer as

\frac{\partial L}{\partial w_{kj}} = \frac{\partial L}{\partial y_k} \frac{\partial y_k}{\partial w_{kj}} = (y_k - t_k)\, z_j.
The same type of calculation can then be used for all weights in the network before updating them and running another forward pass. Continuously doing this will then hopefully result in a more accurate network. This also illustrates another benefit of using the previously mentioned activation functions ReLU and sigmoid: their derivatives are simple,

\frac{\partial \,\text{ReLU}(x)}{\partial x} = \begin{cases} 1 & x > 0 \\ 0 & x < 0 \end{cases}, \qquad \frac{\partial S(x)}{\partial x} = S(x)\left(1 - S(x)\right).
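The following is a small NumPy sketch of one forward and backward pass through a network like the one in Figure 2.5, using a sigmoid activation and the squared-error loss from the derivation above; the layer sizes, initialization and learning rate are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)

# Network dimensions (assumed): 2 inputs, 3 hidden neurons, 2 outputs.
D, M, K = 2, 3, 2
W1 = rng.normal(0, 0.1, (M, D + 1))      # hidden weights incl. bias column
W2 = rng.normal(0, 0.1, (K, M + 1))      # output weights incl. bias column

x = np.array([0.5, -1.2])                # one training example
t = np.array([1.0, 0.0])                 # its target

# Forward pass, following (2.8)-(2.10).
x_ext = np.concatenate(([1.0], x))       # prepend bias input x0 = 1
a = W1 @ x_ext                           # activations a_j
z = sigmoid(a)                           # hidden outputs z_j
z_ext = np.concatenate(([1.0], z))       # prepend bias node z0 = 1
y = W2 @ z_ext                           # network outputs y_k

# Backward pass for the loss L = 1/2 * sum_k (y_k - t_k)^2.
dy = y - t                               # dL/dy_k
dW2 = np.outer(dy, z_ext)                # dL/dw_kj = (y_k - t_k) * z_j
dz = W2[:, 1:].T @ dy                    # backpropagate to the hidden outputs
da = dz * z * (1 - z)                    # sigmoid derivative S(x)(1 - S(x))
dW1 = np.outer(da, x_ext)                # dL/dw_ji = da_j * x_i

eta = 0.1
W2 -= eta * dW2                          # gradient descent updates
W1 -= eta * dW1
```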
2.2.3 Dropout
Dropout is an extremely effective and simple regularization technique often used in neural networks [7]. It works by applying a probability p of staying active to each node in the hidden layers of the network. A p-value of 0.5 would mean each node has a probability of 50% of staying active, and a probability (1 − p) = 50% of being set to zero. The process is illustrated in Figure 2.7.

Figure 2.7: Dropout being applied to a neural network, figure taken from [38]
It might appear counterintuitive to simply remove certain nodes from the network based on some probability but it has been shown to be very effective at preventing overfitting and also speeding up training notably [38]. The idea is to force the network not to rely too heavily on certain nodes that can have a strong adaptation to the training set. Dropout is usually only applied to the network during training time and is not used during testing [7].
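As a hedged illustration, the following Keras snippet shows how dropout with p = 0.5 is typically inserted between fully connected layers; the layer sizes are arbitrary and not taken from the thesis. Note that the argument to Keras' Dropout layer is the probability of a node being dropped, i.e. 1 − p.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Small illustrative classifier with dropout between the dense layers.
# Dropout(0.5) zeroes each unit with probability 0.5 during training only;
# at test time the layer is a no-op, as described above.
model = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])
model.summary()
```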
2.2.4 Batch Normalization
Batch normalization is a commonly used technique that can improve the speed, performance and stability of a network [17]. Its purpose is to stabilize the distribution of layer inputs. This is achieved by introducing new network layers that control the mean and variance of these distributions. The widespread understanding of its success comes from its assumed reduction of the internal covariate shift (ICS) [17]. ICS refers to the change in distribution of some layer input caused by updates to the preceding layers [33]. New evidence points at other underlying reasons for the success of batch normalization, such as a smoothing of the optimization landscape, leading to more predictive and stable behaviour from the gradients, allowing for faster and more stable training [33].
Regardless of the exact reason behind the success of batch normalization, the method is commonly used with great success in a large variety of networks [33] and the process is rather simple. For each activation x^{(k)} in the network, the so-called Batch Normalization Transform (BN) can be applied to mini-batches of the training set, similar to the mini-batch gradient descent in 2.1.3. The input consists of the values of x for a single activation over a mini-batch B = {x_1, ..., x_m}, with two parameters γ and β to be learned. The transform is

\mu_B \leftarrow \frac{1}{m} \sum_{i=1}^{m} x_i, \qquad \sigma_B^2 \leftarrow \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_B)^2, \qquad \hat{x}_i \leftarrow \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \qquad (2.16)

y_i \leftarrow \gamma \hat{x}_i + \beta \equiv \text{BN}_{\gamma,\beta}(x_i), \qquad (2.17)

which produces the output {y_i = BN_{γ,β}(x_i)}, where ε is a constant added to the mini-batch variance for numerical stability, µ_B is the mini-batch mean and σ_B^2 is the mini-batch variance [17].
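A minimal NumPy sketch of the transform in (2.16)-(2.17), applied to a single activation over one mini-batch; the example values of γ, β and the batch are arbitrary.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch Normalization Transform (2.16)-(2.17) for one activation
    over a mini-batch x = [x_1, ..., x_m]."""
    mu = x.mean()                          # mini-batch mean
    var = x.var()                          # mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize
    return gamma * x_hat + beta            # scale and shift (learned parameters)

batch = np.array([0.3, 2.1, -1.4, 0.8, 1.2])
print(batch_norm(batch, gamma=1.0, beta=0.0))
```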
2.3 Convolutional Neural Networks
A convolutional neural network, or CNN, is a class of deep neural networks commonly used when dealing with visual tasks such as image classification, object tracking or semantic segmentation [42]. Deep learning resurfaced in 2012 when a CNN was used at the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [24] [31] and greatly outperformed all other contestants. The purpose of the challenge was to classify the object in an image for 1.2 million high-resolution images from 1000 different classes. One evaluation was to compare the 5 most likely classes according to the models and see if any prediction matched the ground truth. The network used in [24] was called AlexNet and achieved a top-5 test error rate of 15.3%, compared to the second best entry at 26.2%. Following this remarkable improvement, numerous other CNNs have appeared, further improving and breaking new records for image-related problems.
2.3.1 Image Classification
Image classification is the task of assigning an input image a label from a fixed set of categories [7]. It's one of the core problems in Computer Vision and there exist several different approaches to solving it [30]. An image usually consists of a three-dimensional array (w, h, d) = (w, h, 3), where w is the width of the image in pixels, h the height and d the depth, which corresponds to the three red, green and blue (RGB) color channels. The pixels (elements) of the array are integers ranging from 0 (black) to 255 (white) [7]. This means that a 256x256 pixel image will have 256 ∗ 256 ∗ 3 = 196608 integers. Using large datasets will thus result in massive amounts of data needing to be processed, but this is crucial for the performance of deep neural networks.
The process of achieving image classification using CNNs can be described in a very basic way using a few steps. First the architecture of the network must be constructed with the correct input and output specifications. The possible outputs are all the specified classes and can be either a simple 1 for the predicted class and 0 for the rest, or a probability for each class, summing to 1. The network is then fed, or trained, with the entire training set of images together with the correct outputs. The layers in the network will discover and save features from all the images, with the idea that some combination of features strongly correlates with a specific class, so that when an image is fed into the trained network it will be able to predict the correct label. This process is visualized in Figures 2.8 and 2.9.
Figure 2.8: Convolutional neural network being trained with images of cats and dogs with ground truth labels
While Figure 2.8 implies that there were only a total of 12 images used for training, training datasets are normally much larger. The dataset Dogs vs. Cats [9] consists of 25,000 training images of dogs and cats.
Figure 2.9: Test image fed into the network with correct prediction
2.3.2 Convolutional Layers
Convolutional layers are the main building blocks of CNNs. They consist of a number of two-dimensional feature maps whose size depends on several factors: the input size, the filter size, whether padding is enabled or not, and the size of the stride. The input is the previous layer, for example an image of size (256x256x3); the filter is a small three-dimensional matrix, usually of size 3x3x3 or 5x5x3, where each depth slice of the filter belongs to one color channel. The filters are multiplied with the input using a dot product for each depth, sliding, or convolving, across the matrix. The dot products for each step over all depths are added together and produce one element of the feature map as output [7]. The number of filters used corresponds to the final number of feature maps. The first step of the process can be seen in Figure 2.10.
The figure shows an image of size 5x5x3 with zero-padding enabled. Zero-padding means adding an outer shell to the input matrices, consisting solely of zeros. Each channel is dot multiplied with the corresponding depth of each filter, with a bias term added to the output. The filter is moved 2 steps to the right for each step. More specifically, it is the first step that is highlighted: the 3x3 region of each zero-padded input channel is multiplied elementwise with the corresponding depth slice of filter W0, the products are summed over all three channels and the bias is added, giving the output value −2.
Figure 2.10: Convolutional layer with two filters W0 and W1, zero-padded with a stride of 2, applied to an image of size 5x5x3, producing two feature maps of size 3x3, figure taken from [7]

One of the main advantages of using this method compared to the previously mentioned calculations in regular neural networks is the considerably smaller number of computations required. In regular neural networks all neurons in a layer are connected with every neuron in the adjacent layers. For an image input of size 256x256x3, where every pixel represents a node, this would result in 256 ∗ 256 ∗ 3 = 196608 weights for every single node in the next layer. Usually far more neurons are necessary, which would result in a massive amount of weights needing to be continuously calculated and updated. For convolutional layers, each element of each filter has its own weight, meaning that the number of weights only depends on the size of the convolutional layer. A convolutional layer with filters of size 3x3x3 and depth 64, i.e. 64 feature maps, instead only has (3 ∗ 3 ∗ 3) ∗ 64 = 1728 weights attached to it.
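The weight count above can be checked with a short Keras sketch; note that Keras also counts one bias per filter, so it reports 1792 parameters rather than the 1728 kernel weights.

```python
from tensorflow import keras
from tensorflow.keras import layers

# A single convolutional layer with 64 filters of size 3x3x3.
model = keras.Sequential([
    keras.Input(shape=(256, 256, 3)),
    layers.Conv2D(64, kernel_size=3, padding="same"),
])

# (3 * 3 * 3) * 64 = 1728 kernel weights, plus 64 biases = 1792 parameters.
print(model.count_params())
```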
2.3.3 Pooling Layers
In CNNs it's common to periodically insert a pooling layer in between successive convolutional layers; the idea is to progressively reduce the spatial size of the network in order to reduce the number of parameters and the computational burden, as well as to prevent overfitting [7]. A common pooling filter is a 2x2 filter with stride 2, which will discard 75% of all activations, as can be seen in Figure 2.11. The principle is similar to the convolutional layers with sliding filters, but instead of performing the matrix dot multiplication, the pooling filter will choose a single value based on the type of pooling layer used. The most commonly used is max pooling, which picks the largest value in the filter and discards all the rest.

Figure 2.11: Max pooling with a 2x2 pool filter and stride 2
This approach might appear unintuitive at first, discarding a large amount of potentially important activations based on a seemingly arbitrary rule. However, when dealing with images, adjacent pixels usually show a strong correlation, so it's possible to reduce the resolution without losing the distinguishing features and patterns.
2.3.4 Fully Connected Layers
The last layers of CNNs usually consist of one or several fully connected layers (FCL). These layers work the same as the layers in regular neural networks, where every node in the layer is connected to every node in the adjacent layers. Large FCLs often represent almost all parameters in the entire network and are therefore responsible for fitting complex nonlinear discriminant functions in the feature space into which the input data elements are mapped [2]. For example, the previously mentioned AlexNet has 60 million parameters, with 58 million belonging to the last three FCLs [24].
Figure 2.12: The network architecture of AlexNet, taken from [24]
Figure 2.12 shows the network architecture of the original AlexNet. The reason why it had two parallel networks has to do with the limitations in GPU memories back in 2012 when the paper was released [24] and therefore two GPUs were used. A more commonly used architecture these days is called CaffeNet and is the concatenated version seen in Figure 2.13.
Figure 2.13: The network architecture of CaffeNet, taken from [26]
This network consists of an input image of size 224x224x3, 5 convolutional layers and 3 fully connected layers, with the last layer representing the number of classification categories in the ImageNet Large Scale Visual Recognition Competition [31]. It's still widely used as part of the network in several recent state-of-the-art applications [13] [15] [39].
2.4 Generic Object Tracking
The goal of generic object tracking is rather easy to formulate: based solely on an initial set of coordinates for a bounding box encompassing the desired, arbitrary object in the initial image frame, predict the object's location for all future frames [23]. Current generic object trackers predominantly rely on learning their tracking online, meaning they run and update in real-time, detecting the object in each frame and updating with regard to possible appearance changes occurring [13]. This can produce very accurate long-term trackers that are robust to occlusions, appearance and lighting changes, with the downside of only allowing simpler models with slower run-time due to the computational effort it takes to constantly update the tracker in real-time [3]. An alternative offline approach is to pre-train the entire model on a large dataset and lock all the parameters at run-time to allow for very fast tracking [15]. This method might also be beneficial when used with mobile or embedded devices, which have tighter computational constraints.
2.4.1 Transfer Learning
Transfer learning is the idea of using knowledge obtained from a previous problem and applying it to a new problem [41]. For object tracking it's possible to begin with a model trained on image classification before training it for tracking. The idea is that the pretrained network will provide the new model with some underlying understanding of the appearance of different objects; if the pretrained dataset is large and diverse enough, the feature maps learned can act as a generic model of the visual world with potential applications for a wide range of computer vision related problems [5].
2.4.2 Data Augmentation
One fundamental characteristic of deep learning is its reliance on access to large datasets. Since it's the responsibility of the network to find the important features instead of manual feature engineering, a large training dataset is needed, especially for high-dimensional inputs such as images [5]. Datasets used for training a network can occasionally be limited in their size, perhaps resulting in insufficient data for training a network to a higher capacity [35]. This further risks overfitting the network to the training data. One effective way around this is augmentation of the data. This can easily be done by translating, scaling, rotating or applying many other transformations to an image from the training set. An example of translation with scaling and horizontal mirroring can be seen in Figure 2.14.
Figure 2.14: Data augmentation of an image of a cat
By transforming the image, the computer is led to believe it's exposed to new images, which will help prevent overfitting and can lead to a great increase in the size of the original dataset. When training AlexNet, the researchers increased their training set by a factor of 2048 through data augmentation [24].
To create artificial motion for an image, image transformations can be used. Since subsequent image frames in an image sequence are taken within a small time interval during object tracking, the object has most likely moved very little relative to its previous position. One way to imitate this movement is to apply a translation based on the width and height of the object multiplied with Laplace-distributed variables, as in [15],

c'_x = c_x + w \cdot \Delta x, \qquad c'_y = c_y + h \cdot \Delta y,

where w and h are the width and the height of a cropped image containing the object, and Δx and Δy can be modeled with a Laplace distribution with mean 0. Accounting for potential size changes, the width and height can also be transformed in a similar manner,

w' = w \cdot \gamma_w, \qquad h' = h \cdot \gamma_h,

where γ_w and γ_h are Laplace-distributed variables with mean 1. This means that the most likely outcome is that the width and height remain the same, while still allowing for potential size changes.
A variable is said to have a Laplace distribution if its probability density function is [22]

f(x; \mu, b) = \frac{1}{2b} \exp\left( -\frac{|x - \mu|}{b} \right), \qquad (2.18)

where µ ∈ (−∞, ∞) and b > 0 are location and scale parameters, respectively [22]. The probability density function is visualized in Figure 2.15.
Figure 2.15: Probability density functions for different parameter values, figure taken from [40]
2.4.3 Region Overlap Score
One way to measure the performance of the tracker, other than simply visually evaluating its tracking, is to calculate its region overlap, often referred to as accuracy.
Figure 2.16: Illustration of the idea behind region overlap measurements. The red box depicts the predicted location of an object while the green box is the object's actual location
Figure 2.16 shows a simple illustration of a predicted bounding box in red, overlapping with the ground truth bounding box in green. FP stands for false positive, which is the area that the tracker believes to be the object while it is not. TP stands for true positive and is the area where the tracker predicts the object's location correctly. FN stands for false negative and is the part of the object that the tracker fails to predict.
An evaluation of the tracker's performance can then be computed as

\frac{\text{TP}}{\text{TP} + \text{FP} + \text{FN}}. \qquad (2.19)
A perfect performance, where the predicted bounding box is in the exact same position as the ground truth bounding box, would therefore result in FP = FN = 0 and (2.19) evaluating to 1. In the same way, the performance would equal 0 if the true positive area were 0.
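A small Python sketch of the region overlap score (2.19) for axis-aligned boxes given as (x1, y1, x2, y2); the example boxes are arbitrary.

```python
def region_overlap(pred, truth):
    """Region overlap (2.19) between two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(pred[0], truth[0]), max(pred[1], truth[1])
    ix2, iy2 = min(pred[2], truth[2]), min(pred[3], truth[3])
    tp = max(0, ix2 - ix1) * max(0, iy2 - iy1)                # true positive area
    fp = (pred[2] - pred[0]) * (pred[3] - pred[1]) - tp       # predicted-only area
    fn = (truth[2] - truth[0]) * (truth[3] - truth[1]) - tp   # missed object area
    return tp / (tp + fp + fn) if tp > 0 else 0.0

print(region_overlap((10, 10, 50, 50), (20, 20, 60, 60)))  # partial overlap
print(region_overlap((10, 10, 50, 50), (10, 10, 50, 50)))  # perfect overlap -> 1.0
```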
2.5 Network Architecture
The network is constructed with two pre-trained parallel MobileNetV2 networks [32] ending in two FCLs, as can be seen in Figure 2.17. This network architecture, with two parallel, identical networks sharing the same weights, can be referred to as a siamese neural network.
Figure 2.17: Network architecture with two cropped images as input and bounding box coordinates as output
The input to each network is a cropped section of an image taken at time t and t + 1 centered at the object of interest at time t. The assumption is that objects move smoothly through space and the previous position should then be a reasonable place to look for the object’s current position. The network should also develop an understanding of typical movement without including too much of the background. The output of the network will be the coordinates for the upper left and lower right corner of the bounding box capturing the object. Figure 2.17 shows a cross-country skier moving forward at a pace almost surpassing the cropped region. The position of the skier at time t+1 is then predicted and given as output. It’s important that the subsequent frames are close enough in time and that the cropped regions are large enough so that the object has not moved outside the cropped region at time t+1.
The choice of the MobileNetV2 network is based on it being adapted to mobile or embedded devices through using less memory and computations than more conventional networks, mostly thanks to its implementation of depthwise separable convolutions.
2.5.1 Depthwise Separable Convolutions
MobileNet as well as MobileNetV2 are two models based on depthwise separable convolutions, which is a form of convolution factorized into two separate parts: the depthwise convolution and the pointwise convolution [16]. The reason why such a factorization might be desirable is the low computational effort compared to the standard convolutional layers in 2.3.2. The structure of the factorized parts is illustrated in Figure 2.18.
Figure 2.18: Standard convolutional layers being factorized into depthwise and pointwise convolution, figure taken from [16]
Assuming a zero-padded input layer of size D_I × D_I × M using a stride of one, where D_I is the input width and height and M is the number of input channels (the usual three color channels in the case of images), and N filters of size D_K × D_K × M, a standard convolutional layer will perform a total of

D_I \cdot D_I \cdot M \cdot N \cdot D_K \cdot D_K \qquad (2.20)

computations. With depthwise separable convolutions, depthwise convolutional filters are first used, which are filters of size D_K × D_K × 1. There is only one filter for each input channel, which results in a total of D_I · D_I · M · D_K · D_K computations in the first step. This process filters each input channel into a new output feature map. The second step then creates a linear combination of each depth channel to produce new features through pointwise convolution. Convolutional filters of size 1 × 1 × M are applied to the output feature map from step one to create a final output feature map of size D_I × D_I × N, where N is the number of pointwise convolutional filters used. The number of computations in the second step amounts to M · N · D_I · D_I, leading to a total of

D_I \cdot D_I \cdot M \cdot D_K \cdot D_K + M \cdot N \cdot D_I \cdot D_I \qquad (2.21)

computations. This often results in significantly fewer computations than for the standard convolution in (2.20). The ratio between (2.21) and (2.20) becomes

\frac{D_I \cdot D_I \cdot M \cdot D_K \cdot D_K + M \cdot N \cdot D_I \cdot D_I}{D_I \cdot D_I \cdot M \cdot N \cdot D_K \cdot D_K} = \frac{1}{N} + \frac{1}{D_K^2}, \qquad (2.22)

which results in between 8 and 9 times fewer computations for the depthwise separable convolutions of size 3 × 3 used in MobileNet, at only a small loss in accuracy compared to the equivalent standard convolutional layer [16].
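The savings can be illustrated with a short Python sketch that evaluates (2.20) and (2.21) for an example layer shape; the chosen dimensions are assumptions for illustration only.

```python
def standard_conv_ops(d_i, m, n, d_k):
    # Standard convolution, equation (2.20).
    return d_i * d_i * m * n * d_k * d_k

def depthwise_separable_ops(d_i, m, n, d_k):
    # Depthwise plus pointwise convolution, equation (2.21).
    return d_i * d_i * m * d_k * d_k + m * n * d_i * d_i

# Illustrative layer shape (assumed): 112x112 input, 32 -> 64 channels, 3x3 kernels.
d_i, m, n, d_k = 112, 32, 64, 3
ratio = depthwise_separable_ops(d_i, m, n, d_k) / standard_conv_ops(d_i, m, n, d_k)
print(f"ratio = {ratio:.3f}  (1/N + 1/D_K^2 = {1 / n + 1 / d_k**2:.3f})")
```

For these values the ratio is roughly 0.13, i.e. about 8 times fewer computations, matching (2.22).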
2.5.2 Inverted Residuals and Linear Bottlenecks
MobileNetV2 predominantly consists of inverted residual blocks with linear bottlenecks, more simply referred to as bottlenecks [32]. The structure can be seen in Figure 2.19, where a 1×1 pointwise convolution is first performed, transforming the low-dimensional tensor of k input channels into a higher-dimensional space. Then a ReLU6 activation is applied before performing depthwise convolution using 3×3 filters, followed by another ReLU6 activation. ReLU6 is similar to the previously discussed ReLU but has an upper limit of 6, ReLU6 = max(0, min(x, 6)), instead of the usual ReLU = max(0, x). It's used due to its robustness when dealing with low-precision computations [16]. Finally another 1×1 pointwise convolution is performed, projecting the feature map back to a lower-dimensional tensor. Due to the inevitable information loss occurring in this last projection, empirical studies have shown that it's important that a linear activation is used for the last layer, so as to prevent destroying too much information [32]. A skip connection between the bottlenecks, or shortcut, is also implemented to allow the gradient to propagate through multiple layers.
Figure 2.19: Inverted residual block with linear bottleneck, figure taken from [32]
The idea behind this network architecture is built on the presumption that the information from a set of layer activations actually lies in some manifold, which in turn is embeddable into a low-dimensional subspace [32]. The thin layers in Figure 2.19 represent those subspaces, allowing fewer computations and less memory-intensive networks than more conventional networks, while still retaining the important information needed for the network to achieve comparable results.
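The following is a hedged Keras sketch of an inverted residual block with a linear bottleneck as described above; the channel counts, expansion factor and input shape are illustrative, and, in line with [32], the skip connection is only added when the input and output shapes match.

```python
from tensorflow import keras
from tensorflow.keras import layers

def inverted_residual(x, out_channels, expansion=6, stride=1):
    in_channels = x.shape[-1]
    # 1x1 pointwise expansion to a higher-dimensional space, then ReLU6.
    h = layers.Conv2D(expansion * in_channels, 1, use_bias=False)(x)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)
    # 3x3 depthwise convolution, then ReLU6.
    h = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    h = layers.ReLU(max_value=6.0)(h)
    # 1x1 linear projection back to a low-dimensional tensor (no activation).
    h = layers.Conv2D(out_channels, 1, use_bias=False)(h)
    h = layers.BatchNormalization()(h)
    # Skip connection only when the input and output shapes match.
    if stride == 1 and in_channels == out_channels:
        h = layers.Add()([x, h])
    return h

inputs = keras.Input(shape=(56, 56, 24))
outputs = inverted_residual(inputs, out_channels=24)
block = keras.Model(inputs, outputs)
block.summary()
```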
The architecture of the MobileNetV2 network can be seen in Figure 2.20, where bottleneck refers to the layer in Figure 2.19. The expansion rate is given by t, the factor by which the number of channels is increased in the first step of the bottleneck procedure. The number of output channels is denoted by c, n shows how many times the same layer is repeated in sequence, and s denotes the stride.
Figure 2.20: Network architecture of MobileNetV2, figure taken from [32]
3 Methodology
The tracker was built in Python using Keras [6], an open-source neural network library built on top of TensorFlow [27], an open-source library for Machine Learning applications. Keras provides a framework for constructing the necessary components for using neural networks. An NVIDIA Quadro P4000 GPU was used to train the tracker.
3.1 Dataset
The image sequences and bounding box annotations used when training this network come from the Amsterdam Library of Ordinary Videos 300++ dataset (ALOV) [36]. The dataset consists of approximately 90000 frames from 314 video sequences ordered in 14 different categories, aimed to cover a diverse set of circumstances such as illumination, transparency, specularity, confusion with similar objects, clutter, occlusion, zoom, severe shape changes, motion patterns, low contrast images and more [36]. The image sequences come from short videos with an average length of 9.2 seconds, ranging up to a few minutes at most. Every fifth video frame is annotated with a ground truth rectangular bounding box.
The part of the network excluding the fully connected layers is pre-trained on ImageNet [31] before freezing the associated weights. This means that only the weights connected to the fully connected layers will be adjusted during training.
3.1.1 Data Preprocessing
The mean value for each color channel in the ImageNet training set is first subtracted from all the image training data before normalizing the data by dividing all pixels by 255, the largest pixel value.
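A minimal sketch of this preprocessing step; the per-channel ImageNet means used below are commonly quoted values and are an assumption here, since the thesis does not list them.

```python
import numpy as np

# Commonly used ImageNet per-channel means (RGB); assumed values.
IMAGENET_MEAN = np.array([123.68, 116.78, 103.94], dtype=np.float32)

def preprocess(image):
    """Subtract the ImageNet channel means, then scale by the largest pixel value."""
    image = image.astype(np.float32)
    image -= IMAGENET_MEAN
    return image / 255.0
```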
3.1.2 Data Augmentation
The training dataset is increased by a factor of 10 using data augmentation. Each image crop is transformed using the method from section 2.4.2 to produce the second image crop to use as input. If the augmented image had a width or height 40% larger or smaller than the original image, the augmented image was discarded. The same was done for augmented images where the object's center had moved more than half of the original image's width or height. This was done to prevent the network from being exposed to unrealistic scenarios or excessive movements.
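A sketch of the augmentation and rejection rules described above; the Laplace scale parameters b_shift and b_scale are assumptions, as the thesis does not state them.

```python
import numpy as np

rng = np.random.default_rng()

def augment_crop(cx, cy, w, h, img_w, img_h, b_shift=0.2, b_scale=0.1):
    """Jitter a crop (center cx, cy and size w, h) as in section 2.4.2.
    Returns None if the sample violates the rejection rules."""
    # Translation scaled by object size, Laplace distributed with mean 0.
    new_cx = cx + w * rng.laplace(0.0, b_shift)
    new_cy = cy + h * rng.laplace(0.0, b_shift)
    # Width/height scaling, Laplace distributed with mean 1.
    new_w = w * rng.laplace(1.0, b_scale)
    new_h = h * rng.laplace(1.0, b_scale)

    # Reject crops whose size changed by more than 40 %.
    if not (0.6 * w <= new_w <= 1.4 * w and 0.6 * h <= new_h <= 1.4 * h):
        return None
    # Reject crops whose center moved more than half the image width/height.
    if abs(new_cx - cx) > img_w / 2 or abs(new_cy - cy) > img_h / 2:
        return None
    return new_cx, new_cy, new_w, new_h
```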
3.2 Network Input / Output
The network architecture consists of two parallel MobileNetV2 networks [32] concatenated into two fully connected layers with 2048 nodes each. The input to each network consists of one image crop of size 224 x 224 x 3, from two subsequent timesteps, similar to [15], [13], [20], [10]. The crops are taken from an image sequence at time t and t + 1. The cropped sections for both crops are centered at the object at time t, with a width and height twice the size of the bounding box width and height. The output of the network is then the bounding box coordinates for the object at time t + 1.

It's important that the time between two subsequent image frames is small enough that the object has not become too occluded or moved far enough to be outside the crop at time t + 1.
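A hedged Keras sketch of this architecture: one MobileNetV2 backbone (pretrained on ImageNet and frozen, as in section 3.1) applied to both crops, followed by two fully connected layers with 2048 nodes and four bounding-box outputs. The global average pooling and the ReLU activations in the fully connected layers are assumptions not specified in the thesis.

```python
from tensorflow import keras
from tensorflow.keras import layers

# Shared (siamese) MobileNetV2 feature extractor, pretrained on ImageNet.
backbone = keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet", pooling="avg")
backbone.trainable = False               # freeze the pretrained weights

crop_t = keras.Input(shape=(224, 224, 3), name="crop_t")
crop_t1 = keras.Input(shape=(224, 224, 3), name="crop_t_plus_1")

features = layers.Concatenate()([backbone(crop_t), backbone(crop_t1)])
x = layers.Dense(2048, activation="relu")(features)
x = layers.Dense(2048, activation="relu")(x)
# Four outputs: upper-left and lower-right bounding box corners at time t + 1.
bbox = layers.Dense(4, name="bbox")(x)

tracker = keras.Model([crop_t, crop_t1], bbox)
tracker.summary()
```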
3.3 Training
3.3.1 Training Data Preparation
The ALOV dataset used for training consists of large images of different sizes with corresponding bounding box coordinates for a portion of the images. The images annotated with ground truth coordinates were read, and the objects within the images were cropped with twice the width and height of the ground truth bounding box, centered at the same position. Data augmentation is first used to increase the size of the training dataset by a factor of 10 by transforming the crop using the method from section 2.4.2. For all images with another subsequent image belonging to the same image sequence, another crop was produced from the second image, centered at the object from the first image, with the same width and height. The crops are all resized to 224 x 224 x 3, which is the desired input size for the network.
3.3.2 Network Training
The idea behind the architecture of the network is to teach the network to find the similarities between the two crops and to get an idea of typical movement patterns. During training, the network was fed with the image crop pairs created in the training data preparation. The network then measures the loss between the predicted value and the ground truth value and updates the weight parameters of the network accordingly. The network was trained with the Adam optimizer [21], using mean absolute error (L1 loss) as the loss function, a learning rate of 10^{-3} and a batch size of 50. The tracker was trained multiple times with varying training times. The training duration was measured in epochs, where one epoch is a full run-through of the entire training dataset.
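A minimal sketch of this training configuration, reusing the `tracker` model from the sketch in section 3.2; the placeholder arrays and the number of epochs stand in for the actual augmented crop pairs and training durations.

```python
import numpy as np
from tensorflow import keras

# Placeholder training data: in the thesis these are the augmented crop pairs
# and ground-truth bounding boxes prepared from the ALOV dataset.
crops_t = np.zeros((100, 224, 224, 3), dtype=np.float32)
crops_t1 = np.zeros((100, 224, 224, 3), dtype=np.float32)
targets = np.zeros((100, 4), dtype=np.float32)

tracker.compile(
    optimizer=keras.optimizers.Adam(learning_rate=1e-3),
    loss="mean_absolute_error",          # mean absolute error = L1 loss
)
tracker.fit([crops_t, crops_t1], targets, batch_size=50, epochs=10)
```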
3.4 Tracking
When initializing the tracker, the only information given are the coordinates for the initial bounding box’s upper left and lower right corner, (x1,y1) and (x2,y2). The tracker will then crop the first two images of the image sequence centered at the initial bounding box but with twice the width and height such that
Crop width = 2 ∗ (x2 − x1), Crop height = 2 ∗ (y2 − y1). (3.1)
The output of the network will be the predicted coordinates for the object in the second image. The tracker will then crop the second and third images with the same procedure as in (3.1), but based on the new predicted coordinates, and continue doing this for the entirety of the image sequence. This also highlights the importance of tracking robustness, considering that all the training data is based on the object being close to perfectly centered in the initial crop. If the tracker's output is not particularly accurate, the object will no longer be centered in the next crop pair, which further risks losing track of the object. This is perhaps the greatest challenge for offline trackers compared to their online counterparts. The tracking procedure is illustrated in the flowchart in Figure 3.1.
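A simplified sketch of the tracking loop in Figure 3.1. It assumes the network outputs coordinates directly in the full image frame; in practice the predicted box must be mapped from crop coordinates back to image coordinates, a step this sketch leaves out.

```python
import numpy as np
import tensorflow as tf

def crop_around(image, box, size=224):
    """Crop a window centered on `box` with twice its width and height (3.1)
    and resize it to the network input size."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2.0, (y1 + y2) / 2.0
    w, h = 2.0 * (x2 - x1), 2.0 * (y2 - y1)
    img_h, img_w = image.shape[:2]
    left, right = int(max(0, cx - w / 2)), int(min(img_w, cx + w / 2))
    top, bottom = int(max(0, cy - h / 2)), int(min(img_h, cy + h / 2))
    crop = image[top:bottom, left:right]
    return tf.image.resize(crop, (size, size)).numpy()

def track(frames, init_box, model):
    """Offline tracking loop from Figure 3.1: each prediction seeds the next crop."""
    boxes = [init_box]
    for prev_frame, curr_frame in zip(frames[:-1], frames[1:]):
        crop_t = crop_around(prev_frame, boxes[-1])
        crop_t1 = crop_around(curr_frame, boxes[-1])   # same window at time t
        pred = model.predict([crop_t[None], crop_t1[None]], verbose=0)[0]
        boxes.append(tuple(pred))                      # predicted box seeds the next crop
    return boxes
```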
Figure 3.1: Flowchart of tracking procedure, figure created in app.diagrams.net
The test/tracking dataset consists of videos from the Visual Object Tracking (VOT2014) benchmark test dataset. Some of these videos also belong to the ALOV dataset and were therefore discarded, to prevent testing on the training data.
3.5 Tracking Using the Jetbot
The Jetbot was connected to a computer through a shared network using a WiFi adapter, which allowed the user to control the Jetbot using Jupyter Notebook on the computer [19]. The Jetbot included a 64 GB pre-flashed MicroSD card containing library packages designed for using the NVIDIA Jetson Nano Developer Kit together with the Jetbot robot [37]. The camera provided was a Leopard Imaging camera with a 145-degree field of view and could easily be accessed through the simple commands provided. A screenshot of a typical camera view can be seen in Figure 3.2.
Figure 3.2: The view of the Jetbot
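Recording such an image sequence from the Jetbot could be done roughly as below, assuming the jetbot Python package included on the pre-flashed SD card; the exact camera API and the recording rate are assumptions and may differ between image versions.

```python
import time
import cv2
from jetbot import Camera  # assumed to be available on the pre-flashed image

# The Camera object is assumed to expose the latest frame as a NumPy array in `value`.
camera = Camera.instance(width=224, height=224)

frames = []
for _ in range(200):                  # record roughly 20 seconds of video
    frames.append(camera.value.copy())
    time.sleep(0.1)                   # about 10 frames per second (assumed rate)
camera.stop()

for i, frame in enumerate(frames):    # save the sequence for offline tracking
    cv2.imwrite(f"frame_{i:04d}.png", frame)
```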
Initially, the idea was to run the tracker in real time with the Jetbot, but there was a small delay between the image being recorded and the image being displayed on the computer, making it difficult to apply the tracker. Instead, an image sequence was recorded from the Jetbot and the tracker was applied afterwards. The tracker first had to be reduced in size significantly before being applied to the robot, due to memory limitations. The initial size of the model was slightly larger than 3 GB, mostly attributed to the large number of weights between the first fully connected layer and the preceding layer of the two MobileNetV2 networks. The last layer of the MobileNetV2 networks had accidentally been discarded, connecting the first fully connected layer to the previous layer, which has a significantly higher number of nodes. Reintroducing the final layer of the siamese MobileNetV2 networks led to far fewer weights and a much smaller model, allowing it to be applied to the Jetbot.
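The effect of the missing layer can be illustrated with a small parameter-count comparison, assuming the discarded layer was MobileNetV2's final global average pooling stage; the fully connected layer width of 1024 is only an assumed example, not the value used in the thesis.

```python
import numpy as np
from tensorflow.keras.applications import MobileNetV2

# Backbone without the final pooling layer: the fully connected layer would see
# the full 7 x 7 x 1280 feature map.
no_pool = MobileNetV2(include_top=False, weights=None, input_shape=(224, 224, 3))
flat_units = int(np.prod(no_pool.output_shape[1:]))        # 7 * 7 * 1280 = 62720

# Backbone with global average pooling reintroduced: a 1280-dimensional vector.
pooled = MobileNetV2(include_top=False, weights=None, pooling="avg",
                     input_shape=(224, 224, 3))
pool_units = int(pooled.output_shape[1])                    # 1280

fc = 1024  # assumed width of the first fully connected layer
print("weights into the FC layer without pooling:", flat_units * fc)  # ~64 million per branch
print("weights into the FC layer with pooling:   ", pool_units * fc)  # ~1.3 million per branch
```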
4.1 Tracking Scenarios
Figure 4.1: Images taken from a video following a ball, with the tracker trained for 5, 10 and 20 epochs (rows from top: 5 epoch tracker, 10 epoch tracker, 20 epoch tracker); the object's predicted position according to the tracker is depicted by the green bounding box
Figure 4.2: Region overlap score for the ball sequence, with the tracker trained for 5, 10 and 20 epochs, red line showing the chosen threshold at 0.5
Figure 4.1 depicts a red ball being kicked back and forth between two persons. The images for the 5 and 10 epoch trackers are taken at the same points in the sequence, shown chronologically from left to right and depicting the entire sequence, while the images for the 20 epoch tracker are taken from a shorter sequence, because it loses track earlier. The major difficulties with this tracking situation lie in the rapid movement and change of direction of the ball when being kicked, as well as the rotation of the ball, which is likely largely mitigated by the distinct, separate background. The clip is rather long at around 20 seconds, which highlights one of the major difficulties with offline trackers: the tracker is only provided with the initial bounding box coordinates and has to rely on its own predictions for future frames to be used as input. Consecutive small errors therefore accumulate into larger deviations from the ground truth, which in turn makes the tracker unstable. As can be observed in the figure, the tracker with the shortest training performed considerably better than the other two, managing to keep track of the ball for the entire duration.
The tracker that had been trained for 10 epochs managed to follow the ball for some time before the predicted bounding box started to get stretched out, as can be seen in the fourth picture of the sequence. The stretched bounding box results in a considerably larger search area, due to the cropping procedure where a crop with twice the width and height of the bounding box is used. This leads to a large portion of the background being included in the search area, which in this case includes one of the persons, as seen in the last frame where the tracker suddenly starts to wander off, completely losing track of the ball.
The tracker that had been trained the longest performed the worst, continuously expanding before losing track entirely. This could be the result of overfitting, where the tracker relies too heavily on the training set. It is worth noting that the occurrence of sphere-shaped objects is rather limited in the training set, which can increase the difficulty. The fact that the tracker trained for 5 epochs performed significantly better than the others could be due to the others being overfitted, but this was also the only scenario where this tracker performed best. A reason for this might be the nature of the scenario: an object not frequently seen during training, against a distinct, separate background, can more easily be followed by a tracker not yet adapted to more complex situations.
Figure 4.2 shows the region overlap between the predicted bounding box and the ground truth. The 5 epoch tracker had a dip at the beginning of the sequence but managed to recover, maintaining a high overlap although with notable oscillations. The 10 epoch tracker lost track completely at around the 400th frame but managed to recover, before losing track again at the end. It is worth noting that while the 5 epoch tracker managed to stay above 0.7 for most of the time, the 10 epoch tracker often hovered around the threshold. The 20 epoch tracker lost track quickly.
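The region overlap score reported in these figures is the standard intersection-over-union measure between the predicted and ground truth boxes; a straightforward implementation for corner-coordinate boxes is sketched below.

```python
def region_overlap(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# A frame counts as successfully tracked when the overlap exceeds the 0.5 threshold.
```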
Figure 4.3: Images taken from a video following a basketball player, with the tracker trained for 5, 10 and 20 epochs (rows from top: 5 epoch tracker, 10 epoch tracker, 20 epoch tracker); the object's predicted position according to the tracker is depicted by the green bounding box
Figure 4.4: Region overlap score for the basketball sequence, with the tracker trained for 5, 10 and 20 epochs, red line showing the chosen threshold at 0.5
The scenario shown in Figure 4.3 proved to be one of the more difficult tracking scenarios, involving occlusions by other players, a highly detailed and dynamic background, and a long video duration. None of the trackers performed particularly well, with the 20 epoch tracker being the worst, immediately drifting off; therefore only a short sequence is shown for it. The other two trackers performed similarly. After initially following the intended player in green, the tracker switches to the player in white after the players' close encounter and continues to follow this player for a while before eventually losing track. Since the tracker has no memory during tracking, it is easy to understand how it can lose track of the green player when the white player enters the search region, and so while it did manage to follow one of the players, it was ultimately the wrong player and the result can be considered a tracking failure. Interestingly, the tracker also loses track of the wrong player at the end, seemingly being drawn to the distinct red circle in the middle.
Figure 4.4 shows no particularly interesting observations. The 10 epoch tracker managed to recover for a short duration after completely losing track, but not for long.
Figure 4.5: Images taken from a video following a bicycle, with the tracker trained for 5, 10 and 20 epochs (rows from top: 5 epoch tracker, 10 epoch tracker, 20 epoch tracker); the object's predicted position according to the tracker is depicted by the green bounding box
Figure 4.6: Region overlap score for the bicycle sequence, with the tracker trained for 5, 10 and 20 epochs, red line showing the chosen threshold at 0.5
In Figure 4.5 a woman is shown riding her bicycle down the road. Once again, the tracker that had been trained the longest performed the worst, managing to keep track for a short while before the bounding box was stretched out and the tracker wandered off, showing the same negative behaviour displayed in the two previous scenarios. The tracker trained for 5 epochs kept good track of the woman at the beginning but eventually lost track after a group of people passed by in the background, and it never regained it. The tracker trained for 10 epochs performed the best, following the woman accurately for the entire duration despite this being a rather challenging task. The main difficulties in this scenario were the shaky camera work, the size change of the woman, who first gets closer to the camera before biking away, as well as the detailed background, with a group of people appearing in the middle of the video.
Figure 4.6 shows the three trackers' performance on the bicycle sequence. Both the 5 and 20 epoch trackers had decent performance before frame 150, presumably when the crowd of people appeared in the background. The 5 epoch tracker miraculously managed to recover at the end after having completely lost track for a significant amount of time, but that can probably be attributed to pure luck. The 10 epoch tracker performed well for the entire sequence, although with ever decreasing performance, making it unclear whether it would continue to perform well for a similar video of longer duration.
Figure 4.7: Images taken from a video following a car, with the tracker trained for 5, 10 and 20 epochs (rows from top: 5 epoch tracker, 10 epoch tracker, 20 epoch tracker); the object's predicted position according to the tracker is depicted by the green bounding box
Figure 4.8: Region overlap score for the car sequence, with the tracker trained for 5, 10 and 20 epochs, red line showing the chosen threshold at 0.5
Figure 4.7 is notably the only video in the test data where the longest trained tracker performed the best. While there is a small obstacle in the trees slightly obscuring the car in the middle of the sequence, the major difficulty in this task is the significant size change of the car driving down the road. While the 5 and 10 epoch trackers both managed to follow the car for the entire duration, they both had problems adapting to the size change, predicting a slightly bigger bounding box than the initial one but nowhere near the size of the car at the end. Although the 20 epoch tracker still had some problems encapsulating the car within the bounding box for the entire duration, as can be seen in the third and fourth frames, the tracker is still deemed successful, managing to keep an accurate track of the car for the full length of the video.
Figure 4.8 shows how the region overlap score continues to decrease for the 5 and 10 epoch trackers, both unable to account for the significant size change over the sequence. The 20 epoch tracker also shows a decreasing tendency.
Figure 4.9: Images taken from a video following a jogger, with the tracker trained for 5, 10 and 20 epochs (rows from top: 5 epoch tracker, 10 epoch tracker, 20 epoch tracker); the object's predicted position according to the tracker is depicted by the green bounding box
Figure 4.10: Region overlap score for the jogging sequence, with the tracker trained for 5, 10 and 20 epochs, red line showing the chosen threshold at 0.5
In Figure 4.9 two women are seen jogging alongside each other. Once again the longest trained tracker starts by expanding its search area, quickly losing track and drifting away. Both the other trackers perform decently, managing to keep track of the intended jogger even through a small occlusion by a traffic light pole. The shortest trained tracker gets slightly confused right at the end, still tracking the jogger but with a potential risk of losing track. When the background changes from the distinguishable green grass to the lighter, grey concrete, the 10 epoch tracker instead switches to the other jogger.
Figure 4.10 shows how the 5 epoch tracker manages to stay above the threshold for the entire duration, with significant oscillations but no sign of decreasing performance. The 10 epoch tracker, on the other hand, keeps good track for the most part but has a significant, short dip at the beginning before recovering. Its performance is also constantly decreasing, and it eventually loses track completely at the end. The 20 epoch tracker lost track multiple times and never managed to keep a high overlap score.
Figure 4.11: Average region overlap score for the trackers trained for 5, 10 and 20 epochs over all five scenarios, with the scenarios in the same order as presented in the Results section; the threshold value is shown as a red dotted line and the green line represents the average over all scenarios
Unsurprisingly, after having reviewed the scenarios, the longest trained tracker at 20 epochs performed the worst, with an average region overlap slightly above 0.3. Its presence in the report is mostly justified by its unique performance on the car sequence; otherwise it showed unremarkable results for a tracker. The 5 and 10 epoch trackers both achieved almost the same averages, slightly below the chosen threshold of 0.5. While the 5 epoch tracker performed great on the ball and jogging sequences, the 10 epoch tracker outperformed it on both the bicycle and car sequences. Both their averages are heavily influenced by the poor performance on the basketball sequence. The results for the 5 epoch tracker were rather unpredictable, ranging from excellent to rather poor (excluding the basketball sequence), while the 10 epoch tracker did not achieve the same long-duration robustness and high region overlap as the 5 epoch tracker but was more predictable in its performance, hovering slightly above the threshold and often providing decent tracking for a larger variety of scenarios, albeit with decreasing performance over time. For a tracker that is supposed to be as generic as possible, working for multiple different scenarios, the 10 epoch tracker might be the preferred option.
Figure 4.12: The model loss and accuracy while being trained for 50 epochs
Figure 4.12 shows the training accuracy, as well as the training and validation loss, for a model trained for 50 epochs. The training loss continues to decrease slightly for the entire duration, while the validation loss also decreases but at a significantly slower rate. This might be one of the signs that the model is becoming overfitted to the training data. However, it was difficult to pinpoint exactly where this overfitting started to occur. In addition to the trackers trained for 5, 10 and 20 epochs evaluated in this thesis, trackers trained for 3, 7, 15 and 50 epochs were also evaluated, but with insignificant results. It is worth noting that the 50 epoch tracker performed terribly on the entire test dataset, showing typical overfitting behaviour. It is likely that the model's performance can be improved with a larger dataset, allowing for more training before showing signs of overfitting.
4.2 Dataset
The dataset consisted of approximately 16000 annotated frames from 314 video sequences. Increasing the dataset size by a factor of 10 using data augmentation results in a training dataset of around 176000 annotated frames. While this is a rather large dataset that provided adequate training for the tracker to achieve satisfactory results, there is still a possibility that even better results could be attained if the training data were larger still, including far more videos, or still pictures that could be used for data augmentation. This is especially relevant since more training did not necessarily lead to better results. There is always an upper limit on how much training will prove beneficial before the model becomes too overfitted, but the fact that the tracker trained for only 5 epochs performed notably better in some situations than the tracker trained for 10 epochs, which achieved a slightly higher average, suggests that training might be improved by using a larger and more diverse dataset.
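One common way to obtain such additional training pairs, from annotated video frames or from still images, is to randomly translate and scale the crop window so that the object appears to have moved between the two crops. The sketch below illustrates the idea; the uniform jitter distribution and its limits are assumptions and not the augmentation method actually described in Section 2.4.2.

```python
import random


def jitter_box(box, img_w, img_h, max_shift=0.2, max_scale=0.2):
    """Produce a randomly translated and scaled version of a bounding box,
    simulating object motion between two consecutive frames."""
    x1, y1, x2, y2 = box
    w, h = x2 - x1, y2 - y1
    cx = (x1 + x2) / 2 + random.uniform(-max_shift, max_shift) * w
    cy = (y1 + y2) / 2 + random.uniform(-max_shift, max_shift) * h
    w *= 1 + random.uniform(-max_scale, max_scale)
    h *= 1 + random.uniform(-max_scale, max_scale)
    # Clamp so the jittered box stays inside the image.
    x1, y1 = max(0, cx - w / 2), max(0, cy - h / 2)
    x2, y2 = min(img_w, cx + w / 2), min(img_h, cy + h / 2)
    return (x1, y1, x2, y2)

# For each annotated frame, several jittered boxes can be paired with the original
# one to multiply the number of training pairs (e.g. by a factor of ten).
```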
4.3 Tracking Speed
The main idea behind using an offline tracker is the speed at which it can process images. The tracker used in this thesis achieved a speed of 50 frames per second (fps) on the computer specified in the report. This speed could likely be increased further by using different optimization methods for neural networks, or by using a computer with an even faster GPU.
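The frame rate was obtained by dividing the number of processed frames by the elapsed wall-clock time; a minimal measurement, reusing the track_sequence sketch from the tracking section, might look as follows.

```python
import time

start = time.perf_counter()
predicted_boxes = track_sequence(frames, init_box, model)  # from the earlier sketch
elapsed = time.perf_counter() - start

fps = (len(frames) - 1) / elapsed  # one prediction per consecutive frame pair
print(f"Tracking speed: {fps:.1f} frames per second")
```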
4.4 Tracking With the Jetbot
After reducing the size of the model, the tracker was applied to the Jetbot. An image sequence of a cup moving along a desk was first recorded, and the tracker was then applied. The tracker's performance was reduced significantly after the size reduction, as can be seen in Figure 4.13.
Figure 4.13: The reduced tracker being applied to a short image sequence following a cup
The reduced tracker quickly loses track and the bounding box becomes stretched out. Interestingly, when downloading the images and running the original tracker on the sequence, the results were also rather poor. While the performance was better than for the reduced tracker, it still had trouble following the cup accurately. Figure 4.14 shows the original tracker being applied to the same image sequence, following the cup for a short while before the bounding box also becomes stretched out. This might be an indication that the images from the Jetbot differ from those in the training and test datasets. The Jetbot camera has a field of view of 135 degrees, which might have an impact on the predictions of the tracker.
Figure 4.14: The original tracker being applied to a short image sequence following a cup
The original model of the tracker was larger than 3 GB, consisting mostly of millions of weights. The vast majority of these weights originated from the connection between the first fully connected layer and the last layers of the MobileNet networks. This had initially been a mistake, where the last layer of the parallel MobileNet networks from Figure 2.20 had been removed, connecting all the nodes of the first fully connected layer to the previous layer in the MobileNet networks. This resulted in considerably more weights than intended, but it may also have been crucial for the original tracker's performance.
In order to be applied to the Jetbot, the size of the tracker had to be reduced significantly. The size reduction led to a model with worse performance. Time constraints became an issue due to the long training times of the models, ranging from 7 hours to 4 days per run. The loss and validation loss curves can indicate when the model is starting to overfit to the training data, but they do not show when the best model has been achieved, as illustrated by the differing results of the 5, 10 and 20 epoch trackers. The reduced tracker could possibly also be trained to perform well, by trying different hyperparameters, introducing or removing layers of the network, and finding a good training duration. If the number of parameters in the model is crucial for its performance, the reduced model could be enlarged slightly, or perhaps greatly if another, more optimized architecture is used.
Conclusion & Further Work
While the tracker did show satisfactory results in many situations, it still had difficulties with some harder scenarios. One of the most important aspects when it comes to training neural networks is access to good data. This usually translates to a large and diverse dataset that can further improve training. Currently, the number of video sequences with annotated bounding boxes is rather limited, which could limit the potential for robust, accurate trackers. In the meantime, the dataset could be expanded by using data augmentation on still images with annotated bounding boxes, creating the illusion of movement.
For any potential application, the training dataset as well as the data augmentation algorithm could be adjusted to account for a more specific task, such as a specific set of target objects, slower or faster movement patterns, or a specific background. While the tracker can be considered very fast compared to other trackers that operate online at all times, some compromise could perhaps be reached where the tracker is still extensively trained offline while being allowed to adjust some of its inner parameters during tracking, allowing for a more robust long-duration tracker that still maintains a high speed.
The tracker had to be reduced significantly in size before being applied to the Jetbot. This was due to its original size of 3 GB, while the Jetbot only had 4 GB of RAM. The tracker was reduced to around 70 MB and then successfully applied to the Jetbot, but it performed rather poorly on simple tracking scenarios. It is unclear whether the reduced tracker can achieve similar performance to the original tracker, but its performance can likely be improved by further testing. This testing could include hyperparameter tuning, increased training times, a larger training dataset and model optimization.
Bibliography
[1] Taiwo Oladipupo Ayodele. “Machine learning overview”. In: New Advances in Machine Learning (2010).

[2] SH Shabbeer Basha et al. “Impact of fully connected layers on performance of convolutional neural networks for image classification”. In: Neurocomputing 378 (2020), pp. 112–119.
[3] Luca Bertinetto et al. “Fully-convolutional siamese networks for object tracking”. In: European conference on computer vision. Springer. 2016, pp. 850–865.
[4] Christopher M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006. isbn: 0387310738.

[5] François Chollet. Deep Learning with Python. Manning, Nov. 2017. isbn: 9781617294433.

[6] François Chollet et al. Keras. https://keras.io. 2015.

[7] CS231n: Convolutional Neural Networks for Visual Recognition. http://cs231n.github.io/classification/. Accessed: 2020-02-20.
[8] Jia Deng et al. “ImageNet: a Large-Scale Hierarchical Image Database”. In: June 2009, pp. 248–255. doi: 10.1109/CVPR.2009.5206848.
[9] Dogs vs. Cats. https://www.kaggle.com/c/dogs-vs-cats. Accessed: 2020-03-20.
[10] Philipp Fischer et al. “Flownet: Learning optical flow with convolutional networks”. In: arXiv preprint arXiv:1504.06852 (2015).
[11] Ian J. Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. http://www.deeplearningbook.org. Cambridge, MA, USA: MIT Press, 2016.
[12] Google Machine Learning Crash Course. https://developers.google.com/machine-learning/crash-course. Accessed: 2020-02-20.

[13] Daniel Gordon, Ali Farhadi, and Dieter Fox. “Re3: Real-Time Recurrent Regression Networks for Object Tracking”. In: CoRR abs/1705.06368 (2017). arXiv: 1705.06368. url: http://arxiv.org/abs/1705.06368.

[14] Aurélien Géron. Hands-On Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. 1st. O'Reilly Media, Inc., 2017. isbn: 1491962291.

[17] Sergey Ioffe and Christian Szegedy. “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift”. In: CoRR abs/1502.03167 (2015). arXiv: 1502.03167. url: http://arxiv.org/abs/1502.03167.

[18] Jetson Nano Developer Kit Technical Specifications. https://developer.nvidia.com/embedded/jetson-nano-developer-kit. Accessed: 2020-03-06.
[19] Jupyter Notebook. https://jupyter.org/. Accessed: 2020-05-26.
[20] Andrej Karpathy et al. “Large-scale video classification with convolutional neural networks”. In: Proceedings of the IEEE conference on Computer Vision and Pattern Recognition. 2014, pp. 1725–1732.
[21] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic op- timization”. In: arXiv preprint arXiv:1412.6980 (2014).
[22] Samuel Kotz, Tomasz Kozubowski, and Krzysztof Podgórski. The Laplace Distribution and Generalizations. Jan. 2001, p. 19. isbn: 0-8176-4166-1. doi: 10.1007/978-1-4612-0173-1_5.

[23] Matej Kristan et al. “The visual object tracking vot2015 challenge results”. In: Proceedings of the IEEE international conference on computer vision workshops. 2015, pp. 1–23.

[24] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. “ImageNet Classification with Deep Convolutional Neural Networks”. In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira et al. Curran Associates, Inc., 2012, pp. 1097–1105. url: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
[25] David L. Poole and Alan K. Mackworth. Artificial Intelligence: Foundations of Computational Agents. USA: Cambridge University Press, 2010. isbn: 0521519004.
[26] Jang Lee and Kwanggi Kim. “Applying Deep Learning in Medical Images: The Case of Bone Age Estimation”. In: Healthcare Informatics Research 24 (Jan. 2018), p. 86. doi: 10.4258/hir.2018.24.1.86.
[27] Martín Abadi et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. Software available from tensorflow.org. 2015. url: http://tensorflow.org/.
[28] Pamela McCorduck et al. “History of Artificial Intelligence.” In: IJCAI. 1977, pp. 951–954.
[29] Tom M. Mitchell. Machine Learning. New York: McGraw-Hill, 1997. isbn: 978-0-07-042807-2.
[30] Siddhartha Sankar Nath et al. “A survey of image classification methods and techniques”. In: 2014 International Conference on Control, Instrumentation, Communication and Computational Technologies (ICCICCT). IEEE. 2014, pp. 554–557.

[31] Olga Russakovsky et al. “ImageNet Large Scale Visual Recognition Challenge”. In: International Journal of Computer Vision (IJCV) 115.3 (2015), pp. 211–252. doi: 10.1007/s11263-015-0816-y.

[32] Mark Sandler et al. “Mobilenetv2: Inverted residuals and linear bottlenecks”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018, pp. 4510–4520.

[33] Shibani Santurkar et al. “How does batch normalization help optimization?” In: Advances in Neural Information Processing Systems. 2018, pp. 2483–2493.

[34] Jürgen Schmidhuber. “Deep learning in neural networks: An overview”. In: Neural networks 61 (2015), pp. 85–117.

[35] Patrice Y Simard, David Steinkraus, John C Platt, et al. “Best practices for convolutional neural networks applied to visual document analysis.” In: Icdar. Vol. 3. 2003.
[36] Arnold WM Smeulders et al. “Visual tracking: An experimental survey”. In: IEEE transactions on pattern analysis and machine intelligence 36.7 (2013), pp. 1442–1468.
[37] SparkFun JetBot AI Kit. https://www.sparkfun.com/products/retired/15365. Accessed: 2020-03-31.

[38] Nitish Srivastava et al. “Dropout: a simple way to prevent neural networks from overfitting”. In: The journal of machine learning research 15.1 (2014), pp. 1929–1958.

[39] Ran Tao, Efstratios Gavves, and Arnold WM Smeulders. “Siamese instance search for tracking”. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 1420–1429.

[40] The pdf of the Laplace distribution. https://commons.wikimedia.org/wiki/File:Laplace_pdf_mod.svg. Accessed: 2020-04-03.

[41] Jason Yosinski et al. “How transferable are features in deep neural networks?” In: CoRR abs/1411.1792 (2014). arXiv: 1411.1792. url: http://arxiv.org/abs/1411.1792.
[42] Aston Zhang et al. Dive into Deep Learning. https://d2l.ai. 2020.
[43] Xin Zhang et al. “Object Class Detection: A Survey”. In: ACM Computing Surveys (CSUR) 46 (Oct. 2013). doi: 10.1145/2522968.2522978.
Lund University Box 118, SE-221 00 Lund, Sweden
http://www.maths.lth.se/