Cognitively Inspired Real-Time Vision Core
Ozgur Yilmaz1
Department of Computer Engineering, Turgut Ozal University, Ankara, Turkey
Ismail Ozsarac
Department of Electronics Design, MGEO Division, Aselsan Inc., Akyurt, Ankara, 06011,
Turkey.
Omer Gunay
Department of Electronics Design, MGEO Division, Aselsan Inc., Akyurt, Ankara, 06011,
Turkey.
Huseyin Ozkan
Department of Image Processing, MGEO Division, Aselsan Inc., Akyurt, Ankara, 06011,Turkey.
Abstract
We introduce a novel, cognitively inspired binary image representation and use
it in a real-time computer vision core that is capable of simultaneously
detecting a specific object in an image, classifying an image region provided
by an algorithm such as motion detection, and tracking multiple objects
in a video. In this framework, hidden layer representations of binary receptive
field neural networks are utilized to generate compact image representations
for various classification based functionalities. Object detection is implemented
by learning a classifier on hidden layer activities and performing sliding window
based search on the image. Moreover, a classification based object tracking
algorithm is introduced that uses the proposed framework, whose tracking per-
formance in standard datasets is shown to be comparable to the state of the art
Email addresses: [email protected] (Ozgur Yilmaz),[email protected] (Ismail Ozsarac), [email protected] (Omer Gunay),[email protected] (Huseyin Ozkan)
1Corresponding author
Preprint submitted to Journal of LATEX Templates May 7, 2015
techniques. We further propose several additional functionalities on the same
core, such as sparse interest point extraction, salient motion detection, and scene
recognition. With this set of capabilities in the arsenal, artificial vision is expected
to perform the necessary fundamental operations in real time, paving a way
for more complex inferences, such as geometric computations and cross-modal
information fusion.
Keywords: Artificial Neural Networks, Cognitive Architecture, Computer
Vision, Local Binary Pattern, Hardware Implementation, Object
Classification, Object Tracking
1. Introduction
Neural network approaches in computer vision have become increasingly
popular due to their remarkably high performance in complex tasks such as
large-scale classification [1] or multi-modal fusion [2]. The strength of neural
networks is attributed to several mechanisms such as unsupervised feature learning
from unlabeled data (as opposed to hand design of features) [3], hierarchical
processing via deep architectures (which discovers higher order relationships)
[4, 5] and exploitation of long-range statistical dependencies using recurrent
processing [3]. Neural networks provide an approach similar to kernel methods
[6]: the input is projected onto a high dimensional space of hidden units instead of
basis functions, wherein a hyperplane is able to partition the data [7]. Since this
lifting to high dimension yields a powerful representation of the visual data, it
can naturally be utilized for several tasks, such as classification, detection,
tracking, clustering and interest point detection. Thus, after an image or a video block
is "analyzed" by a neural network via multi-layer processing, the hidden layer
activities that represent the visual input can be demultiplexed to many tasks,
in line with the processing carried out in the human brain as theorized in cognitive
neuroscience [8].
Neural networks are also targeting real-time embedded visual processing
needs [9], which have been growing continuously with increased demands in
intelligent robotic platforms, such as Unmanned Aerial Vehicles (UAV). These
systems are expected to navigate and operate autonomously, which entails
successful implementations of image and video understanding functions.
Scene recognition, detection of specific objects, classification of moving objects
and object tracking are some of the essential visual functionalities in an
autonomous robotic system. Weight and energy specifications of such systems
restrict both the number of available functionalities and the computational
complexity of the algorithms, and hence diminish the operational capacity. We
hypothesize that an embedded implementation of a visual processing core serving
as the common computational block for at least a subset of these functionalities
can relax the aforementioned restrictions (Fig. 1). In this paper, we show that a
Field Programmable Gate Array (FPGA) implementation (see [10] for VLSI
architectures of artificial cognitive systems) of a neural network based vision block
is an efficient and effective method for embedded visual processing. The sparse and
overcomplete image representation formed in the neural network hidden layers
provides versatility and discriminative power [11, 12], which is harnessed for
a set of distinct processing needs. Our approach combines several computer
vision modes on a common processing core, very similar to recent advances in core
video processing functions (optical flow, disparity orientation computation) [13].
Specifically, we show that object detection, classification and tracking functions
can be executed on the same FPGA core in real time, which can be embedded
in a robotic platform for surveillance and reconnaissance missions.
To enable this common architecture we propose a novel image
representation, which is the most important contribution of the paper. We
introduce a single layer neural network architecture with binary receptive fields
and sparse binary responses, which greatly simplifies the computations
performed on the FPGA compared to real valued neurons. The architecture resembles
Local Binary Patterns (LBP) [14] but with important differences in feature
computation. The neural network approach we adopt extends the LBP concept since
it additionally has the capability to generate convoluted binary patterns
(convolutional multi-layer network) and explore long-range statistical relationships
(recurrent processing).
In Section 2, we discuss the related work on neural network based
classification and detection algorithms, classification based object tracking approaches,
local binary patterns and FPGA implementations of neural networks. The
binary receptive field based image representation is introduced in Section 3. We
report our experimental findings in Section 4, and highlight possible applications
of the framework by giving results on datasets. The FPGA implementation of the
processing core is given in Section 5. The paper concludes with final remarks
and a discussion of several future research directions in Section 6.
[Figure 1 diagram: Image, Analysis, Sparse Image Representation, feeding Classification (Scene, Moving Object), Detection (Object, Texture), Tracking, and Sparse Features (Interest Point and Descriptor)]
Figure 1: A multi purpose image analysis core creates a sparse and overcomplete image
representation that can be utilized for various applications.
2. Related Work
2.1. Object Classification and Neural Networks
Neural network algorithms have been successfully applied to object
classification problems for more than two decades [4, 15], and their superiority in
performance becomes more visible as computational resources become abundant
[1]. An alternative approach to neural networks is the design of discriminative
visual features, and some of the most prominent coding based methods
apply vector quantization using a learned dictionary of visual words [16, 17].
In a similar fashion, discriminative image representations are learned in
neural network studies using RBMs [3], auto-encoders [18] and convolutional networks
[19, 1]. Sparsity in representation is also emphasized in neural network studies
[20]. The main difference between neural network and classical computer
vision approaches lies in the intermediate representation: a deep hierarchical and
distributed representation learned through both unsupervised pre-training and
supervised backpropagation vs. manual or semi-automatic design of visual
features. Recently, efficient fusion of many different representations has been
investigated [21], which improves upon individual hand-designed features. Despite
the existence of successful hand-designed visual features, exploration of useful
features through a neural network framework shows superior performance in
recent benchmark studies [1].
Even though deep architectures are shown to prevail in a wide range of tasks,
a single layer neural network with a clustering based unsupervised learning
approach can show state-of-the-art classification performance [7]. In this approach,
neural network hidden layer receptive fields are learned via the k-means clustering
algorithm in an offline manner (Fig. 2a). As an efficient vector quantization
method, k-means has been widely used for codebook generation in the computer
vision literature [16], and it is also shown to be a viable alternative for
learning receptive fields in a neural network [7, 22]. Learning receptive fields in a
single layer neural network shares similarities with codebook based computer
vision approaches. However, the neural network framework holds the potential to
be extended to a hierarchical distributed representation with additional layers
as well as to perform recurrent processing. Overall, neural network learning
provides a rich set of mechanisms for automatic discovery of a discriminating
feature space, which is not available in classical computer vision feature design
frameworks.
2.2. Local Binary Patterns and Neural Networks
Local Binary Pattern (LBP) [23, 14] is an image operator that constructs
a feature descriptor based on the texture patterns in the image defined by the
relative pixel values. LBP is shown to be successful in many texture [24] and
object classification tasks [25]. It is argued that signed differences between pixels
convey most of the texture information in images and have useful invariance
properties to intensity changes [26]. Selection of a subset of the available LBP
patterns is shown [27] to be a good strategy since a small subset of the patterns
holds most of the discriminative power. An LBP pattern is equivalent to a
specially constructed fixed binary decision tree [28], and this has been exploited for
learning discriminative LBP-like patterns from data in a supervised
framework, using decision tree induction algorithms. Vector quantization methods
are used in [29, 26] to learn LBP patterns from data in an unsupervised
manner.
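To make the LBP operator concrete, the following is a minimal sketch of the basic 8-neighbour 3 × 3 variant, in which each neighbour contributes one bit depending on the sign of its difference to the center pixel. The function name `lbp_3x3`, the bit ordering and the use of `>=` for the comparison are illustrative assumptions, not details taken from the cited works.

```python
import numpy as np

def lbp_3x3(image):
    """Basic 8-neighbour LBP code for every interior pixel of a 2-D array."""
    img = np.asarray(image, dtype=np.int64)
    h, w = img.shape
    center = img[1:-1, 1:-1]
    # Neighbour offsets, clockwise from the top-left corner.
    # The bit order is an arbitrary (assumed) convention.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros_like(center)
    for bit, (di, dj) in enumerate(offsets):
        neighbour = img[1 + di:h - 1 + di, 1 + dj:w - 1 + dj]
        # Only the sign of the difference is kept, one bit per neighbour.
        codes |= (neighbour >= center).astype(np.int64) << bit
    return codes
```

Because only signed differences are used, adding a constant to all pixel values leaves the codes unchanged, which illustrates the invariance to intensity changes mentioned above.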
2.3. FPGA Implementation of Object Classification, Tracking and Neural Networks
Field Programmable Gate Arrays (FPGAs) are widely used in image
processing and computer vision applications due to their inherent parallelism and
low power consumption [13]. Specifically, image feature extraction algorithms
are well suited for FPGA hardware and recent advances show great potential
for embedded computer vision [30]. Object detection tasks [31] and primary
vision operations such as binary processing [32] have been implemented successfully
on FPGA chips. Recently, there has been an ongoing effort to provide multi-functional
vision architectures implemented in FPGAs [33].
FPGAs are also widely used in the implementation of neural network approaches
(see [9, 34] for a detailed review). In general, neural network hardware
implementations can be classified into three broad categories: DSP, ASIC and FPGA
based. DSP-based implementations are sequential overall, although the
architecture of the neurons is made as parallel as possible. ASIC implementations
are not reconfigurable, even though the hardware usage is optimized. On the
other hand, FPGA implementations preserve the parallel architecture of the
neurons and offer flexibility in reconfiguration [34]. In [35] and [36], the benefits
and obstacles of implementing a neural network in an FPGA are discussed. The
implementation of a multi-input neuron with linear/nonlinear excitation
functions using an FPGA is analyzed in [37]. A recent study focuses on the
implementation of convolutional neural networks on FPGAs [38]. In addition to recognition
and detection, object tracking algorithms are also realized on FPGA chips [39].
However, to the best of our knowledge, a general neural network based FPGA
framework that serves as a common building block for many visual functions
such as classification, detection, tracking and interest point extraction has
not been proposed before.
2.4. Object Detection
Modern object detection algorithms [40] originate from sliding window
foundations [41] enhanced with efficient integral image representations [42] (cf.
[43] for a recent survey on pedestrian detection). In principle, any classification
algorithm can be utilized for object detection. In this respect, neural networks
of many different types are strong candidates. Specifically, convolutional neural
networks are successful in face [44], hand [45] and text [46] detection
applications.
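The integral image idea mentioned above can be sketched as follows: after one pass of cumulative sums, the sum over any rectangular window is obtained from four lookups, which is what makes exhaustive sliding window search affordable. The function names `integral_image` and `box_sum` are hypothetical, and the sketch is a generic illustration rather than the implementation of any cited detector.

```python
import numpy as np

def integral_image(image):
    """Summed-area table with an extra zero row and column at the top-left."""
    img = np.asarray(image, dtype=np.int64)
    ii = np.zeros((img.shape[0] + 1, img.shape[1] + 1), dtype=np.int64)
    ii[1:, 1:] = np.cumsum(np.cumsum(img, axis=0), axis=1)
    return ii

def box_sum(ii, top, left, height, width):
    """Sum of image[top:top+height, left:left+width] using four lookups."""
    bottom, right = top + height, left + width
    return ii[bottom, right] - ii[top, right] - ii[bottom, left] + ii[top, left]
```

The cost of `box_sum` does not depend on the window size, so shifting or rescaling the window only changes the four lookup coordinates.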
2.5. Classification based Object Tracking
Tracking can be treated as a learning/classification problem (see [47] for a
recent survey) such that images that belong to the target can be discriminated
from images that belong to the background [48, 49]. In this framework, the target
object and background are projected onto a discriminative feature space and a
classifier is trained to segment target and background pixels in the subsequent
frames. The new location of the classified target pixels reveals the motion of the
target object. This approach is advantageous over other tracking methods
mainly because:
i. Interest point based tracking algorithms suffer when the target is so small
[50, 51] that it is hard to compute sufficiently many strong features.
ii. The classifier can be updated during tracking for plasticity, as in template
matching. However, unlike other approaches, the appearance update is able to
keep multiple instances of appearance at the same time in an effective and
memory efficient way, i.e., in the classifier. This makes appearance adaptation smooth
and natural.
iii. The representation of image regions is in a discriminative high
dimensional space, where the tracked object can be vividly represented.
Furthermore, using a classification framework for tracking unifies the target
detection/classification and tracking computations. The same hardware/software
resources can easily be tailored to scan and detect a specific target in the whole
image (vehicle, person etc.) or classify a region of interest (possibly marked via
motion detection) using a pre-trained classifier, and this commonalization is the
main proposition of this study.
2.6. Contributions of the Study
In this study:
1. We generalize the Local Binary Patterns (LBP) framework using a neural
network perspective.
2. Using the novel image representation, we introduce a real-time
computer vision core, which is capable of detecting a specific object in an
image, classifying an image region provided by another algorithm (e.g. motion
detection), and tracking a specified object in a video.
3. The classification capability of the algorithm is utilized for real-time object
detection purposes.
4. A classification based object tracking algorithm is introduced that uses
the neural network framework, whose tracking performance on standard datasets
is shown to be comparable to state-of-the-art techniques.
5. Two new aerial thermal image datasets are presented: an object tracking
video set from a medium-high altitude UAV, and a ship classification image set
from a low altitude UAV (see footnote 2).
6. An FPGA design framework for a single layer neural network is developed
that can be extended to multi-layer architectures.
2 Please contact the corresponding author for access to these datasets.
[Figure 2 panels: (a) learned receptive field dictionary; (b) image with patch xij of size w sampled with stride s; (c) image representation with K maps of responses yijk]
Figure 2: (a) Dictionary of K (200) visual receptive fields, size w (6), learned through k-means.
Receptive fields resemble oriented Gabor filters. (b) Image patches used for computation of
hidden layer activities. Patches have size w, and sampling period is s (also called stride) which
determines the number of hidden layer neurons. (c) Image representation in the hidden neural
network layer, which has K distinct maps corresponding to different receptive fields. yijk is
the kth receptive field response at location (i, j). The representation is sparse, illustrated by
different sized black squares spread inside the first map.
3. Methods
A recent neural network framework [7] shows state-of-the-art classification
performance despite its simplicity in the unsupervised pre-training (k-means) and
supervised learning (linear SVM) phases. It is also suitable for parallel
implementation in embedded hardware, i.e. an FPGA chip, after simplifications of the
receptive fields and neural responses.
In the single layer neural algorithm [7], hidden layer activities are computed
densely at every pixel location (stride parameter s is equal to 1) in an image,
using the pre-computed receptive field dictionary (Fig. 2b). For a patch $x_{ij}$ at
pixel location (i, j), hidden layer activities $y_{ij}$ are computed using the Euclidean
distance function f such that
\[ y_{ijk} = f(x_{ij}, D_k), \quad \forall (i, j),\ k \in \{1, \dots, K\}, \]
where $D_k$ is the kth receptive field in the dictionary, and $y_{ijk}$ is the hidden
neural activity at location (i, j) for the kth receptive field type (Fig. 2c). The
result is a high dimensional representation of the image, ready to be utilized in
classification tasks. For an image region, spatial pooling is applied to arrive
at a feature vector representation of the region of interest with reduced
dimension. In the algorithm, the output layer of the neural network is replaced with
a linear support vector machine (SVM), to speed up both the training and test
phases. Our analyses have shown that current FPGA circuits are insufficient
for implementing this successful algorithm, hence simplifications are necessary.
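As an illustration of the dense computation described above, the following sketch evaluates the Euclidean distance between every w × w patch and every receptive field in the dictionary. The function names, the row-major flattening of patches and the array layout are assumptions made for the example; the original algorithm [7] may include further normalization steps that are not shown here.

```python
import numpy as np

def extract_patches(image, w, s=1):
    """Return flattened w-by-w patches as an array of shape (rows, cols, w*w)."""
    H, W = image.shape
    rows = (H - w) // s + 1
    cols = (W - w) // s + 1
    patches = np.empty((rows, cols, w * w), dtype=np.float64)
    for i in range(rows):
        for j in range(cols):
            patches[i, j] = image[i * s:i * s + w, j * s:j * s + w].ravel()
    return patches

def euclidean_activities(image, dictionary, w, s=1):
    """Activities y[i, j, k]: distance between patch (i, j) and receptive field k.

    dictionary has shape (K, w*w); one flattened receptive field per row.
    """
    patches = extract_patches(image, w, s)
    diff = patches[:, :, None, :] - dictionary[None, None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```

Each output element needs w × w real valued subtractions and multiplications, repeated for K receptive fields at every pixel, which gives a sense of why the real valued version is demanding for embedded hardware.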
3.1. Binary Receptive Fields and Neural Responses
In this study, we propose to use binary receptive fields and binary response
neural units in the hidden layers of a single layer neural network. The binary response
neural unit is commonly used in artificial neural network studies, rooted in the
McCulloch and Pitts model [52], whereas binary receptive fields are uncommon
in visual processing. Analyses of the FPGA resources of available chips suggest
that this simplification is necessary for the FPGA implementation. This type
of processing resembles LBP analyses of images. LBPs are nonlinear image
filters that are equivalent to receptive field processing in neural networks [53].
There is an essential difference between the neural network and LBP approaches:
distributed representation in neural networks vs. binary coding based local
representation in LBPs. An image patch (say 4 × 4) is assigned to a specific pattern
(out of 2^16 different patterns) in LBP methods, whereas the patch is represented
in a distributed manner in the hidden layer activities (K units) in neural
network methods. The connection between LBPs and neural network RFs was
not established previously. There are several advantages of this novel
perspective:
1. Binary patterns are represented in a distributed manner as opposed to a local
representation, which provides robustness to noise and greater representational
capacity [54].
2. Binary receptive fields in a neural network enable convoluted binary
pattern analyses that capture higher level image statistics via the usage of
multi-layer network architectures as well as recurrent processing. A candidate
architecture is proposed in [55] and a fast recurrent processing framework is given in
[56].
Therefore, using binary receptive fields and units in a neural network
generalizes LBP based pattern analysis approaches and enables nonlinear,
multi-layered, distributed and recurrent binary pattern analyses. Additionally, this
approach allows for a real-time implementation of the neural network, which is otherwise
not possible.
Figure 3: Binary receptive fields (K = 100) learned through k-means clustering. The receptive
fields capture edge, corner, line segment and some other complex patterns in images
Binary receptive fields are learned through k-means clustering on a
sufficiently large number (on the order of millions) of randomly cropped image patches
(Fig. 3). The image patches are first binarized by subtracting the mean from
the pixel values and applying a thresholding function, then k-means is performed
on the binarized patches. The cluster centers generated by k-means are also
binarized to obtain the binary receptive fields. The binary receptive fields capture
edge, corner, line segment and some other complex patterns in images.
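The learning procedure above can be sketched as follows, under several stated assumptions: the threshold after mean subtraction is taken to be zero, plain Lloyd iterations stand in for whichever k-means variant was actually used, and the cluster centers are binarized with the same mean-and-threshold rule as the patches. The function names `binarize` and `learn_binary_rfs`, the iteration count and the random initialization are hypothetical.

```python
import numpy as np

def binarize(vectors):
    """Subtract the per-vector mean and threshold at zero (assumed threshold)."""
    v = np.asarray(vectors, dtype=np.float64)
    return (v - v.mean(axis=-1, keepdims=True) > 0).astype(np.uint8)

def learn_binary_rfs(patches, K, iters=10, seed=0):
    """patches: (n, w*w) flattened image patches. Returns (K, w*w) binary RFs."""
    rng = np.random.default_rng(seed)
    X = binarize(patches).astype(np.float64)
    # Initialize the centers with K distinct binarized patches.
    centers = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(iters):
        # Squared Euclidean distance between every patch and every center.
        d2 = ((X ** 2).sum(axis=1)[:, None]
              - 2.0 * (X @ centers.T)
              + (centers ** 2).sum(axis=1)[None, :])
        labels = d2.argmin(axis=1)
        for k in range(K):
            members = X[labels == k]
            if len(members):
                centers[k] = members.mean(axis=0)
    # The real valued centers are binarized again to give binary receptive fields.
    return binarize(centers)
```

For the patch counts mentioned in the text (millions), a mini-batch or library implementation of k-means would be a more practical choice than this dense distance matrix.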
The hidden layer activities $y_{ij}$ at pixel location (i, j) are computed from
the binarized image patch $x_{ij}$ of size w × w (see Fig. 2) and the binary receptive
fields $D_k$ (Fig. 3), using the Hamming distance h:
\[ y_{ijk} = h(x_{ij}, D_k), \quad \forall (i, j),\ k \in \{1, \dots, K\}, \]
where the Hamming distance function is defined as follows, with the sum running
over the $w^2$ binary elements of the patch:
\[ h(a, b) = \sum_{n=1}^{w^2} |a_n - b_n|. \]
We also binarize the hidden layer activities with a sparsity enforcing soft
assignment (applied elementwise over k):
\[ y_{ij} = \max[0, \operatorname{sign}(\mu(y_{ij}) - y_{ij})], \]
where µ is the corresponding averaging function over dimension k:
\[ \mu(y_{ij}) = \frac{1}{K} \sum_{k=1}^{K} y_{ijk}. \]
Binarization of the receptive fields and image patches allows the Hamming distance
to be used instead of the Euclidean distance for computing the activation of the hidden layer
neurons, which reduces the computational complexity. Additionally, the hidden
layer activities are further binarized to reduce the memory requirements. The
hidden layer activity yij at pixel location (i, j) is a binary feature vector of size
K. The pattern of an image patch is represented in a distributed manner in the network
by this sparse binary feature vector. The sparsification sets roughly half of
the hidden neurons to zero, and represents a simple form of competition.
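The two steps above, Hamming distance to each binary receptive field followed by thresholding against the mean distance, can be sketched for a single patch as follows. The function names and the array layout are assumptions for illustration; the hardware implementation operates on bit vectors rather than NumPy arrays.

```python
import numpy as np

def hamming_activities(x, D):
    """x: (w*w,) binary patch; D: (K, w*w) binary receptive fields.

    Returns the K Hamming distances between the patch and each receptive field.
    """
    return np.count_nonzero(x[None, :] != D, axis=1)

def sparsify(y):
    """Binary response: 1 where the distance is below the mean over k, else 0."""
    return np.maximum(0, np.sign(y.mean() - y)).astype(np.uint8)

# Small worked example with a 4-element patch and K = 3 receptive fields.
x = np.array([1, 0, 1, 0], dtype=np.uint8)
D = np.array([[1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 0, 0]], dtype=np.uint8)
y = hamming_activities(x, D)   # distances 0, 4 and 1
y_bin = sparsify(y)            # mean is 5/3, so units 0 and 2 are active
```

Units whose receptive field is closer to the patch than average stay active, which is the simple form of competition described in the text.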
The binary hidden layer activities $y_{ij}$ are computed for all the pixels (i, j) in
the acquired image (Fig. 4). Classification of an image region requires feature
computation. To compute a feature representation of a specific image region,
spatial pooling is applied for dimensionality reduction. Suppose the rectangular
image region S (size M × N) is centered at (u, v) (Fig. 4), then the feature
vector v of this region is:
\[ v = Q(y_{ij}), \quad \forall (i, j) \in S, \]
where Q is the spatial pooling function that sums the activities of the hidden layer
over each quadrant ($Q_1$, $Q_2$, $Q_3$ and $Q_4$), and concatenates the sums to obtain
the feature vector v of size 4K:
\[ Q_1 = \sum_{i=1}^{M/2} \sum_{j=1}^{N/2} y_{ij}, \qquad Q_2 = \sum_{i=M/2+1}^{M} \sum_{j=1}^{N/2} y_{ij}, \]
[Figure 4 diagram: hidden layer representation of size X × Y × K with a region of interest S of size M × N centered at (u, v), split into quadrants Q1 to Q4]
Figure 4: The spatial pooling operation applied on a region of interest S of size M × N in
the image. The hidden layer representation is computed for the whole image of size X × Y .
The hidden layer activities are summed over each quadrant (Q1 to Q4), then obtained vectors
(the sums) are concatenated to obtain a feature vector representation of the region.
\[ Q_3 = \sum_{i=1}^{M/2} \sum_{j=N/2+1}^{N} y_{ij}, \qquad Q_4 = \sum_{i=M/2+1}^{M} \sum_{j=N/2+1}^{N} y_{ij}, \]
\[ Q = [Q_1; Q_2; Q_3; Q_4]. \]
A linear SVM classifier (L2 norm) is trained (offline for object detection, online
for tracking) using the feature vectors v of images. In object detection tasks, a
multi-scale sliding window based search is performed. Images are downsampled for
search at larger scales to speed up processing.
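The quadrant pooling defined by Q1 to Q4 can be sketched as below. For simplicity the region is addressed by its top-left corner rather than its center (u, v); this addressing, the function name `quadrant_pool` and the (rows, columns, K) array layout are assumptions of the example.

```python
import numpy as np

def quadrant_pool(Y, top, left, M, N):
    """Sum binary activities over the four quadrants of an M-by-N region.

    Y: (X, Y, K) hidden layer activities for the whole image.
    Returns the concatenated feature vector [Q1; Q2; Q3; Q4] of size 4K.
    """
    S = Y[top:top + M, left:left + N].astype(np.int64)
    hm, hn = M // 2, N // 2
    q1 = S[:hm, :hn].sum(axis=(0, 1))   # i in first half,  j in first half
    q2 = S[hm:, :hn].sum(axis=(0, 1))   # i in second half, j in first half
    q3 = S[:hm, hn:].sum(axis=(0, 1))   # i in first half,  j in second half
    q4 = S[hm:, hn:].sum(axis=(0, 1))   # i in second half, j in second half
    return np.concatenate([q1, q2, q3, q4])
```

Since the activities are binary, each entry of the feature vector is a count bounded by the quadrant area, so it fits in a small fixed-width integer.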
4. Experiments
In the experiments we examined:
1. the drop in object classification and detection performance due to algorithmic
simplification,
2. the performance difference between the single layer binary neural network
algorithm and local binary patterns,
3. usage of the framework for different applications, details of which are
given in a separate supplementary document.
4.1. Classification with Binary Receptive Fields and Responses
The receptive fields and hidden layer neural responses are binarized to
reduce the computational and memory costs of the algorithm (see footnote 3). This simplification
enables the implementation of a real-time classification algorithm on a
currently available FPGA chip. We observe that the classification performance of
the simplified algorithm does not change dramatically (Fig. 5), even though
the computational complexity and memory load are greatly reduced. In Fig.
5, we show the classification accuracy of the original, binary receptive field and
binary receptive field + binary neural response algorithms on the CIFAR10 dataset
[57]. We obtain a significant computational improvement, and hence real-time
operation, at the cost of only a 10% accuracy drop (full binarization, that is Binary
2 in Fig. 5). This moderate drop in classification performance
can be explained by an observation made in local binary pattern studies
[26]: signed differences between pixels convey most of the texture information
in images.
[26]: signed differences between pixels convey most of the texture information
in the images.
100 200 40045
50
55
60
65
70
Number of RFs (K)
Acc
ura
cy (
%)
RealBinary 1Binary 2
Figure 5: Classification performance on CIFAR 10 dataset as a function of cluster sizes, for
real neurons (original algorithm) and for the two types of binarization in the network.
3 The parameters used in all experiments are given in the Appendix.
Figure 6: Classification performance comparison of LBP and Binary RF Neural Network
approaches on the CIFAR 10 and Flickr Material Database (FMD) datasets. The
numbers in parentheses after the dataset names are the number of spatial blocks in the image
used for summation (histogram calculation for LBP).
4.2. Comparison with Local Binary Patterns
We propose that the utilization of binary receptive fields introduces binary
pattern analyses using neural networks. This perspective provides a generalization
of LBP, enabling distributed, multi-layered, recurrent computation, which is
non-existent in LBP studies. In this section we compare the performance of the
single layer neural network with the rotation invariant LBP descriptor [14, 58], and
show that they are comparable. The size of the LBP filter is 3 × 3, there are
a total of 58 filters, and the feature vector is formed by taking a histogram over
four image quadrants, similar to the neural network algorithm. The neural
network receptive field size and parameter K are chosen accordingly; the receptive field
is binary but the neural response is real. Thus the comparison between the
algorithms is completely fair. The difference mainly lies in the sparse distributed vs.
local representation of the binary pattern, and the rotation invariance property of
LBPs, which does not exist in neural network hidden layer activities. The
classification performance of the two algorithms is compared on the CIFAR 10,
Flickr Material Database [59] (see footnote 4) and PASCAL VOC datasets (person subset).
4 Texture classification is the strong suit of LBP algorithms, and the Flickr Material Database
is considered one of the hardest classification datasets. 75 percent of the data is used for
The results are given in Figure 6. It is observed that LBP and the binary receptive
field single layer neural network give similar performance on the three datasets.
However, the neural network algorithm has a vast amount of room for improvement,
such as the addition of layers, the introduction of recurrent connections etc. The
difference in potential for improvement is crucial for the comparison of the two
frameworks. Overall, classification performance is close to the state of the art
for the FMD and VOC datasets, but there is a larger gap for the CIFAR 10 dataset, and
this underfitting can be attributed to the use of very few receptive
fields (i.e. 58) compared to other algorithms.
4.3. Applications of the Framework
We applied the framework for classification, detection and tracking purposes.
The details of the experiments are given in a supplementary material document.
We examine the tracking performance on standard datasets and report
state-of-the-art results. Ship classification, detection from aerial images and thermal
tracking applications are specifically designed for the RECONSURVE Project (ITEA 2),
and the datasets are provided to other researchers.
5. FPGA Implementation
5.1. IP Core
We have completed the preliminary FPGA design (see footnote 5) for the proposed
algorithm. The implementation in FPGA is realized as an IP core (Fig. 7). Due to
the inherent parallelism:
1. Feature volume (X × Y ×D) (Fig. 2c) can be constructed for the whole
video frame.
training. Image resolution is 128 by 128.
5 The amount of detail we provide in this paper is enough to have an FPGA implementation.
For our case, VHDL coding and more detailed design will be performed after the circuit board
is finalized.
[Figure 7 diagram: IP core consisting of a Memory Interface and an Image Analyzer with Feature Extraction, Feature Summation and Classification stages; inputs: video, feature dictionary, class matrix, feature calculation requests; outputs: feature vectors, class labels]
Figure 7: FPGA IP core structure. The memory interface handles data read/write operations
with the external memories. There are three main stages of the Image Analyzer block. The feature
dictionary and classification matrix are pre-uploaded into the FPGA memory. Video frames
and feature calculation requests are constantly fed into the core during operation.
2. Feature vector v can be calculated for many image regions (i.e. sliding
windows for object detection) simultaneously.
3. The calculated feature vectors can be classified in parallel.
The parallelism supplied by the FPGA core is exploited for search based
object detection and tracking applications. After computing the feature volume
(Fig. 2c), the FPGA receives coordinates of image regions from a CPU; these are
queries that need to be classified according to the pre-uploaded classifier
(matrix). The FPGA outputs the class labels of the regions of interest (e.g. sliding
window object detection queries or object tracking queries), and also the
feature vectors of the query regions. The core consists of two main sub-blocks:
Image Analyzer and Memory Interface. The Memory Interface is responsible for
data transfer between the Analyzer and external memories. The Image Analyzer block
consists of three sub-blocks (Fig. 7): Feature Extraction, Feature Summation
and Classification. The Image Analyzer block receives four types of inputs: video,
feature dictionary, class matrix and feature calculation requests. The feature
dictionary and classification matrix are pre-uploaded into the FPGA memory. Video
frames and feature calculation requests are constantly fed into the core during
operation. The resolution of the video is X (row) by Y (column) and the frame
rate is considered to be higher than 10 Hz (see footnote 6).
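The classification stage applied to a batch of query regions can be pictured as a single matrix product between the feature vectors and the pre-uploaded class matrix. The sketch below assumes a one-vs-rest linear classifier with one weight row per class and an optional bias vector; the function name `classify_regions`, the bias term and the argmax decision rule are assumptions, since the text only states that the classifier is stored as a matrix.

```python
import numpy as np

def classify_regions(V, W, b=None):
    """Label a batch of regions with a linear classifier.

    V: (R, 4K) feature vectors, one row per query region.
    W: (C, 4K) class matrix, one row of weights per class (assumed layout).
    b: (C,) optional biases (assumed).
    Returns an array of R class labels.
    """
    scores = V @ W.T
    if b is not None:
        scores = scores + b
    # Every row of scores is independent, so regions can be scored in parallel.
    return scores.argmax(axis=1)
```

Because the rows of V are scored independently, the same computation maps naturally onto parallel hardware units, one per query region or per class.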
[Figure 8 diagram: dedicated hardware with a Stratix V (5SGXA7) FPGA, 64-bit DDR3 SDRAM at 600 MHz, and a CPU connected over 2.5 Gbps PCIe]
Figure 8: The algorithm is implemented on dedicated hardware using a Stratix V (5SGXA7)
ALTERA(TM) FPGA chip. It consists of memory, a CPU and the FPGA chip, providing fast
communication among the components.
5.2. Hardware Usage and Timing
The preliminary FPGA design of the algorithm is done on dedicated hardware based on
a Stratix V (5SGXA7) ALTERA(TM) chip (Fig. 8). Table 1 shows the FPGA and
hardware resource usage (see footnote 7) of the object detection implementation for 640 × 512
video at 50 Hz, an RF size of 4 pixels, and 20,000 sliding windows of 64 × 64 pixels,
which corresponds to at least 5 different scales (1.3, 1.2, 1, 0.8, 0.6)
of exhaustive object search with an 8 pixel shift between windows. Therefore,
an effective and fast (> 10 Hz) embedded object detection framework can be
constructed using a fraction of an FPGA chip that is to be deployed on UAVs.
Notice that object detection and object tracking (> 20 objects) can be executed
simultaneously using less than 20% of the FPGA resources, the rest of which
can be utilized for other tasks such as salient motion detection, sparse feature
6 See supplementary materials for the details.
7 See supplementary materials for the detailed timing and hardware usage analysis.
extraction and visual odometry. Saliency of the detected motion/change [60]345
can be determined using the same classification framework. Moving pixels can
be analyzed for saliency (pre-determined object class detection) in a multi-scale
manner via appropriate CPU-FPGA communication, for less than 10% of the
FPGA resources. Sparse feature extraction requires further computation on hid-
den layer activities of individual pixels, however the computational complexity350
of this additional stage is predicted to be low.
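The window budget quoted in Section 5.2 can be sanity-checked with a few lines of arithmetic. The sketch below is only illustrative: it assumes that each scale is realized by resizing the 640 × 512 frame and sweeping a fixed 64 × 64 window with an 8 pixel step, and that rescaled sizes are rounded to the nearest pixel. The function names are hypothetical and do not come from the FPGA design. Under these assumptions the five scales give about 21,000 windows, the same order as the 20,000 windows stated above.

```python
def windows_per_scale(width, height, scale, win=64, step=8):
    """Count win x win windows, shifted by `step` pixels, on the frame rescaled by `scale`.

    Assumption: the frame is resized and the window size stays fixed.
    """
    w = round(width * scale)
    h = round(height * scale)
    if w < win or h < win:
        return 0
    return ((w - win) // step + 1) * ((h - win) // step + 1)


def total_windows(width=640, height=512, scales=(1.3, 1.2, 1.0, 0.8, 0.6)):
    """Sum the window counts over all scales of the exhaustive search."""
    return sum(windows_per_scale(width, height, s) for s in scales)
```

Calling `total_windows()` with the default arguments reproduces the setting described in the text; changing `step` or `scales` shows how quickly the window count, and therefore the classification load, grows.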
The timing and delay analysis shows that the system operates
in real time without any frame delay. More importantly, for less than 30%
of the FPGA resources, object detection, multiple object tracking, salient motion
detection and scene recognition can be executed with satisfactory accuracy
(see Fig. 5). Sparse feature extraction still needs to be worked out and analyzed to
understand its resource usage.
It should be noted that the overall performance of the FPGA friendly algorithm
is low compared to the state-of-the-art [61, 62] performance on the CIFAR 10
dataset (see 5.2). However, the computational load of these algorithms is very
high due to the sparsity enforcing regularization techniques that are utilized.
Real-time operation of these algorithms on embedded hardware does not seem
realistic. Our analyses show that, even with one of the latest and most powerful
chips (Stratix V), it is not possible to use real valued neurons for the high
performing single layer convolutional network [7]. There are two major reasons
for this restriction: 1. The number of multipliers needed for the hidden layer
neuron activity computation for a real valued neuron cannot be supplied by
the FPGA, whereas the bitwise OR that we utilize is an abundant resource. 2. External
memory bandwidth is almost fully occupied when a real valued neuron is used,
whereas binarization decreases the usage to one eighth. Additionally,
a high end FPGA such as Stratix V is not an economical choice in
commercial products, so the restrictions become even harsher.
Nevertheless, the performance of our system can be improved by using a
larger number of receptive fields and hence more FPGA resources. However, it should
be noted that FPGA resources will most likely be shared with other processing
threads in an embedded vision system. Therefore, the performance-resource
trade-off should be considered when choosing the number of receptive fields.
Finally, even though there are other parallel processing options such as ASIC
or DSP, we are unable to compare our design for accuracy and speed with these
options, mainly because we could not find studies that reported performance on
challenging datasets such as CIFAR 10. GPU-CPU performance is reported
in [61], but that platform is not embedded hardware suitable for robotic applications.
In summary, a performance-wise comparison of our FPGA design with existing
hardware architectures does not seem possible; however, we will compare the
performance of our algorithms with the literature on algorithm design.
Table 1: FPGA and Hardware Resource Usage Summary
Property                 Used          Available      Occupied (%)
Logic                    30,000        622,000        5
Internal RAM             250 × M20K    2560 × M20K    10
Multiplier (18 × 18)     40            512            8
Ext. Memory Bandw.       6 Gbps        75 Gbps        8
CPU Interface Bandw.     0.1 Gbps      2.5 Gbps       4
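The occupancy column of Table 1 follows directly from the used and available columns. As a quick consistency check, the sketch below recomputes each percentage and flags any row that deviates from the reported, rounded value by more than half a percentage point. The dictionary keys and function names are illustrative, and the bandwidth figures are taken as plain numbers in the units printed in the table.

```python
# Rows of Table 1 as (used, available, reported occupied %).
TABLE_1 = {
    "Logic": (30_000, 622_000, 5),
    "Internal RAM (M20K blocks)": (250, 2560, 10),
    "Multiplier (18 x 18)": (40, 512, 8),
    "Ext. memory bandwidth": (6.0, 75.0, 8),
    "CPU interface bandwidth": (0.1, 2.5, 4),
}


def occupancy_percent(used, available):
    """Occupied share of a resource, in percent."""
    return 100.0 * used / available


def inconsistent_rows(table, tolerance=0.5):
    """Return the names of rows whose recomputed occupancy is off by more than `tolerance`."""
    bad = []
    for name, (used, available, reported) in table.items():
        if abs(occupancy_percent(used, available) - reported) > tolerance:
            bad.append(name)
    return bad
```

With the values above, `inconsistent_rows(TABLE_1)` returns an empty list, so the reported percentages agree with the raw counts to within rounding.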
6. Discussion and Future Work
In this paper, we provided an embedded visual processing core that is able
to serve as the main machinery for various applications a robotic platform needs
for artificial vision. Different functions are unified by a common
analysis stage, as is done in the primary visual cortex. The overcomplete and
sparse representation of small image patches provided by neural network receptive
fields allows for an efficient and discriminative description of larger regions
of interest. We show that classification, detection and tracking can be done on
the same core in real time, using a fraction of the FPGA chip resources. We further
propose several additional functionalities on the same core as future work,
such as sparse interest point extraction, salient motion detection and scene
recognition. With this set of capabilities in its arsenal, artificial vision is expected to
perform the necessary fundamental operations on the image in real time on a
low power chip, paving the way for more complex inferences, such as geometric
computations (3D reconstruction, stitching, visual odometry) and cross-modal
information fusion (scene-object recognition, segmentation-recognition, scene
geometry-texture/reflectance, dorsal-ventral [63]).
Furthermore, we show that the binary receptive field approach we propose
generalizes the successful Local Binary Pattern (LBP) features, and enables
nonlinear, multi-layered, distributed and recurrent binary pattern analyses. This
novel perspective on binary patterns needs further investigation.
We introduced a classification based tracking algorithm and its FPGA
implementation, which shows good performance on standard datasets. There are
several advantages of this approach over the others:
1. In classification based tracking, a high level object search can be utilized
in case of occlusion if the tracked object can be labeled as an a priori known
class (tree) (vehicle → truck → brand or animal → dog → terrier). Therefore,
in this tracking framework it is possible to arrive at a high level description of
the tracked object and use it to recover tracking in case of failures (see online
update in multi-task tracking [64]).
2. This tracking framework can easily be tailored for active visual learning
[65], in which a learning system is enabled to sample multiple views
of the same object via tracking.
Why did we focus on a specific single layer neural network algorithm [7]? It
is simple to train and successful in object classification, hence it is a good
choice by Occam’s razor. However, it is still a starting point, and there are
several ways to improve it. We plan to enhance the capability with sparse
feature extraction, scene recognition and salient motion detection add-ons, none
of which is predicted to occupy much valuable embedded space or processing
time. Some other future directions are:
1. Having more convolutions: the single layer architecture of our system can be
extended to have multiple convolutional layers [22, 15, 38]. In fact, the true
power of neural network approaches is not tapped if only a single layer network
is used, even though it is shown to perform very well in classification [7].
However, we propose that the raison d’etre of convolutional layers in the human
visual cortex is the diversity of tasks that need to be performed for ecological
vision. Some of these tasks require precise localization and low level feature
analysis (e.g. simultaneous classification and pose estimation), whereas others
require generalization power and higher order statistical dependence (e.g. emotion
recognition). Therefore the “output layer” is not always the last layer of
the network, but should vary with the requirements of the task. For the
first class of tasks, lower layer network activities are appropriate, whereas the second
class of tasks requires the activities of invariant detectors residing in higher layers.
Some tasks might even require a much more complicated cascade, one layer after
another. Therefore, we envision a demultiplexing approach that probes and
routes the activities of a layer for a specific task. Theoretical and empirical
exploration of this matching between layers and tasks is important for building
a multi-purpose neural network architecture. It is also essential to note that
the holy grail of this kind of perspective is recurrent processing in which high
level hypotheses are fed back into lower layers for consistency in a theoretically
tractable manner. The authors are currently working on recurrent neural
networks that are suited for the FPGA design paradigm we adopted.
2. Using a volume of images to obtain motion sensitive receptive fields
and video processing (activity recognition, motion detection, classification based
egomotion estimation etc.). This second block will serve as the motion processing
pathway of our system, and the famous cortical duality (Magno-Parvo,
Dorsal-Ventral [66]) will be completed, ready to enrich the system with
inter-pathway interactions [63].
Good features to track and good descriptors to match are two fundamental
problems in computer vision [67, 68]. A sparse feature detection, description and
matching framework enables fast and accurate algorithms for several applications
such as image alignment, 3D reconstruction, visual odometry and object
recognition. There are efficient FPGA implementations of successful features
[30]; however, an overcomplete and discriminative neural network representation
should be sufficient to define keypoints and descriptors. There is ongoing
research on cognitively inspired feature extraction methods [69]. Indeed, the
authors are conducting ongoing research on performing keypoint detection and
description on the hidden layer representation of images (Fig. 1 and 2c), which
will provide the capability to extract sparse features to be used for homography
estimation as well as visual odometry. We predict that the detection and
extraction computations will occupy a small amount of additional space and time.
7. Conclusion
We predict that in the near future, FPGA implementation of neural network
algorithms will enable high performance, multi-functional and modular processing
architectures. In this undertaking, it is important to seek the most compact
and powerful neural representation of the visual data, as well as its versatility
in several different tasks. Moreover, the dual pathway theory of the visual cortex
should be considered in designing artificial systems, since cortical architecture
has been an inspiration to many successful algorithms.
Appendix A. Parameters
The common parameters in all experiments are as follows. The patch (also
called receptive field) size w is set to 4 pixels. The number of random patches
extracted for unsupervised receptive field learning is one million. The maximum
number of iterations in SVM training is one thousand.
In crossroad detection experiments, the number of RFs (K) is 800. The scale space
consists of 4 images with scale factors 1.2x, 1x, 0.8x and 0.6x. The sliding
window sampling step is 8 pixels at all scales. The dataset images are all resized
to 32 × 32 pixels.
The ship classification experiments use 100 RFs with w set to 4, and the dataset
images are all resized to 64 × 64 pixels.
In object tracking experiments K is 100 and w is 4. All the target and background
images are resized to 16 × 16 pixels. Plus/minus 2 pixel shifted versions
of the target are also considered as target, summing to a total of 25 images for
the target class. The background images are cropped from the background region
according to a sampling pattern defined by the following shift vector (plus and
minus): [3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 22 24 26 28 30 32 35 38 41
45]. Therefore, there are 3025 images for the background class. The maximum number
of iterations in SVM training is 10; however, it is 100 in the first frame. If the
projection of the newly trained hyperplane on the previous hyperplane is less
than 0.9, the training is rejected. During detection, the target is searched in a 30
pixel neighborhood of the previously known location. A very simple filtering is
applied to tighten target matching: if the projection of the feature vector onto
the hyperplane is less than a threshold (0.1), the detection is rejected. Furthermore,
2 target detections are required in order to resume tracking; otherwise track lost
is declared.
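The tracking parameters above amount to a handful of simple gates. The following sketch illustrates them under stated assumptions: the "projection" of one hyperplane on another is read as the cosine between the two SVM weight vectors, the detection score is taken as a linear score w·x + b, and all function names are hypothetical rather than taken from the implementation. The background sampling pattern is not reproduced here because its exact construction is not spelled out in the text.

```python
import math


def target_shifts(max_shift=2):
    """All (dx, dy) offsets within +/- max_shift pixels, including the unshifted target."""
    offsets = range(-max_shift, max_shift + 1)
    return [(dx, dy) for dy in offsets for dx in offsets]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def accept_retraining(w_prev, w_new, min_projection=0.9):
    """Keep the new classifier only if its hyperplane stays close to the previous one."""
    return cosine(w_prev, w_new) >= min_projection


def accept_detection(w, bias, feature, threshold=0.1):
    """Accept a window as target only if its linear score reaches the threshold."""
    score = sum(a * b for a, b in zip(w, feature)) + bias
    return score >= threshold


def tracking_state(num_detections, min_detections=2):
    """Declare track lost when fewer than `min_detections` windows are accepted."""
    return "tracking" if num_detections >= min_detections else "lost"
```

For example, `len(target_shifts())` gives the 25 target images mentioned above, and `accept_retraining` rejects an update whose weight vector has rotated by more than roughly 26 degrees (cosine below 0.9).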
In experiments where we compare LBP and neural network performances,
the parameters are given in the text.
Acknowledgment
This research is supported by The Scientific and Technological Research
Council of Turkey (TUBITAK) Career Grant, No: 114E554. We would like
to thank the RECONSURVE project (ITEA 2) for providing the ship dataset, and
also ASELSAN Inc. for providing the thermal tracking and crossroad detection
datasets.
References
[1] A. Krizhevsky, I. Sutskever, G. Hinton, Imagenet classification with deep
convolutional neural networks, in: Advances in Neural Information Pro-
cessing Systems 25, 2012, pp. 1106–1114.
[2] N. Srivastava, R. Salakhutdinov, Multimodal learning with deep boltzmann
machines, in: Advances in Neural Information Processing Systems 25, 2012,
pp. 2231–2239.
[3] G. E. Hinton, S. Osindero, Y.-W. Teh, A fast learning algorithm for deep
belief nets, Neural computation 18 (7) (2006) 1527–1554.
[4] Y. LeCun, L. Bottou, Y. Bengio, P. Haffner, Gradient-based learning ap-
plied to document recognition, Proceedings of the IEEE 86 (11) (1998)
2278–2324.
[5] M. Ranzato, F. J. Huang, Y.-L. Boureau, Y. Lecun, Unsupervised learning
of invariant feature hierarchies with applications to object recognition, in:
Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Confer-
ence on, IEEE, 2007, pp. 1–8.
[6] B. Scholkopf, A. J. Smola, Learning with kernels, The MIT Press, 2002.
[7] A. Coates, A. Y. Ng, H. Lee, An analysis of single-layer networks in unsu-
pervised feature learning, in: International Conference on Artificial Intelli-
gence and Statistics, 2011, pp. 215–223.
[8] T. S. Lee, D. Mumford, R. Romero, V. A. Lamme, The role of the primary
visual cortex in higher level vision, Vision research 38 (15) (1998) 2429–2454.
[9] J. Misra, I. Saha, Artificial neural networks in hardware: A survey of two
decades of progress, Neurocomputing 74 (1) (2010) 239–255.
[10] G. Indiveri, E. Chicca, R. J. Douglas, Artificial cognitive systems: from vlsi
networks of spiking neurons to neuromorphic cognition, Cognitive Computation
1 (2) (2009) 119–127.
[11] Y.-L. Boureau, F. Bach, Y. LeCun, J. Ponce, Learning mid-level features
for recognition, in: Computer Vision and Pattern Recognition (CVPR),
2010 IEEE Conference on, IEEE, 2010, pp. 2559–2566.
[12] N. W. Tay, C. K. Loo, M. Perus, Face recognition with quantum associative
networks using overcomplete gabor wavelet, Cognitive Computation 2 (4)
(2010) 297–302.
[13] M. Tomasi, M. Vanegas, F. Barranco, J. Díaz, E. Ros, Massive parallel-
hardware architecture for multiscale stereo, optical flow and image-
structure computation, Circuits and Systems for Video Technology, IEEE
Transactions on 22 (2) (2012) 282–294.
[14] T. Ojala, M. Pietikainen, T. Maenpaa, Multiresolution gray-scale and ro-
tation invariant texture classification with local binary patterns, Pattern
Analysis and Machine Intelligence, IEEE Transactions on 24 (7) (2002)
971–987.
[15] T. Serre, L. Wolf, S. Bileschi, M. Riesenhuber, T. Poggio, Robust object
recognition with cortex-like mechanisms, Pattern Analysis and Machine
Intelligence, IEEE Transactions on 29 (3) (2007) 411–426.
[16] L. Fei-Fei, P. Perona, A bayesian hierarchical model for learning natu-
ral scene categories, in: Computer Vision and Pattern Recognition, 2005.
CVPR 2005. IEEE Computer Society Conference on, Vol. 2, IEEE, 2005,
pp. 524–531.
[17] M. Jiu, C. Wolf, C. Garcia, A. Baskurt, Supervised learning and codebook
optimization for bag-of-words models, Cognitive Computation 4 (4) (2012)
409–419.
[18] P. Vincent, H. Larochelle, Y. Bengio, P.-A. Manzagol, Extracting and com-
posing robust features with denoising autoencoders, in: Proceedings of the
25th international conference on Machine learning, ACM, 2008, pp. 1096–
1103.
[19] K. Jarrett, K. Kavukcuoglu, M. Ranzato, Y. LeCun, What is the best
multi-stage architecture for object recognition?, in: Computer Vision, 2009 IEEE
12th International Conference on, IEEE, 2009, pp. 2146–2153.
[20] H. Lee, C. Ekanadham, A. Ng, Sparse deep belief net model for visual
area v2, in: Advances in neural information processing systems, 2007, pp.
873–880.
[21] J. Yu, Y. Rui, Y. Tang, D. Tao, High-order distance-based multiview
stochastic learning in image classification., IEEE transactions on cyber-
netics 44 (12) (2014) 2431.
[22] A. Coates, A. Ng, Selecting receptive fields in deep networks, in: Advances
in Neural Information Processing Systems, 2011, pp. 2528–2536.
[23] T. Ojala, M. Pietikainen, D. Harwood, A comparative study of texture
measures with classification based on featured distributions, Pattern recog-
nition 29 (1) (1996) 51–59.
[24] Z. Guo, L. Zhang, D. Zhang, A completed modeling of local binary pattern
operator for texture classification, Image Processing, IEEE Transactions
on 19 (6) (2010) 1657–1663.
[25] A. Suruliandi, K. Meena, R. Reena Rose, Local binary pattern and its
derivatives for face recognition, Computer Vision, IET 6 (5) (2012) 480–
488.
[26] T. Ojala, K. Valkealahti, E. Oja, M. Pietikainen, Texture discrimination
with multidimensional distributions of signed gray-level differences, Pattern
Recognition 34 (3) (2001) 727–739.
[27] M. Topi, O. Timo, P. Matti, S. Maricor, Robust texture classification by
subsets of local binary patterns, in: Pattern Recognition, 2000. Proceed-
ings. 15th International Conference on, Vol. 3, IEEE, 2000, pp. 935–938.
[28] D. Maturana, D. Mery, A. Soto, Face recognition with decision tree-based
local binary patterns, in: Computer Vision–ACCV 2010, Springer, 2011,
pp. 618–629.
[29] T. Ahonen, M. Pietikainen, Image description using joint distribution of
filter bank responses, Pattern Recognition Letters 30 (4) (2009) 368–376.
[30] T. Chang, L. Chiu, J. Chen, N. Chang, Fast sift design for real-time visual
feature extraction, Image Processing, IEEE Transactions on.
[31] N. Farrugia, F. Mamalet, S. Roux, F. Yang, M. Paindavoine, Fast and
robust face detection on a parallel optimized architecture implemented on
fpga, Circuits and Systems for Video Technology, IEEE Transactions on
19 (4) (2009) 597–602.
[32] B. Zhang, K. Mei, N. Zheng, Reconfigurable processor for binary image
processing, Circuits and Systems for Video Technology, IEEE Transactions
on 23 (5) (2013) 823–831.
[33] C. Desmouliers, E. Oruklu, S. Aslan, J. Saniie, F. Vallina, Image and video
processing platform for field programmable gate arrays using a high-level
synthesis, IET Computers & Digital Techniques 6 (6) (2012) 414–425.
[34] A. R. Omondi, J. C. Rajapakse, FPGA implementations of neural networks,
Vol. 365, Springer New York, NY, USA:, 2006.
[35] B. Girau, Neural networks on fpgas: a survey.
[36] L. P. Maguire, T. M. McGinnity, B. Glackin, A. Ghani, A. Belatreche,
J. Harkin, Challenges for large-scale implementations of spiking neural net-
works on fpgas, Neurocomputing 71 (1) (2007) 13–29.
[37] A. Muthuramalingam, S. Himavathi, E. Srinivasan, Neural network im-
plementation using fpga: issues and application, International journal of
information technology 4 (2) (2008) 86–92.
[38] C. Farabet, Y. LeCun, K. Kavukcuoglu, E. Culurciello, B. Martini, P. Ak-
selrod, S. Talay, Large-scale FPGA-based convolutional networks, Cam-
bridge, UK: Cambridge University Press, 2011.
[39] J. Schlessman, C.-Y. Chen, W. Wolf, B. Ozer, K. Fujino, K. Itoh,
Hardware/software co-design of an fpga-based embedded tracking system, in:
Computer Vision and Pattern Recognition Workshop, 2006. CVPRW’06.
Conference on, IEEE, 2006, pp. 123–123.
[40] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, D. Ramanan, Object de-
tection with discriminatively trained part-based models, Pattern Analysis
and Machine Intelligence, IEEE Transactions on 32 (9) (2010) 1627–1645.
[41] C. Papageorgiou, T. Poggio, A trainable system for object detection, In-
ternational Journal of Computer Vision 38 (1) (2000) 15–33.
[42] P. Viola, M. J. Jones, Robust real-time face detection, International journal
of computer vision 57 (2) (2004) 137–154.
[43] P. Dollar, C. Wojek, B. Schiele, P. Perona, Pedestrian detection: An eval-
uation of the state of the art, Pattern Analysis and Machine Intelligence,
IEEE Transactions on 34 (4) (2012) 743–761.
[44] C. Garcia, M. Delakis, Convolutional face finder: A neural architecture for
fast and robust face detection, Pattern Analysis and Machine Intelligence,
IEEE Transactions on 26 (11) (2004) 1408–1423.
[45] S. J. Nowlan, J. C. Platt, A convolutional neural network hand tracker,
Advances in Neural Information Processing Systems (1995) 901–908.
[46] M. Delakis, C. Garcia, Text detection with convolutional neural networks,
in: VISAPP (2), 2008, pp. 290–294.
[47] Q. Liu, X. Zhao, Z. Hou, Survey of single-target visual tracking methods
based on online learning, IET Computer Vision.
[48] S. Avidan, Ensemble tracking, Pattern Analysis and Machine Intelligence,
IEEE Transactions on 29 (2) (2007) 261–271.
[49] B. Babenko, M.-H. Yang, S. Belongie, Robust object tracking with online
multiple instance learning, Pattern Analysis and Machine Intelligence,
IEEE Transactions on 33 (8) (2011) 1619–1632.
[50] D. Serby, E. Meier, L. Van Gool, Probabilistic object tracking using multi-
ple features, in: Pattern Recognition, 2004. ICPR 2004. Proceedings of the
17th International Conference on, Vol. 2, IEEE, 2004, pp. 184–187.
[51] O. Yilmaz, Oscillatory synchronization model of attention to moving ob-
jects, Neural Networks 29 (2012) 20–36.
[52] W. S. McCulloch, W. Pitts, A logical calculus of the ideas immanent in
nervous activity, The Bulletin of Mathematical Biophysics 5 (4) (1943)
115–133.
[53] T. S. Lee, Image representation using 2d gabor wavelets, Pattern Analysis
and Machine Intelligence, IEEE Transactions on 18 (10) (1996) 959–971.
[54] G. Hinton, J. McClelland, D. Rumelhart, Distributed representations, in:
Parallel distributed processing: explorations in the microstructure of cog-
nition, vol. 1, MIT Press, 1986, pp. 77–109.
[55] O. Yilmaz, Connectionist-symbolic machine intelligence using cellular
automata based reservoir-hyperdimensional computing, arXiv preprint
arXiv:1503.00851.
[56] O. Yilmaz, Classification of occluded objects using fast recurrent process-
ing, under review, Neural Computing and Applications.
[57] A. Krizhevsky, G. Hinton, Learning multiple layers of features from tiny
images, Master’s thesis, Department of Computer Science, University of
Toronto.
[58] A. Vedaldi, B. Fulkerson, VLFeat: An open and portable library of com-
puter vision algorithms, http://www.vlfeat.org/ (2008).
[59] L. Sharan, R. Rosenholtz, E. Adelson, Material perception: What can you
see in a brief glance?, Journal of Vision 9 (8) (2009) 784–784.
[60] N. Goyette, P. Jodoin, F. Porikli, J. Konrad, P. Ishwar, Changedetection.net:
A new change detection benchmark dataset, in: Computer Vision and
Pattern Recognition Workshops (CVPRW), 2012 IEEE Computer Society
Conference on, IEEE, 2012, pp. 1–8.
[61] L. Wan, M. Zeiler, S. Zhang, Y. L. Cun, R. Fergus, Regularization of neu-
ral networks using dropconnect, in: Proceedings of the 30th International
Conference on Machine Learning (ICML-13), 2013, pp. 1058–1066.
[62] I. J. Goodfellow, D. Warde-Farley, M. Mirza, A. Courville, Y. Bengio,
Maxout networks, arXiv preprint arXiv:1302.4389.
[63] T. Schenk, R. D. McIntosh, Do we have independent visual streams for
perception and action?, Cognitive Neuroscience 1 (1) (2010) 52–62.
[64] H. Liu, F. Sun, Y. Yu, Multitask extreme learning machine for visual track-
ing, Cognitive Computation 6 (3) (2014) 391–404.
[65] B. Settles, Active learning literature survey, University of Wisconsin, Madi-
son.
[66] M. A. Goodale, A. D. Milner, Separate visual pathways for perception and
action, Trends in neurosciences 15 (1) (1992) 20–25.
[67] K. Mikolajczyk, C. Schmid, A performance evaluation of local descriptors,
Pattern Analysis and Machine Intelligence, IEEE Transactions on 27 (10)
(2005) 1615–1630.
[68] K. Mikolajczyk, T. Tuytelaars, C. Schmid, A. Zisserman, J. Matas,
F. Schaffalitzky, T. Kadir, L. Van Gool, A comparison of affine region
detectors, International journal of computer vision 65 (1-2) (2005) 43–72.
[69] S. Kim, S. Kwon, I. S. Kweon, A perceptual visual feature extraction
method achieved by imitating v1 and v4 of the human visual system, Cog-
nitive Computation 5 (4) (2013) 610–628.
Supplementary Material: Applications of the Framework
Ozgur Yilmaz1
Department of Computer Engineering, Turgut Ozal University, Ankara, Turkey
Ismail Ozsarac
Department of Electronics Design, MGEO Division, Aselsan Inc., Akyurt, Ankara, 06011,
Turkey.
Omer Gunay
Department of Electronics Design, MGEO Division, Aselsan Inc., Akyurt, Ankara, 06011,
Turkey.
Huseyin Ozkan
Department of Image Processing, MGEO Division, Aselsan Inc., Akyurt, Ankara, 06011,Turkey.
Abstract
In this supplementary material we provide the details of the experiments we did
to demonstrate different applications of the proposed framework.
1. Detection of Crossroads from Aerial Images
We have tested the ability of the proposed system for object detection from
high altitude aerial images. This is an essential capability for an autonomous
UAV, in tandem with tracking of detected objects. The sacrifice in detection
capability due to simplification of the FPGA neural network algorithm needs to be
tested. Crossroads are used as the test objects, for which detection is a
challenging task due to appearance variability, clutter and occlusion. Even though
it is an important source of information for road detection/tracking purposes,
it is not exploited often enough in these studies (see [1]). 600 annotated crossroad
images were used in the training and the positive image set was further
populated by including artificially rotated, scaled and translated versions of the
crossroad images [2]. There are about 120,000 crossroad class instances, and
250,000 background class instances randomly cropped from the images (see footnote 2). Every
image in the training set was first resized to a 32 × 32 pixel square. The receptive
fields are learned using k-means on this dataset of resized images; then feature
vectors, v, are extracted with the learned receptive fields. A linear SVM classifier
is trained using the set of labeled feature vectors. The tests were performed
on 13 separate images of size 1000 × 1000, using a multi-scale (4 scales: 1.2, 1,
0.8, 0.6 scale factors) sliding window method. We use conservative criteria for
precision-recall measures: a detection declared by the algorithm is
accepted as a correct detection if its center falls inside a crossroad object region,
and a crossroad object is counted as detected if its center falls inside a declared
detection. The precision-recall curves are given for real and binary neural
networks in Fig. 1a. A sample image from the test set is shown in Fig. 1b
and Fig. 1c, for real and binary neural networks respectively. As mentioned
in section 5.2 in the main text, the multiscale detection algorithm can run on
a fraction of the FPGA chip. The sacrifice in detection performance (from 50
percent precision to 40 percent precision) due to binarization follows what is
observed in classification experiments (Fig. 5 in the main text).
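The center-in-region matching criteria described above can be sketched as follows. This is a minimal illustration that assumes axis-aligned boxes given as (x0, y0, x1, y1) tuples and treats box borders as inside; the function names are hypothetical and the original evaluation code may differ in such details.

```python
def center(box):
    """Center point of an axis-aligned box (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2.0, (y0 + y1) / 2.0)


def contains(box, point):
    """True if the point lies inside the box (borders included)."""
    x0, y0, x1, y1 = box
    px, py = point
    return x0 <= px <= x1 and y0 <= py <= y1


def precision_recall(detections, ground_truth):
    """Precision and recall under the center-in-region criteria.

    A detection is correct if its center falls inside some ground truth box;
    a ground truth object is found if its center falls inside some detection.
    """
    correct = sum(
        1 for d in detections if any(contains(g, center(d)) for g in ground_truth)
    )
    found = sum(
        1 for g in ground_truth if any(contains(d, center(g)) for d in detections)
    )
    precision = correct / len(detections) if detections else 0.0
    recall = found / len(ground_truth) if ground_truth else 0.0
    return precision, recall
```

Sweeping the SVM decision threshold, recomputing the detection list at each value and calling `precision_recall` would trace out a curve of the kind shown in Fig. 1a.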
2. Classification of Ships from Aerial Thermal Images
We wanted to test the binary algorithm for a less challenging object classifica-
tion task. Ship classification is essential for an autonomous UAV system that is
deployed over the sea, for general surveillance purposes and specifically for detec-
tion of illegal immigrant activities on fish boats (RECONSURVE Project, ITEA
2Please contact the corresponding author for access to the dataset.
[Figure 1 panels: (a) plot of precision (vertical axis) versus recall (horizontal axis) with two curves, Real and Binary; (b) and (c) sample test images.]
Figure 1: Crossroad detection performance of the classification algorithm in a sliding window
search framework. (a) Precision-Recall curves for real and binary neurons (b) A sample
crossroad detection test image for real neuron. Green rectangles are the ground truth, red
rectangles are the declared detections, yellow pixels represent the true detections. (c) A sample
crossroad detection test image for binary neuron.
2). We introduce a new ship thermal image dataset (see footnote 3) (Fig. 2), in which 15
different ships are imaged from a low altitude UAV (ASELSAN(TM) thermal
camera), covering 360 degree views (at least 70 images per ship). The ship types are
chosen to cover a wide spectrum of ship appearances. Segmentation
of the ships over the sea background is straightforward on thermal images
(for both day and night operation), which reduces the problem from search based
detection to classification, and the classification performance is 99% for both
real and binary neural units. Our findings indicate that the classification
performance of the binary algorithm converges to the real neuron performance for less
challenging classification tasks. Also, it is shown that ship identification and
detection can be performed successfully from a low altitude UAV platform.
Figure 2: Some of the ships that are used in the ship classification task. There are 15 different
ship classes, each of which contains around 70 images. The images are acquired using an
ASELSAN(TM) thermal camera mounted on a low altitude UAV.
3. Object Tracking
3.1. Object Tracking Algorithm
A neural network based classification framework is utilized for frame by frame
detection of the tracked object. Naturally, there are two phases of this tracking
algorithm: training and detection. In the training phase (Fig. 3b), the tracked object
and its immediate background are used to create a database of feature vectors
(v). There are two classes in this database, target and background. Using only
one image for the target class is an option, but in order to enforce robustness and
to compute the sub-pixel location of the target, multiple target images are used.
Target images that are left-right and bottom-top shifted versions of the target
are cropped from the image (Fig. 3a, left). This set of images will have similar
feature representations since they are only a few pixels apart from each other;
however, they will enable multiple target detections for robustness and
localization precision. For the background class, again multiple images are cropped
that represent the background of the target image (Fig. 3a, right). Regardless
of the target object size, these sets of images are resized to a constant square
3Please contact the corresponding author for access to the dataset. See supplementary
materials for images of all classes.
Figure 3: (a) The image patches used in training are shown in boxes. Left: Target image
patches that are a few pixels apart from the actual target image. Right: Background image
patches that represent the background class. (b) Training phase flowchart of object tracking
algorithm. (c) Detection phase flowchart of object tracking algorithm.
region for keeping the computational load fixed. The 4K dimensional feature
vector v is extracted for each of these target and background images. Then
these labeled feature vectors are used to train the linear SVM classifier. The
classifier is periodically updated during tracking to enable plasticity for appearance
changes. There are appearance changes in the scene not only due to the target but
also due to occlusion and clutter. False changes in the classifier due to external
noise factors cause drift in tracking algorithms. In order to prevent this, a sanity
check mechanism is used during training: if the change (angle change
of the hyperplane) in the SVM classifier exceeds a certain threshold, the training is
rejected.
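A 4K dimensional vector is what one obtains by summing the K hidden unit activities separately over the four quadrants of the resized image, as in the single layer network of [7]. The sketch below illustrates that reading only; the quadrant layout, the ordering of the output and the function name are assumptions, since the exact summation scheme is defined in the main text and not repeated here.

```python
def quadrant_sum_pool(activations):
    """Sum-pool per-pixel hidden activities over four image quadrants.

    activations[y][x] is a list of K binary (0/1) hidden unit outputs at pixel (y, x).
    Returns a list of length 4K ordered as top-left, top-right, bottom-left, bottom-right.
    """
    height = len(activations)
    width = len(activations[0])
    k = len(activations[0][0])
    half_y, half_x = height // 2, width // 2
    pooled = [0] * (4 * k)
    for y in range(height):
        for x in range(width):
            # Quadrant index: 0 = TL, 1 = TR, 2 = BL, 3 = BR.
            quadrant = (2 if y >= half_y else 0) + (1 if x >= half_x else 0)
            base = quadrant * k
            for j, a in enumerate(activations[y][x]):
                pooled[base + j] += a
    return pooled
```

Because the activities are binary, each entry of the pooled vector is simply a count of active units in a quadrant, which is the kind of operation that maps to adders rather than multipliers.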
Periodically, the SVM classifier is trained, and in the following frames it can
be used to label image regions in the neighborhood of the target using the
standard object detection procedure: sliding window based search (Fig. 3c). Target
image patches are detected using the SVM classifier and the centers of these
images are labeled as target pixels. The average location of these target
pixels gives the target location in the current frame. If the number of target pixels
is below a certain threshold, the target is declared lost. However, it is possible to
misclassify background patches as target due to appearance similarities, and false
target detections cause tracking errors. These misclassifications are on
average expected to be spatially separated from correct classifications, because
they most likely originate from an object similar to the target that is in the
search region but spatially distinct. Then, multiple spatial clusters will emerge
for putative target pixels. It is essential to reject misclassified pixels, since
failure to do so would cause jumps and drifts. In order to reject incorrect target
pixels, spatial clustering is performed on target pixel locations. To determine
the number of clusters, we use the Akaike information criterion (AIC, [3]) by fitting
a mixture of Gaussians to the target pixel locations. Once the most suitable
number of mixture components is determined by AIC, the Gaussian with the closest center
to the previously known target center is assigned as the "correct" cluster, and
the rest are rejected. This procedure rejects spurious detections due to clutter.
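The localization step can be sketched as below. It is only an illustration under stated assumptions: the Gaussian mixture fit itself is not shown, so the per-pixel cluster labels and the log-likelihoods are taken as given inputs; the parameter count assumes full covariance matrices; cluster means stand in for the Gaussian centers; and `min_pixels` is a hypothetical threshold.

```python
import math

def gmm_aic(log_likelihood, n_components, dim=2):
    """AIC = 2k - 2 ln L for a Gaussian mixture (full covariances assumed)."""
    n_params = ((n_components - 1)                            # mixture weights
                + n_components * dim                          # means
                + n_components * dim * (dim + 1) // 2)        # covariance entries
    return 2 * n_params - 2.0 * log_likelihood

def select_num_clusters(log_likelihoods):
    """log_likelihoods: dict {n_components: ln L}; returns the AIC minimizer."""
    return min(log_likelihoods, key=lambda m: gmm_aic(log_likelihoods[m], m))

def mean_location(pixels):
    n = len(pixels)
    return (sum(x for x, _ in pixels) / n, sum(y for _, y in pixels) / n)

def locate_target(pixels, labels, prev_center, min_pixels=5):
    """pixels: list of (x, y) target pixels; labels: cluster index per pixel.
    Returns the center of the cluster closest to prev_center, or None if lost."""
    if len(pixels) < min_pixels:
        return None  # too few target pixels: target lost
    clusters = {}
    for p, k in zip(pixels, labels):
        clusters.setdefault(k, []).append(p)
    centers = {k: mean_location(pts) for k, pts in clusters.items()}
    best = min(centers, key=lambda k: math.hypot(centers[k][0] - prev_center[0],
                                                 centers[k][1] - prev_center[1]))
    return centers[best]
```

With a single cluster this reduces to the plain average of all target pixel locations.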
3.2. Standard Tracking Dataset Experiments
Tracking function in an unmanned system is essential for carrying out
complex tasks requiring temporal continuity. The proposed tracking algorithm is
tested on standard sequences [4, 5] commonly used in the literature (Table 1).
The success metrics in the literature vary, but three metrics are frequently used:
center location error, 20 pixel precision and coverage. The center location errors
of MILTrack [5], the proposed algorithm (non-simplified, original) and the Bolme tracker
(also known as MOSSE, [6]) are given in Table 2. The 20 pixel precision of these
trackers and of the Circulant tracker [7] is given in Table 3. Best results are
highlighted in red color.
Table 1: Tracking sequences [5]
Track             Difficulty
Sylvester         Illumination, pose, scale change, 3D camera motion
David Indoor      Illumination, pose, scale change, 3D camera motion
Cola Can          Specular object, pose change, occlusion
Occluded Face     Occlusion, moving camera
Occluded Face 2   Heavy appearance change and occlusion
Surfer            Low contrast and appearance change
Tiger 1           Fast motion, frequent occlusions, motion blur
Tiger 2           Fast motion, frequent occlusions, motion blur
Coupon Book       Heavy appearance change, serious clutter
Apart from these, more recent studies report marginal performance increases
for some of the sequences:
[8] reported center location errors of 3.8 and 3.6 pixels for the Occluded Face 2 and
David sequences, respectively.
[9] reported center location errors of 8 and 12 pixels for the Tiger 1 and Cola Can
sequences, respectively.
Tracking algorithms specifically designed for one tracking difficulty (e.g. occlusion)
or one specific object (e.g. face) exist, which yield superior performance
for one or two sequences, but fail for most of the rest. Hence, the errors
for these algorithms are not considered in the comparison.
Table 2: Tracking center location error (red: best, green: 2nd best)
Sequence          MILTrack   Bolme   Proposed
Sylvester         11         36      8
David Indoor      23         9       10
Cola Can          20         24      13
Occluded Face     27         89      15
Occluded Face 2   20         7       30
Surfer            11         93      7
Tiger 1           16         49      11
Tiger 2           18         34      9
Coupon Book       15         5       7
Table 3: Tracking precision for 20 pixel error (red: best, green: 2nd best)
Sequence          MILTrack   Bolme   Circulant   Proposed
Sylvester         0.9        0.53    1.0         0.94
David Indoor      0.52       1.0     0.49        0.98
Cola Can          0.55       0.34    1.0         0.95
Occluded Face     0.43       0.07    1.0         0.73
Occluded Face 2   0.6        1.0     1.0         0.44
Surfer            0.93       0.04    0.99        0.96
Tiger 1           0.81       0.18    0.61        0.9
Tiger 2           0.83       0.26    0.63        0.86
Coupon Book       0.69       1.0     1.0         1.0
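For reference, the two metrics reported in Tables 2 and 3 can be computed as sketched below. This is a hedged illustration using the common definitions (Euclidean distance between tracked and annotated centers, and the fraction of frames whose error is within the threshold); the function names are hypothetical and the exact evaluation protocol used for the tables may differ, e.g. in whether the threshold comparison is strict.

```python
import math

def center_errors(tracked, truth):
    """Per-frame Euclidean distance between tracked and ground-truth centers."""
    return [math.hypot(tx - gx, ty - gy)
            for (tx, ty), (gx, gy) in zip(tracked, truth)]

def mean_center_error(tracked, truth):
    """Average center location error in pixels over all annotated frames."""
    errs = center_errors(tracked, truth)
    return sum(errs) / len(errs)

def precision_at(tracked, truth, threshold=20.0):
    """Fraction of frames whose center error is within `threshold` pixels."""
    errs = center_errors(tracked, truth)
    return sum(1 for e in errs if e <= threshold) / len(errs)
```

A tracker that stays on the object but is consistently offset by more than 20 pixels scores a low precision while never losing the track, which is the situation discussed for the "Occluded Face 2" sequence.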
The algorithm successfully tracks the objects in all nine sequences, with the
best performance for six of them. Although the proposed algorithm seems to fail
for the "Occluded Face 2" sequence on the reported metrics, close analysis shows
that the tracker drifts to the edge of the actual object (face) in the beginning and
keeps tracking throughout the sequence. The coverage metric (0.33 F measure
threshold) is 1.0, which shows that the track was never lost but the center
location error was larger than 20 pixels, causing a lower precision for this
sequence. The same observation holds for the "Occluded Face" sequence, in
which the coverage is also 1.0. The MOSSE tracker seems to fail for six of the sequences
but achieves top results for the other three. The reason for this inconsistency
is hard to locate. However, it is not due to poor selection of parameters, because
an exhaustive search was performed in the parameter space to maximize its
performance. The Circulant tracker gives superior performance over Bolme, showing
the best performance for six of the sequences, but gives poor performance for the
remaining three sequences. However, we emphasize that the overall performance of
the proposed algorithm is more balanced, since it gives good precision for almost all
the sequences.
3.3. Thermal Aerial Images
The intended platform of the designed system uses thermal imagery. For
this purpose, we introduce a comprehensive thermal video set and test our
algorithm on this dataset (footnote 4). There are a total of 33 tracks, annotated every 10-20
frames. The videos are acquired using ASELSAN™ thermal imaging products,
some of which are mounted on UAVs for aerial surveillance. Vehicles, people and
apartments/regions (Fig. 4) are tracked in a wide variety of scenarios that differ in
object size and type, viewing angle, occlusion, clutter and noise, feature intensity,
appearance change and abrupt motion. The proposed tracking algorithm is tested on
this dataset for both real and binary neural units (Table 4), investigating the
effect of algorithmic simplification (binarization) on tracking performance. In
the analysis, individual track results are grouped (a single track can contribute
to multiple groups) according to tracking difficulties, since it is hard to analyze
all 33 tracks individually. The results show that the performance drop due to binarization
is not significant overall; on the contrary, the binary neural network performs
significantly better for some of the categories (occlusion). There is a significant
drop in performance for long run vehicle tracks and abrupt vehicle turns due to
binarization, which suggests a weakness in handling sudden appearance changes
in the simplified algorithm.
4 Please contact the corresponding author for access to the dataset. See supplementary
materials for a detailed description.
Figure 4: The thermal object tracking dataset introduced in this paper contains 33 tracks,
4 of which are shown as examples. See supplementary materials for a thorough description.
(a) Person. (b) Vehicle, oblique angle, large size. (c) Vehicle, orthographic, small size. (d)
Vehicle, medium size, high clutter due to traffic.
Table 4: Tracking coverage, thermal video set (red: best)
Sequences               Proposed   Proposed Binary
People                  0.64       0.87
Vehicle Medium          0.70       0.65
Vehicle Small           0.80       0.71
Vehicle Traffic         0.42       0.56
Vehicle Long Run        0.75       0.43
Apartment and Region    1.0        1.0
Occlusion               0.69       0.92
Clutter                 0.58       0.65
Poor Feature            0.71       0.65
Abrupt Turn             0.52       0.33
Sudden Camera Motion    0.75       0.75
References
[1] J. Cheng, T. Jin, X. Ku, J. Sun, Road junction extraction in high-resolution
SAR images via morphological detection and shape identification, Remote
Sensing Letters 4 (3) (2013) 296–305.
[2] D. Ciresan, U. Meier, J. Schmidhuber, Multi-column deep neural networks
for image classification, in: Computer Vision and Pattern Recognition
(CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 3642–3649.
[3] H. Akaike, A new look at the statistical model identification, Automatic
Control, IEEE Transactions on 19 (6) (1974) 716–723.
[4] A. Adam, E. Rivlin, I. Shimshoni, Robust fragments-based tracking using
the integral histogram, in: Computer Vision and Pattern Recognition, 2006
IEEE Computer Society Conference on, Vol. 1, IEEE, 2006, pp. 798–805.
[5] B. Babenko, M.-H. Yang, S. Belongie, Robust object tracking with online
multiple instance learning, Pattern Analysis and Machine Intelligence, IEEE
Transactions on 33 (8) (2011) 1619–1632.
[6] D. S. Bolme, J. R. Beveridge, B. A. Draper, Y. M. Lui, Visual object tracking
using adaptive correlation filters, in: Computer Vision and Pattern Recog-
nition (CVPR), 2010 IEEE Conference on, IEEE, 2010, pp. 2544–2550.
[7] J. F. Henriques, R. Caseiro, P. Martins, J. Batista, Exploiting the circulant
structure of tracking-by-detection with kernels, in: Computer Vision–ECCV
2012, Springer, 2012, pp. 702–715.
[8] X. Jia, H. Lu, M.-H. Yang, Visual tracking via adaptive structural local
sparse appearance model, in: Computer Vision and Pattern Recognition
(CVPR), 2012 IEEE Conference on, IEEE, 2012, pp. 1822–1829.
[9] Y. Bai, M. Tang, Robust tracking via weakly supervised ranking svm, in:
Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference
on, IEEE, 2012, pp. 1854–1861.
Supplementary Material: FPGA Design
Ozgur Yilmaz1
Department of Computer Engineering, Turgut Ozal University, Ankara, Turkey
Ismail Ozsarac
Department of Electronics Design, MGEO Division, Aselsan Inc., Akyurt, Ankara, 06011,
Turkey.
Omer Gunay
Department of Electronics Design, MGEO Division, Aselsan Inc., Akyurt, Ankara, 06011,
Turkey.
Huseyin Ozkan
Department of Image Processing, MGEO Division, Aselsan Inc., Akyurt, Ankara, 06011,Turkey.
Abstract
In this supplementary material we provide the details of our FPGA design.
The implementation in FPGA is realized as an IP core (Fig. 1). Due to the
inherent parallelism:
1. Feature volume (X × Y × D) can be constructed for the whole video frame.
2. Feature vector v can be calculated for many image regions (i.e. sliding
windows for object detection) simultaneously.
3. The calculated feature vectors can be classified in parallel.
The parallelism supplied by the FPGA core is exploited for search based
object detection and tracking applications. After computing the feature volume,
FPGA receives coordinates of image regions from a CPU, which are queries and
1Corresponding author
Preprint submitted to Journal of LATEX Templates December 2, 2014
[Figure 1 block diagram: IP CORE containing MEMORY INTERFACE and IMAGE ANALYZER (FEATURE EXTRACTION, FEATURE SUMMATION, CLASSIFICATION). Inputs: VIDEO, FEATURE DICTIONARY, CLASS MATRIX, FEATURE CALCULATION REQUESTS. Outputs: FEATURE VECTORS, CLASS LABELS.]
Figure 1: FPGA IP core structure. The memory interface handles data read/write operations
with the external memories. There are three main stages of the Image Analyzer block. The feature
dictionary and classification matrix are pre-uploaded into the FPGA memory. Video frames
and feature calculation requests are constantly fed into the core during operation.
need to be classified according to the pre-uploaded classifier (matrix). FPGA
outputs the class labels of the regions of interest (e.g. sliding window object
detection queries or object tracking queries), and also the feature vectors of the
query regions. The core consists of two main sub-blocks: Image Analyzer and
Memory Interface. Memory Interface is responsible for data transfer between
the Analyzer and external memories. The Image Analyzer block consists of three
sub-blocks (Fig. 1): Feature Extraction, Feature Summation and Classification.
The Image Analyzer block receives four types of inputs: video, feature dictionary
(Fig. 3a), class matrix (Fig. 5a) and feature calculation requests. The feature
dictionary and classification matrix are pre-uploaded into the FPGA memory.
Video frames and feature calculation requests are constantly fed into the core
during operation. The resolution of the video is X (row) by Y (column) and
the frame rate is considered to be higher than 10 Hz.
[Figure 2 flowchart, FEATURE EXTRACTION: VIDEO → TAKE PATCH → CONSTRUCT P VECTOR → COMPUTE MEAN VALUE OF P → CONSTRUCT BINARY PB VECTOR → CALCULATE BIT FLIPPING DISTANCE VECTOR (DV) OF PB WITH DICTIONARY D → COMPUTE MEAN VALUE OF DV → COMPUTE STANDARD DEVIATION VALUE OF DV → COMPUTE ACTIVATION THRESHOLD OF DV → COMPUTE PIXEL FEATURE VECTOR (PFV) OF DV → PIXEL FEATURE VECTORS.]
Figure 2: Feature extraction block. It consists of many sub-blocks. The video frame is fed into
this block and the output is the hidden layer volume representation of the frame, which consists
of a K length feature vector for each pixel in the frame.
1. Feature Extraction
Feature extraction block preemptively builds the hidden layer representation
of the whole video frame (i.e. for every pixel). This block (Fig. 2) starts
with the "take patch" process (Fig. 3b). Notice that a patch is a W × W small
image region, the same size as the receptive fields (RF). To capture the related
pixels, the incoming video line is written to a LINE FIFO. According to the
patch dimension (W), the "take patch" process uses W LINE FIFOs. Each incoming
video line is first written to LINE FIFO-W, and then when the next video
line arrives, the previous one is read from LINE FIFO-W and written to
LINE FIFO-(W − 1). These steps continue until all LINE FIFOs are filled with
the lines necessary to construct the patch. When all lines are available, with
the next incoming line, pixel values are read from the FIFOs. After W read
operations the patch is ready for the later operations. The (W + 1)th read from
the LINE FIFOs gives the patch of the next pixel. These steps continue until all
patches are captured along a line. During patch reads from the FIFOs, new
[Figure 3 diagrams: (a) X by Y frame with a W by W patch, and dictionary D with entries D11 to DTK and columns DC1 to DCK; (b) TAKE PATCH with LINE FIFO-1 to LINE FIFO-W; (c) P vector with entries L1P1 to LWPW; (d) comparison of P with Pµ giving the T-bit vector PB; (e) CALCULATE BIT FLIPPING DISTANCE: XOR of PB with DC1 to DCK and count of "1"s giving DV1 to DVK; (f) COMPUTE PIXEL FEATURE VECTOR: comparison of DV1 to DVK with AT giving PFV1 to PFVK.]
Figure 3: Sub-routines and data structures of the feature extraction block. (a) X by Y frame
and a single patch of size W by W is shown on top, and dictionary D that consists of K
receptive fields is shown in the bottom. (b) Take patch sub-block, that captures a patch
from the image (c) Construct P vector sub-block, that vectorizes the patch. (d) PB vector
construction block, that binarizes the vector. (e) Distance vector calculation sub-block, that
computes the Hamming distance between the binary vector PB and the binary receptive
fields in the dictionary. (f) Pixel feature vector calculation sub-block, that computes the
sparse hidden layer representation for each pixel value.
lines continue to move to the upper FIFOs. This movement shifts the patch
downward through the frame.
The patch is vectorized and the P vector is constructed by using the captured
patch pixel values (Fig. 3c). This construction process is a register assignment.
There are W × W registers from L1P1 to LWPW and every register keeps the
related pixel value. The bit size of the registers is determined by the maximum
possible pixel value (8 bits in general).
The mean value of the P vector is needed for binarization. To calculate the mean
patch value Pµ, every pixel value in the patch is added and the sum is then divided
by the total number of pixels. The addition can be realized by
adders; the number of inputs per adder can differ according to the FPGA
capability. The adder input count affects the pipeline clock latency and the
number of adders used. After all pixel values are added, the total is divided by
W × W.
After calculating Pµ, each entry of the P vector is compared with Pµ and
binarized to construct the vector PB (Fig. 3d). The binarization step is essential
for realizing this neural network algorithm in currently available FPGAs. For
the values that are less than Pµ, '0' is assigned; otherwise, '1' is assigned. After
all values are compared with the mean value, the binary vector PB is obtained. PB is
a T by 1 bit vector, where T equals W × W.
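A software sketch of this binarization step is given below for illustration. It assumes row-major vectorization of the patch and compares each pixel against the exact mean by cross-multiplication rather than with a hardware divider, so results may differ from a truncating divider when a pixel equals the truncated mean. The function name is hypothetical.

```python
def binarize_patch(patch):
    """patch: W x W list of lists of integer pixel values.
    Returns PB as a list of T = W*W bits (row-major order assumed)."""
    p = [value for row in patch for value in row]   # P vector
    total = sum(p)
    n = len(p)
    # value < mean  <=>  value * n < total, which avoids the division.
    return [0 if value * n < total else 1 for value in p]
```

For example, the 2 × 2 patch [[1, 2], [3, 4]] has mean 2.5 and is binarized to [0, 0, 1, 1].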
Every binary vector constructed from the patches in an image is
transformed into a feature vector (hidden layer representation) using a pre-computed
dictionary that has K visual words. The dictionary D is a T by K
bit binary matrix. The columns of the D matrix (DC1 to DCK) are stored in
internal registers DCX of the FPGA (Fig. 3a). The dictionary is loaded to the FPGA
by means of communication interfaces such as PCI, VME etc. The entries of
the dictionary can be updated at any time since they are stored in internal
registers.
Bit flipping (Hamming) distance calculation measures the dissimilarity between
two binary vectors of the same size: PB and every column DC of D (Fig. 3e). If
two corresponding entries of PB and DC are the same, '0' is assigned, otherwise '1' is assigned.
This operation is realized by XOR blocks. The total number of 1 values after
the XOR operation is the distance between the two binary vectors; a smaller count
means a higher similarity. DV
contains the Hamming distance of a single PB vector to all the visual words
in the dictionary. The entries of DV keep the total number of 1s, so they are
integer values. DV is an H by K bit vector, where H is the minimum number of bits
that can represent the scalar value T.
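The XOR-and-count operation can be sketched in software as follows. This is an illustrative sketch in which each T-bit vector is packed into a Python integer; the helper names are hypothetical and the bit ordering is an arbitrary choice, since the Hamming distance does not depend on it as long as PB and the dictionary columns use the same order.

```python
def pack_bits(bits):
    """Pack a list of 0/1 values into an integer, first bit most significant."""
    value = 0
    for b in bits:
        value = (value << 1) | b
    return value

def distance_vector(pb, dictionary):
    """pb: packed binary patch vector; dictionary: list of K packed columns DC.
    Returns DV, the Hamming distance of pb to every dictionary column."""
    # XOR marks the differing bit positions; counting the 1s gives the distance.
    return [bin(pb ^ dc).count("1") for dc in dictionary]
```

With T = 16 each entry of DV lies between 0 and 16, which is why H = 5 bits are needed per entry.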
Vector DV represents the inverse activity of each neuron in the dictionary
for a patch, and it needs to be sparsified and binarized. The mean value of DV
is computed similarly to that of P. To calculate the standard deviation of DV,
DVµ is subtracted from each entry of DV. Then the square of each difference
is calculated and all the squares are added. Then the total value is divided
by K. Finally, the square root is calculated and DVσ is obtained. The activation
threshold AT is calculated by subtracting SPARSITYMULTIPLIER × DVσ
from DVµ. This adaptive threshold is used to construct a sparse representation
(from DV) by nullifying the distance values larger than this threshold.
To construct the pixel feature vector (PFV in Fig. 3f), each entry of DV
is compared with AT. If the entry is greater than AT then '0' is assigned to the
related entry of PFV; if it is smaller, '1' is assigned. The result is a 1 by K pixel
feature vector PFV. For each pixel in a video frame, a 1 by K bit vector (pixel
feature vector PFV) is obtained, and thus the hidden layer volume is constructed.
The computed PFVs are sent to the memory interface to be written to external
memories.
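The thresholding described above can be sketched as follows. This is a floating point illustration, not the fixed point hardware arithmetic: the default sparsity multiplier of 1.0 is an assumed value, and assigning '0' when an entry equals AT is also an assumption, since the text only specifies the greater and smaller cases.

```python
import math

def activation_threshold(dv, sparsity_multiplier=1.0):
    """AT = mean(DV) - sparsity_multiplier * std(DV), population std over K."""
    k = len(dv)
    mean = sum(dv) / k
    std = math.sqrt(sum((d - mean) ** 2 for d in dv) / k)
    return mean - sparsity_multiplier * std

def pixel_feature_vector(dv, sparsity_multiplier=1.0):
    """Binary, sparse PFV: 1 where the Hamming distance is below AT, else 0."""
    at = activation_threshold(dv, sparsity_multiplier)
    return [1 if d < at else 0 for d in dv]
```

For DV = [8, 8, 8, 2] the mean is 6.5 and the standard deviation is about 2.6, so AT is about 3.9 and only the last visual word is active.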
2. Feature Summation
Once the sparse hidden layer activity of each pixel is computed in the
Feature Extraction block (Fig. 2), a feature vector of an image region can
be calculated by spatial summation based dimensionality reduction. This
procedure divides the region into four quadrants, sums the pixel feature vectors
(PFV) inside each quadrant and concatenates the 4 summations to arrive at the
feature vector (FV) of an image region. This feature vector is then forwarded
to the Classification block or the CPU (Fig. 1). Many of these feature vectors
can be computed in parallel for multiple image regions of interest, for object
detection and tracking purposes, and these requests are communicated to the FPGA
through a specific interface.
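A direct (non-accelerated) version of this quadrant pooling can be sketched as below. It is an illustration under assumptions: the region is given by inclusive upper-left and lower-right pixel coordinates, the quadrants are ordered upper-left, upper-right, lower-left, lower-right, and odd-sized regions are split with the extra row or column going to the first half. The function name is hypothetical.

```python
def region_feature_vector(pfv, top, left, bottom, right):
    """pfv[r][c] is the K-bit pixel feature vector (list of 0/1) at row r, col c.
    Returns the 4K-length feature vector FV of the region (inclusive corners)."""
    k = len(pfv[top][left])
    mid_r = (top + bottom + 1) // 2
    mid_c = (left + right + 1) // 2
    quadrants = [(top, mid_r, left, mid_c),             # Q1: upper-left
                 (top, mid_r, mid_c, right + 1),        # Q2: upper-right
                 (mid_r, bottom + 1, left, mid_c),      # Q3: lower-left
                 (mid_r, bottom + 1, mid_c, right + 1)] # Q4: lower-right
    fv = []
    for r0, r1, c0, c1 in quadrants:
        sums = [0] * k
        for r in range(r0, r1):
            for c in range(c0, c1):
                for i, bit in enumerate(pfv[r][c]):
                    sums[i] += bit
        fv.extend(sums)
    return fv
```

Each entry of FV counts how many pixels in a quadrant activate a given visual word, so the vector length is 4 × K regardless of the region size.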
The feature calculation requests are written to the Feature Calculation
Request FIFO (Fig. 4a), given as pixel coordinates of the region of interest. The
CPU sends the coordinates of two border pixels (upper-left and lower-right,
black dots) and the FPGA calculates the remaining coordinates (white dots in Fig. 4d)
of the sub-regions. According to the pixel coordinates, the Internal RAM
addresses are calculated by the Address Calculator block (Fig. 4a). This block
knows the content of the RAM, namely the line coordinates that are stored. To
make the calculations faster, the PFV values are read from external memory
and written to the Internal RAM. The RAM can store R × X × K bits of data, where
R is the maximum number of lines that can be processed at a time (Fig. 4b).
An integral image is used to speed up feature summation requests, which are likely
to cover multiple overlapping regions (Fig. 4c). In that case, the integral image
avoids duplicate summation operations. The integral vector calculator reads the
necessary PFVs from the RAM to calculate the integral vector. Notice that
the PFV volume is three dimensional: two spatial dimensions and one feature
dimension. Each integral vector (IV) entry is the summation of all entries of the
previous PFVs along both the horizontal and vertical dimensions. The integration
operations form an integral image for each feature dimension. Then the sums for
each quadrant can be computed separately by 4 additions (QIV in Fig. 4d).
Since there exist four quadrants (Q1, Q2, Q3 and Q4), all quadrant results are
concatenated and the final feature vector FV is obtained. The FV is a G × S bit
vector, where S is the minimum number of bits that can store the count of 1s in a quadrant
and G = 4 × K. The vector is stored in the internal RAM of the FPGA. This Feature
Vector (FV) is a discriminative and efficient representation of the image patch
defined by the border coordinates (black dots in Fig. 4d), and it can further be
used for classification and clustering purposes, executed either in the FPGA or in
the CPU via memory transfer.
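The integral image idea can be sketched for a single feature dimension as follows. This is a generic textbook construction shown for illustration, not the FPGA datapath: the extra zero row and column are a software convenience, and the function names are hypothetical. The hardware applies the same operation to each of the K feature dimensions.

```python
def integral_image(channel):
    """channel: 2D list of 0/1 activations for one feature dimension.
    Returns iv with one extra zero row/column, where iv[r+1][c+1] is the
    sum of channel[0..r][0..c]."""
    rows, cols = len(channel), len(channel[0])
    iv = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        row_sum = 0
        for c in range(cols):
            row_sum += channel[r][c]
            iv[r + 1][c + 1] = iv[r][c + 1] + row_sum
    return iv

def box_sum(iv, top, left, bottom, right):
    """Sum over the inclusive rectangle using four integral image lookups."""
    return (iv[bottom + 1][right + 1] - iv[top][right + 1]
            - iv[bottom + 1][left] + iv[top][left])
```

Once `iv` is built, every quadrant sum costs four lookups regardless of the quadrant size, which is what makes many overlapping sliding window requests cheap.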
[Figure 4 diagrams: (a) FEATURE SUMMATION block with FEATURE CALCULATION REQUEST FIFO, ADDRESS CALCULATOR, INTERNAL RAM, MEMORY INTERFACE and INTEGRAL VECTOR (IV) CALCULATOR producing FV; (b) INTERNAL RAM holding X × R entries of K-bit PFVs (PFV1 to PFVK); (c) accumulation of PFV entries into IV entries (IV1 to IVK); (d) quadrants Q1 to Q4 with QIV1 to QIV4 concatenated into FV of length G.]
Figure 4: Sub-routines and data structures of the feature summation. (a) Overall diagram of
the feature summation. After receiving feature calculation requests, feature summation block
computes feature vector of an image region via spatial pooling based dimensionality reduction.
(b) Internal RAM structure. (c) Integral image is formed for efficient spatial pooling operation
on multiple image region feature vector calculation requests. (d) The spatial pooling reduces
to addition operations once the integral image is formed, and this is performed separately in
parallel for the 4 quadrants (QIV ). Then the summations are concatenated to form feature
vector (FV ) representation of the image region defined by the corner coordinates.
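[Figure 5 diagrams: (a) classification matrix C with entries C11 to CJG and columns CC1 to CCG; (b) CLASSIFICATION block: ROW ARBITER selects rows of the C MATRIX, entries are multiplied with FV1 to FVG and added to give CL1 to CLJ.]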
Figure 5: Sub-routines and data structures of the Classification block. (a) Classification matrix
that is used for linear classification. (b) Classification matrix is multiplied with feature vector
FV to arrive at class likelihood vector, CL.
When pooling operations on the requested coordinates are finished, the RAM is
updated with new lines, and new pooling calculations are started. These processes
are controlled by the integral vector calculator with the aid of the address calculator.
3. Classification
The Classification block generates a class label likelihood vector using a linear
classification method. It performs matrix-vector multiplication of the class matrix
C (Fig. 5a) with FV. The class matrix C is loaded to the FPGA just like the feature
dictionary D. The Row Arbiter controls the C matrix row management for the FV
multiplication. The C matrix is a J × G × S bit matrix, where J is the number
of trained classes, G is the feature dimension and S is the bit precision. The
result is the class label likelihood vector CL. The entries CLX of CL are
the sums of the products of FV with the rows of C. The CL can be sent to
the CPU for further processing, i.e. classification/detection decision, or a max
operation can be applied to assign a class label, which is the max index of the
vector CL.
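The classification step amounts to a matrix-vector product followed by an optional argmax, as in the sketch below. Function names are hypothetical, plain Python integers stand in for the fixed-width hardware values, and ties are resolved in favor of the lowest class index, which is an assumption.

```python
def class_likelihoods(c_matrix, fv):
    """c_matrix: J rows of G weights; fv: feature vector of length G.
    Returns CL, one score per class (dot product of each row with fv)."""
    return [sum(w * x for w, x in zip(row, fv)) for row in c_matrix]

def predict_label(c_matrix, fv):
    """Index of the largest entry of CL (first one on ties)."""
    cl = class_likelihoods(c_matrix, fv)
    return max(range(len(cl)), key=cl.__getitem__)
```

For C = [[1, 0, -1], [0, 2, 0]] and FV = [3, 2, 1], the likelihood vector is [2, 4] and the assigned label is class 1.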
[Figure 6 block diagram, DEDICATED HARDWARE: FPGA (STRATIX V, 5SGXA7) connected to DDR3 SDRAM (64 bit @ 600 MHz) and to the CPU over PCIe (2.5 Gbps).]
Figure 6: The algorithm is implemented on dedicated hardware using a Stratix V (5SGXA7)
ALTERA™ FPGA chip. It consists of memory, CPU and the FPGA chip, providing fast
communication among the components.
4. Hardware Usage and Timing
The algorithm is implemented on dedicated hardware based on a Stratix V (5SGXA7)
ALTERA™ FPGA (Fig. 6). Table 1 shows the FPGA and hardware resource
usage (footnote 2) of the object detection implementation for 640 × 512 video at 50 Hz,
an RF size of 4 pixels and 20,000 sliding windows of 64 × 64 pixels, which
corresponds to at least 5 different scales (1.3, 1.2, 1, 0.8, 0.6) of exhaustive object
search with an 8 pixel shift between windows. Therefore, an effective and fast
(> 10 Hz) embedded object detection framework that is to be deployed on UAVs
can be constructed using a fraction of an FPGA chip. Notice that object
detection and object tracking (> 20 objects) can be executed simultaneously
using less than 20% of the FPGA resources, the rest of which can be utilized for
other tasks such as salient motion detection, sparse feature extraction and
visual odometry. Saliency of the detected motion/change can be determined using
the same classification framework. Moving pixels can be analyzed for saliency
(pre-determined object class detection) in a multi-scale manner via appropriate
CPU-FPGA communication, for less than 10% of the FPGA resources. Sparse
2 See supplementary materials for the detailed timing and hardware usage analysis.
feature extraction requires further computation on the hidden layer activities of
individual pixels; however, the computational complexity of this additional stage
is predicted to be low.
The timing of calculations and delay analysis is as follows. The hidden layer
activity calculations can be realized in pipeline order until the end of the Feature
Extraction block. There is only pipeline latency, and a PFV can be calculated in
less than 1 µs after the patch is available. The Feature Summation (calculation of
FV) block needs to read PFVs from external memory and store them in the
internal memory. The external and internal memory read operations introduce
a lag; however, the frame delay due to this lag is mitigated by using multiple buffers
and an optimal order of requests calculated and communicated by the CPU. That is,
the FPGA fetches portions of the frame data from external into internal memory in
a given order, and the CPU feature summation requests are synchronized with this order.
The Classification block requires FVs to be computed. After the FVs are ready, the class
label likelihoods CL can be calculated in less than 1 µs. Therefore, the system operates
in real time without any frame delay.
In summary, for less than 30% of the FPGA resources, object detection,
object tracking (multiple), salient motion detection and scene recognition can
be executed with satisfactory accuracy. Sparse feature extraction needs to be
worked out and analyzed to understand its resource usage.
Table 1: FPGA and Hardware Resource Usage Summary
Property               Used          Available      Occupied (%)
Logic                  30,000        622,000        5
Internal RAM           250 × M20K    2560 × M20K    10
Multiplier (18 × 18)   40            512            8
Ext. Memory Bandw.     6 Gbps        75 Gbps        8
CPU Interface Bandw.   0.1 Gbps      2.5 Gbps       4
FPGA Resource Usage and Timing
The FPGA implementation is analyzed for the 640 x 512 @ 50Hz (20ms frame period) video on a
dedicated hardware.
Fig.1 : Frame and patch dimensions.
[Block diagram, DEDICATED HARDWARE: FPGA (STRATIX V, 5SGXA7) connected to DDR3 SDRAM (64 bit @ 600 MHz) and to the CPU over PCIe (2.5 Gbps).]
Fig.2 : Structure of the dedicated hardware.
FPGA properties are detailed below in TABLE 1.
TABLE.1 : Selected FPGA Properties
Property       Quantity
Logic          622,000 LE
Internal RAM   2,560*M20K
Multiplier     512*18x18
TABLE.2 : Selected Hardware Properties
Property                    Quantity
External Memory Bandwidth   64 x 600 x 2 = 75 Gbps
CPU Interface Bandwidth     2.5 Gbps
[Fig.1 labels: X = 640, Y = 512, K = 4, @ 50Hz, Pixel CLK = 20 MHz]
Feature Extraction
TABLE.3 : FPGA Resource Usage and Timing

TAKE PATCH
Parameters: X = 512, Y = 640, W = 4, CLK = Pixel CLK (20 MHz), Pixel Depth = 8 bit
Property          Quantity
Logic             120 lut/256 reg
Internal Memory   4 * M20K
Latency*          ~128.5 us
*This latency occurs only at the beginning of the frame

Pµ
Parameters: CLK = Pixel CLK (20 MHz)
Property    Quantity
Logic       5 * 4 input adder, 1 * 12 bit divider
Latency**   ~0.2 us
**Pipeline latency

PB
Parameters: T = 16, CLK = Pixel CLK (20 MHz)
Property    Quantity
Logic       16 * IF check
Latency**   ~0.05 us

DICTIONARY D
Parameters: K = 100, CLK = Pixel CLK (20 MHz)
Property   Quantity
Logic      100 * 16 bit register

DV
Parameters: T = 16, H = 5, CLK = 5 * Pixel CLK = 100 MHz
Property    Quantity
Logic       20 * 16bit XOR, 20*16* If check, 20*16* 4bit counter, 100 * 5 bit register
Latency**   ~0.2 us

DVµ
Parameters: CLK = Pixel CLK (20 MHz)
Property    Quantity
Logic       30 * 4 input adder, 1 * 12 bit divider
Latency**   ~0.3 us

DVσ
Parameters: CLK = 5 * Pixel CLK = 100 MHz
Property     Quantity
Logic        10 * 5 bit subtractor, 6 * 4 input adder, 1 * 13 bit divider, 1 * 8 bit square root operation
Multiplier   10*9x9 mult
Latency**    ~0.05 us

AT
Parameters: CLK = Pixel CLK (20 MHz)
Property     Quantity
Logic        1 * 5 bit subtractor
Multiplier   1*9x9 mult
Latency**    ~0.01 us

PFV
Parameters: CLK = 5 * Pixel CLK = 100 MHz, 100 Bit per Pixel
Property          Quantity
Logic             20*If check
External Memory   100 x 20 = 2000 Mbps = 1.95 Gbps
Latency**         ~0.1 us
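The bandwidth figures quoted in these tables can be reproduced with the short calculation below. It is a sketch under one assumption inferred from the numbers rather than stated in the text: the tabulated Gbps values appear to use 1 Gbps = 1024 Mbps (2000 Mbps is listed as 1.95 Gbps and 64 x 600 x 2 Mbps as 75 Gbps). The function names are hypothetical.

```python
def to_gbps(mbps, mbps_per_gbps=1024):
    """Convert Mbps to Gbps; the 1024 divisor is inferred from the tables."""
    return mbps / mbps_per_gbps

def ddr_peak_mbps(bus_bits=64, clock_mhz=600, transfers_per_clock=2):
    """Peak DDR3 bandwidth: bus width x clock x 2 transfers per clock."""
    return bus_bits * clock_mhz * transfers_per_clock

def pfv_stream_mbps(bits_per_pixel=100, pixel_clock_mhz=20):
    """Rate at which K = 100 bit PFVs are produced at the 20 MHz pixel clock."""
    return bits_per_pixel * pixel_clock_mhz

peak = to_gbps(ddr_peak_mbps())      # 75.0 Gbps
stream = to_gbps(pfv_stream_mbps())  # about 1.95 Gbps
```

Under these assumptions a single PFV stream uses roughly 2.6% of the peak external memory bandwidth.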
Feature Summation
INTERNAL RAM
Parameters: R = 64, CLK = 10 * Pixel CLK = 200 MHz
Property          Quantity
External Memory   100 x 20 = 2000 Mbps = 1.95 Gbps
Internal Memory   200 * M20K

REQUEST FIFO
REQUEST: 5000 requests at a time; can be updated by new requests when the previous
requests are fulfilled.
Requests are the 10 bit row and 10 bit column coordinates of two pixels, 40 bits in total.
Property          Quantity
Internal Memory   10 * M20K

ADDRESS CALCULATOR
Parameters: CLK = 10 * Pixel CLK = 200 MHz
Property     Quantity
Logic        RAM address management logic
Multiplier   4*18x18 mult
Latency**    ~0.05 us

IV CALCULATOR
Parameters: CLK = 15 * Pixel CLK = 300 MHz, Maximum quadrant size = 32 x 32, S = 10 bit, G = 400
The given latency is for the calculation of 4*32x32 quadrants. For other dimensions use the
formula 4 x QSize x QSize x 3.3 ns (300 MHz).
The Internal Memory has two separate read ports, so it is possible to calculate two quadrants
at the same time.
Since all the PFVs are stored in external memory, the frame latency (20 ms) can be used to
calculate requests.
Property          Quantity
Logic             RAM read operations, 100 * 2 input adder
Internal Memory   2 * M20K
Latency           ~6.8 us
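The quoted IV latency can be checked against the stated formula with the sketch below. The halving for the two read ports is an interpretation made here to reconcile the formula (about 13.6 us for four 32 x 32 quadrants at 300 MHz) with the tabulated ~6.8 us; it is not stated explicitly in the text. The function name is hypothetical.

```python
def iv_latency_us(q_size, clock_mhz=300.0, read_ports=2, quadrants=4):
    """Latency of one feature vector request in microseconds.
    One RAM read per quadrant pixel per clock cycle is assumed, with the
    quadrants split evenly across the available read ports."""
    cycles = quadrants * q_size * q_size
    return cycles / clock_mhz / read_ports
```

For the maximum 32 x 32 quadrant size this gives about 6.83 us per request, and about 13.65 us if only one read port is used.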
Classification
CLASS C
Parameters: J = 10, G = 400, CXX is 10 bit
Property          Quantity
Internal Memory   2 * M20K

CLASSIFICATION
Parameters: CLK = 15 * Pixel CLK = 300 MHz
The given latency is for the calculation of one CL.
Property     Quantity
Logic        20 * 2 input adder
Multiplier   20*18x18 mult
Latency      ~0.66 us
The calculations can be realized in pipeline order until the end of the Feature Extraction block. There
is only pipeline latency, and PFVs can be calculated in less than 1 us after the patch is available.
The Feature Summation block needs to read PFVs from external memory and store them in the internal
memories. Because the requests from the CPU are ordered and synchronized with the memory
transfers, the resulting lag does not cause frame delay. Classification is similar to Feature Summation and it requires
FVs. After an FV is ready, the class label CL can be calculated in less than 1 us.
It is possible to achieve a frame rate higher than 10 Hz with over 20k feature calculation requests of 32x32 quadrant size. Below is
the resource usage for this configuration.
TABLE.4 : FPGA Usage Summary
Property       Quantity
Logic          ~ 30,000
Internal RAM   ~ 250*M20K
Multiplier     ~ 40*18x18
TABLE.5 : Hardware Usage Summary
Property                    Quantity
External Memory Bandwidth   ~ 6 Gbps
CPU Interface Bandwidth     ~ 0.1 Gbps
Thermal Object Tracking Dataset
The images are acquired using an image grabber attached to different ASELSAN thermal imaging
products. The acquired images in the dataset are not raw images but they are quantized, histogram
equalized and enhanced. More detail on the image sources can be provided on demand.
Description of Tracks
A short description and a representative image (below the description) are given for each track in
this section. The description includes the object type, object size, scene characteristics, viewing
angle and the difficulties of the sequence (clutter, occlusion etc.). In the images, the bounding box
(black) and the center of the track (white dot) are overlaid.
Track 1: Person, large object, urban scene, partial occlusion, clutter, non-rigid motion.
Track 2: Person, large object, urban scene, abrupt orientation change, non-rigid motion.
Track 3: Vehicle, large object, urban scene, full occlusion, oblique view.
Track 4: Vehicle, large object, urban scene, full occlusion, oblique view.
Track 5: Vehicle, large object, urban scene, full occlusion, oblique view.
Track 6: Vehicle, medium object, urban scene, full occlusion, oblique view.
Track 7: Vehicle, large object, urban scene, full occlusion, intensity change, oblique view.
Track 8: Vehicle, tiny object, urban scene, abrupt orientation change, clutter, oblique view.
Track 9: Vehicle, tiny object, urban scene, low contrast, high clutter, oblique view.
Track 10: Vehicle, tiny object, urban scene, low contrast, high clutter, oblique view.
Track 11: Vehicle, medium object, urban scene, clutter, partial occlusion, oblique view.
Track 12: Vehicle, small object, rural scene, low contrast, clutter, partial occlusion, oblique view.
Track 13: Vehicle, tiny object, urban scene, air view.
Track 14: Vehicle, small object, urban scene, long run, clutter, partial occlusion, sudden jump, air
view.
Track 15: Vehicle, small object, urban scene, sudden jump, air view.
Track 16: Vehicle, small object, urban scene, smooth orientation change, air view.
Track 17: Vehicle, small object, urban scene, clutter, air view.
Track 18: Vehicle, tiny object, urban scene, high contrast, no clutter, air view.
Track 19: Vehicle, tiny object, urban scene, high contrast, high clutter: similar object, air view.
Track 20: Vehicle, tiny object, urban scene, high contrast, sudden jump, air view.
Track 21: Apartment, medium size, urban scene, texture high, no clutter, air view.
Track 22: Apartment, medium size, urban scene, texture high, high clutter: similar object, air view.
Track 23: Region, large size, urban scene, texture low, smooth orientation change, air view.
Track 24: Region, medium size, urban scene, texture high, air view.
Track 25: Vehicle, small object, urban scene, long run, clutter, sudden jump, orientation and scale
change, air view.
Track 26: Vehicle, tiny object, urban scene, clutter, orientation change, low contrast, air view.
Track 27: Vehicle, small object, urban scene, abrupt orientation change, air view.
Track 28: Vehicle, small object, urban scene, smooth orientation change, clutter, air view.
Track 29: Vehicle, small object, urban scene, smooth orientation change, clutter, air view.
Track 30: Vehicle, small object, urban scene, full occlusion, high clutter, air view.
Track 31: Vehicle, tiny object, urban scene, low contrast, high clutter, air view.
Track 32: Vehicle, medium object, urban scene, high clutter, air view.
Track 33: Vehicle, medium object, urban scene, high clutter, smooth orientation change, contrast
change, air view.
SCENARIO VARIETY
In this section different scenarios in the dataset will be highlighted. A subset of the dataset can be
used in algorithmic studies according to imaging system or operational needs.
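Selecting such a subset amounts to filtering the track descriptions by their attributes. A minimal sketch is given below. It assumes the descriptions are available as plain text lines in the "Track N: tag, tag, ..." form used above. The helper names `parse_tracks` and `select` are hypothetical, and only a few of the tracks are copied in for illustration.

```python
TRACKS = """\
Track 1: Person, large object, urban scene, partial occlusion, clutter, non-rigid motion.
Track 3: Vehicle, large object, urban scene, full occlusion, oblique view.
Track 13: Vehicle, tiny object, urban scene, air view.
Track 18: Vehicle, tiny object, urban scene, high contrast, no clutter, air view.
Track 30: Vehicle, small object, urban scene, full occlusion, high clutter, air view.
"""

def parse_tracks(text):
    """Map track number -> list of lower-case attribute tags."""
    tracks = {}
    for line in text.strip().splitlines():
        head, _, body = line.partition(":")       # split at the first colon only
        track_id = int(head.split()[1])
        tags = [t.strip().lower() for t in body.rstrip(". ").split(",")]
        tracks[track_id] = tags
    return tracks

def select(tracks, required=(), excluded=()):
    """Track numbers that carry every required tag and none of the excluded ones."""
    return sorted(
        tid for tid, tags in tracks.items()
        if all(r in tags for r in required) and not any(e in tags for e in excluded)
    )

tracks = parse_tracks(TRACKS)
air_full_occlusion = select(tracks, required=["air view", "full occlusion"])
tiny_air = select(tracks, required=["tiny object", "air view"])
print(air_full_occlusion, tiny_air)
```

Tags are matched exactly, so "clutter" and "high clutter" are treated as different attributes. A study that wants both would list both tags.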
Oblique View
The scene is viewed from the side in some of the tracks. This is always the case for ASIR systems and
sometimes the case for 300T systems, depending on the orientation of the gimbals. Oblique view may
show perspective effects if there are nearby objects. This viewing angle is also more prone to occlusion.
Air view
The scene is viewed from above in some of the sequences. This case generally produces affine image
motion, and occlusion is rare. However, the object size may become very small due to distance, and
clutter/noise can cause problems.
Object Size and Type
Vehicles, people, apartments and regions are tracked in the sequences. Among the object types,
vehicles show the largest variation in size.
Feature Intensity
The amount of image features (edge, corner, blob etc.) determines the richness of the target. The
higher the feature intensity, the better the target can be discriminated from the background and the
better target tracking performs. The feature intensity varies in the dataset: there are very poor
targets, especially at very small sizes, and very rich targets, specifically regions and apartments.
Occlusion
Both partial and full occlusions are observed in the tracks. Recovering from full occlusion is a
challenge for tracking algorithms, and it can be tested with the dataset. Occlusion happens for objects
of different sizes, so there is variety in occlusion scenarios.
Clutter and Noise
Clutter can be defined as structured noise, and it generally stems from other objects near the
tracked object. The strength of the clutter and its similarity to the target determine the difficulty of
the scenario. Image noise also varies in the dataset according to zoom level, the time of day the image
is acquired and many other factors. The SNR is not calculated in the dataset but it can be computed
using standard techniques in the literature.
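One common way to attach a signal-to-noise figure to an annotated target is the contrast between the mean intensity inside the bounding box and the mean of a surrounding background ring, divided by the standard deviation of that ring. The sketch below illustrates this definition on a synthetic image. It is only one of several possible measures and is not prescribed by the dataset. The function names, the box convention and the ring width `margin` are assumptions made for illustration.

```python
from statistics import mean, pstdev

def region_pixels(image, box):
    """Pixels inside box = (top, left, bottom, right); bottom/right are exclusive."""
    top, left, bottom, right = box
    return [image[r][c] for r in range(top, bottom) for c in range(left, right)]

def target_snr(image, box, margin=2):
    """|mean(target) - mean(background)| / std(background).

    The background is the ring of width `margin` around the box, clipped to
    the image borders.
    """
    top, left, bottom, right = box
    height, width = len(image), len(image[0])
    outer = (max(top - margin, 0), max(left - margin, 0),
             min(bottom + margin, height), min(right + margin, width))
    target = region_pixels(image, box)
    background = [
        image[r][c]
        for r in range(outer[0], outer[2])
        for c in range(outer[1], outer[3])
        if not (top <= r < bottom and left <= c < right)
    ]
    sigma = pstdev(background)
    if sigma == 0:
        return float("inf")  # perfectly flat background
    return abs(mean(target) - mean(background)) / sigma

# Synthetic example: textured background (values 10 and 12) with a bright 2x2 target.
image = [[10 + 2 * ((r + c) % 2) for c in range(8)] for r in range(8)]
for r in range(3, 5):
    for c in range(3, 5):
        image[r][c] = 50
snr = target_snr(image, (3, 3, 5, 5))
print(f"SNR = {snr:.1f}")
```

For real frames the same function could be applied to the annotated bounding box of each track, with the ring width chosen relative to the object size.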
Appearance Change
Sometimes the appearance of the tracked object alters due to orientation change or intensity
changes, the latter probably caused by image enhancement. The orientation change happens smoothly
or abruptly in different tracks, and this is specified in the track description.
Abrupt Motion
In some of the tracks the camera makes a sudden orientation change, which produces abrupt image
motion. It is important to handle these situations since they may occur during operation due to
stabilization errors or user intervention.
Thermal Ship Classification Dataset
Fifteen different ships are imaged using an ASELSAN thermal camera mounted on a low-altitude UAV.
A 360-degree view of each ship is captured, and a sample image for each ship is shown below.