Date posted: 16-Mar-2018
Category: Technology
Uploaded by: dmytro-mishkin
Visual Object Tracking Belka
Dmytro Mishkin
Center for Machine Perception
Czech Technical University in Prague
ducha.aiki@gmail.com
Kyiv, Ukraine
2017
My background
• PhD student at the Czech Technical University in Prague
  Now working full-time on deep learning
  Recent paper "All you need is a good init" was added to the
  Stanford CS231n course
• Ex-CTO of Clear Research (clear.sx), 2014-2017
• Co-founder of Szkocka Research Group, a Ukrainian open
  research community for computer science
  https://www.facebook.com/groups/839064726190162
  Currently supervising one project related to local
  feature learning
• Reviewer for the most impactful computer vision
  journals, TPAMI and IJCV
This lecture is heavily based on the tutorial
"Visual Tracking"
by Jiri Matas
"... Although tracking itself is by and
large a solved problem ..."
-- Jianbo Shi & Carlo Tomasi, CVPR 1994 --
Application domains of Visual Tracking
• monitoring, assistance, surveillance,
  control, defense
• robotics, autonomous car driving,
  rescue
• measurements: medicine, sport,
  biology, meteorology
• human-computer interaction
• augmented reality
• film production and postproduction:
  motion capture, editing
• management of video content: indexing,
  search
• action and activity recognition
• image stabilization
• mobile applications
• camera "tracking"
Slide credit: Jiri Matas
Applications, applications, applications ...
Slide credit: Jiri Matas
Tracking Applications ...
- Team sports: game analysis, player statistics, video annotation, ...
Slide credit: Jiri Matas
Sport examples
http://cvlab.epfl.ch/~lepetit
http://www.dartfish.com/en/media-gallery/videos/index.htm
Slide credit: Patrick Perez
Model-based Tracking: People and Faces
http://cvlab.epfl.ch/research/completed/realtime_tracking
http://www.cs.brown.edu/~black/3Dtracking.html
Slide credit: Patrick Perez
Tracking is commonly used in practice ...
• Tracking has been a popular research topic for decades:
  see CVPR, ICCV, ECCV, ...
• But there is no online course devoted to tracking ...
• nor much coverage in computer vision courses,
• nor is it well covered in textbooks
Is it clear what tracking is?
Video credit: Helmut Grabner
Slide credit: Jiri Matas
Tracking: Formulation - Literature
Surprisingly little is said about tracking in standard textbooks.
Coverage is limited to optic flow plus some basic trackers, e.g. Lucas-Kanade.
Definition (0):
[Forsyth and Ponce, Computer Vision: A Modern Approach, 2003]
"Tracking is the problem of generating an inference about the
motion of an object given a sequence of images.
Good solutions of this problem have a variety of applications ..."
Slide credit: Jiri Matas
Formulation (1): Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes:
• The concept of an "object" in the F&P definition disappeared
• If an algorithm correctly established such correspondences,
  would that be a perfect tracker?
• tracking = motion estimation
Slide credit: Jiri Matas
Tracking is Motion Estimation / Optic Flow
Slide credit: Jiri Matas
Motion "pattern", camera tracking
Dense motion field
http://www.cs.cmu.edu/~saada/Projects/CrowdSegmentation
http://www.youtube.com/watch?v=ckVQrwYIjAs
Sparse motion field estimate
Slide credit: Jiri Matas
Motion field
• The motion field is the projection of the 3D scene
  motion into the image
Slide credit: James Hays
Tracking is Motion Estimation / Optic Flow
Slide credit: Jason Corso
Optic Flow
Standard formulation:
• At every pixel, a 2D displacement is estimated between consecutive frames
Missing:
• occlusion / disocclusion handling: pixels visible in one image only
  - in the standard formulation, "don't know" is not an answer
• considering the 3D nature of the world
• large displacement handling - only recently addressed (EpicFlow, 2015)
Practical issues hindering progress in optic flow:
• is the ground truth ever known?
  - learning and performance evaluation are problematic (synthetic sequences?)
• requires generic regularization (smoothing)
• failure (assumption validity) is not easy to detect
In certain applications, tracking is motion estimation on a part of the image
with specific constraints: augmented reality, sports analysis
Formulation (1): Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes:
• The concept of an "object" in the F&P definition disappeared
• If an algorithm correctly established such correspondences,
  would that be a perfect tracker?
• tracking = motion estimation
Consider the Bolt sequence
Definition (2): Tracking
Given an initial estimate of its position,
locate X in a sequence of images,
where X may mean:
• A (rectangular) region
• An "interest point" and its neighbourhood
• An "object"
This definition is adopted e.g. in a recent book by
Maggio and Cavallaro, Video Tracking, 2011
Smeulders, T-PAMI 2013:
Tracking is the analysis of video sequences for the
purpose of establishing the location of the target
over a sequence of frames (time), starting from
the bounding box given in the first frame
Formulation (3): Tracking as Segmentation
J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010
Tracking as model-based segmentation
Tracking as segmentation
http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
http://www2.imm.dtu.dk/~aam/tracking
• heart
A "standard" CV tracking method output
Approximate motion estimation, approximate segmentation.
Neither good optic flow nor precise segmentation is required.
Formulation (4): Tracking
Given an initial estimate of the pose and state of X,
in all images in a sequence (in a causal manner):
1. estimate the pose and state of X
2. (optionally) update the model of X
• Pose: any geometric parameter (position, scale, ...)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
  - a priori information (appearance, shape, dynamics, ...)
  - labeled data ("track this") + unlabeled data = the sequences
• Causal: for estimation at time T, use only information from times t ≤ T
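The two-step loop of Formulation (4) can be sketched in a few lines of Python. This is only an illustrative toy, not code from the lecture: frames are assumed to be 1D lists of intensities, the pose is a single index, the model is one intensity value, and the names `track`, `estimate_nearest` and `update_blend` are hypothetical.

```python
def track(frames, init_pose, estimate, update=None):
    """Causal tracking loop: the pose at frame T uses only frames t <= T."""
    model = frames[0][init_pose]      # toy appearance model: target intensity
    pose = init_pose
    poses = [pose]
    for frame in frames[1:]:
        pose = estimate(model, pose, frame)         # step 1: estimate pose
        if update is not None:
            model = update(model, frame, pose)      # step 2: optional update
        poses.append(pose)
    return poses


def estimate_nearest(model, prev_pose, frame, radius=2):
    """Search a window around the previous pose for the best-matching value."""
    lo = max(0, prev_pose - radius)
    hi = min(len(frame), prev_pose + radius + 1)
    return min(range(lo, hi), key=lambda i: abs(frame[i] - model))


def update_blend(model, frame, pose, rate=0.5):
    """Convex combination of the old model and the newly observed appearance."""
    return (1.0 - rate) * model + rate * frame[pose]


# The bright target moves right and slowly fades.
frames = [
    [0, 9, 0, 0, 0, 0],
    [0, 0, 8, 0, 0, 0],
    [0, 0, 0, 0, 7, 0],
]
print(track(frames, 1, estimate_nearest, update_blend))
```

Passing `update=None` gives the "no update" variant, in which the model stays fixed at its first-frame value.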
Tracking for "Black Mirror" blocking
Tracking in 6D
Tracking-Learning-Detection (TLD)
A "miracle": Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof, On-line boosting and vision, CVPR 2006
Tracking the "Invisible"
H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010
video
Other Tracking Problems
... multiple object tracking ...
Multi-object Tracking
Other Tracking Problems
Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg
splitting and merging events ...
So I want to track ...
Just link to some computer vision library:
import cv2
or
#include <opencv2/core.hpp>
or
import org.opencv.core.Core;
import org.opencv.core.Mat;
What is there in libraries? OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib
What is there in libraries? BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV
Computer vision librar
Bad news: OpenCV
http://www.votchallenge.net
Reference implementation
But authors often publish their own implementation on GitHub ...
git clone https://github.com/martin-danelljan/Continuous-ConvOp.git
Good news
Good news? Not so
Good news: easy to contribute to open source
Just implement some modern tracking algorithm
So we need to understand how tracking works
Classic:
KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (footnote 1)
• After more than two decades, a project (footnote 2) at CMU was dedicated to this
  single algorithm, and the results were published in a premium journal [1]
• Part of a plethora of computer vision algorithms
Footnote 1: http://www.ces.clemson.edu/~stb/klt
Footnote 2: http://www.ri.cmu.edu/projects/project_515.html
Image alignment
The goal is to align a template image T(x) to an input image I(x), where x is a column
vector containing the image coordinates [x, y]. I(x) could also be a small
subwindow within an image.
Original Lucas-Kanade algorithm I
Original Lucas-Kanade algorithm II
Original Lucas-Kanade algorithm III
Original Lucas-Kanade algorithm IV
Original Lucas-Kanade summary
Slide credit (image alignment and Lucas-Kanade slides): Tomas Svoboda
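The equations of the Lucas-Kanade slides did not survive in this transcript. For orientation, the standard textbook form of the algorithm is sketched below; the warp $W(\mathbf{x};\mathbf{p})$ with parameters $\mathbf{p}$ and the matrix $H$ are notation assumed here, not taken from the slides.

```latex
% Objective: sum of squared differences between the warped image and the template
\begin{align}
  \mathbf{p}^{\ast} &= \arg\min_{\mathbf{p}} \sum_{\mathbf{x}}
      \bigl[ I(W(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \bigr]^{2}
\end{align}
% First-order Taylor expansion around the current estimate p
\begin{align}
  I(W(\mathbf{x};\mathbf{p}+\Delta\mathbf{p})) &\approx
      I(W(\mathbf{x};\mathbf{p}))
      + \nabla I \,\frac{\partial W}{\partial \mathbf{p}}\, \Delta\mathbf{p}
\end{align}
% Gauss-Newton step, iterated until the update is small
\begin{align}
  H &= \sum_{\mathbf{x}}
      \Bigl[ \nabla I \frac{\partial W}{\partial \mathbf{p}} \Bigr]^{\top}
      \Bigl[ \nabla I \frac{\partial W}{\partial \mathbf{p}} \Bigr] \\
  \Delta\mathbf{p} &= H^{-1} \sum_{\mathbf{x}}
      \Bigl[ \nabla I \frac{\partial W}{\partial \mathbf{p}} \Bigr]^{\top}
      \bigl[ T(\mathbf{x}) - I(W(\mathbf{x};\mathbf{p})) \bigr] \\
  \mathbf{p} &\leftarrow \mathbf{p} + \Delta\mathbf{p}
\end{align}
```

For a pure translation, $W(\mathbf{x};\mathbf{p}) = \mathbf{x} + \mathbf{p}$, the Jacobian is the identity and $H$ reduces to the 2x2 matrix of summed gradient products over the patch.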
KLT Tracker
For good conditioning, the patch must be
textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional
  information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
Slide credit: Patrick Perez
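The three cases above can be checked numerically through the eigenvalues of the 2x2 matrix of summed gradient products over the patch (the smaller eigenvalue is the usual "good features to track" score). The sketch below is a minimal pure-Python illustration under assumptions of this note: patches are lists of rows, gradients are central differences, and `structure_tensor_eigs` is a hypothetical helper name.

```python
import math


def structure_tensor_eigs(patch):
    """Return (min_eig, max_eig) of the summed gradient outer products."""
    h, w = len(patch), len(patch[0])
    gxx = gxy = gyy = 0.0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = 0.5 * (patch[r][c + 1] - patch[r][c - 1])  # horizontal gradient
            gy = 0.5 * (patch[r + 1][c] - patch[r - 1][c])  # vertical gradient
            gxx += gx * gx
            gxy += gx * gy
            gyy += gy * gy
    half_trace = 0.5 * (gxx + gyy)
    det = gxx * gyy - gxy * gxy
    disc = math.sqrt(max(half_trace * half_trace - det, 0.0))
    return half_trace - disc, half_trace + disc


uniform = [[5] * 5 for _ in range(5)]                    # no texture
edge = [[0, 0, 1, 1, 1] for _ in range(5)]               # vertical contour
corner = [[1 if (r >= 2 and c >= 2) else 0 for c in range(5)]
          for r in range(5)]                             # corner

for name, patch in [("uniform", uniform), ("edge", edge), ("corner", corner)]:
    print(name, structure_tensor_eigs(patch))
```

Both eigenvalues vanish for the uniform patch, only one is non-zero for the contour (the aperture problem), and both are clearly positive for the corner, so only the corner gives a well-conditioned system.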
Aperture Problem Example
Image from Gary Bradski's slides
If we look through a small hole ...
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image
    pyramid
Slide credit: James Hays
Monitoring quality
- Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
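A minimal sketch of the two checks, assuming patches are equally sized lists of rows and that both thresholds are tuned by hand; `ssd` and `keep_tracklet` are hypothetical names. The affine warp of the initial patch mentioned above is deliberately left out, so the second check here compares against the unwarped initial patch.

```python
def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(patch_a, patch_b)
               for a, b in zip(row_a, row_b))


def keep_tracklet(current, previous, initial, max_step_ssd, max_drift_ssd):
    """Return False when the tracklet should be killed."""
    if ssd(current, previous) > max_step_ssd:     # frame-to-frame match too poor
        return False
    return ssd(current, initial) <= max_drift_ssd  # slow drift from first patch


initial = [[10, 10], [10, 10]]
previous = [[11, 10], [10, 10]]
current = [[12, 10], [10, 10]]
occluded = [[0, 0], [0, 0]]

print(keep_tracklet(current, previous, initial, max_step_ssd=2, max_drift_ssd=5))
print(keep_tracklet(occluded, previous, initial, max_step_ssd=2, max_drift_ssd=5))
```

The second test is what catches gradual drift: each frame-to-frame step can look fine while the patch slowly slides off the original target.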
Characteristics of KLT
• cost function: sum of squared intensity differences
  between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex
  combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with
    multiple parameters
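To make the cost function and its iterative minimization concrete, here is a 1D, translation-only toy in pure Python. It is a sketch under stated assumptions rather than the lecture's implementation: signals are lists of floats, sub-sample values come from linear interpolation, the step is the Gauss-Newton update commonly used for Lucas-Kanade, and `lucas_kanade_1d` is a hypothetical name.

```python
import math


def lucas_kanade_1d(template, signal, start, iters=20):
    """Estimate the offset p such that signal[x + p] matches template[x]."""
    p = float(start)
    for _ in range(iters):
        num = den = 0.0
        for x, t_val in enumerate(template):
            pos = x + p
            i0 = int(math.floor(pos))
            frac = pos - i0
            if i0 < 1 or i0 + 2 >= len(signal):
                continue  # skip samples whose neighbourhood leaves the signal
            # Linearly interpolated intensity and central-difference gradient
            val = (1 - frac) * signal[i0] + frac * signal[i0 + 1]
            grad = 0.5 * ((1 - frac) * (signal[i0 + 1] - signal[i0 - 1])
                          + frac * (signal[i0 + 2] - signal[i0]))
            err = t_val - val            # SSD residual for this sample
            num += grad * err
            den += grad * grad
        if den < 1e-12:
            break                        # untextured window: no information
        step = num / den                 # Gauss-Newton update for translation
        p += step
        if abs(step) < 1e-6:
            break
    return p


# A smooth bump centred at sample 20; the template is a window starting at 12.
signal = [math.exp(-0.5 * ((t - 20) / 3.0) ** 2) for t in range(40)]
template = [signal[x + 12] for x in range(16)]
print(lucas_kanade_1d(template, signal, start=10.5))
```

Starting 1.5 samples away, the estimate moves towards the true offset of 12; a multi-resolution pyramid would be needed for much larger initial errors.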
What about deep learning or CNNs for tracking?
AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016
Recap: CaffeNet architecture
Recent trackers development
https://github.com/foolwood/benchmark_results
CNN, KCF
Discriminative Tracking (Tracking by Detection)
t = 0:
  samples
  labels: +1 +1 +1 -1 -1 -1
  Classifier
t > 0:
  ... Classify subwindows to find the target
Connection to Correlation
The Convolution Theorem:
"Cross-correlation is equivalent to an
element-wise product in the Fourier domain"
  $\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$
• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$
    (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic
  (the window wraps at the image edges)
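The identity can be verified on a short 1D signal with nothing but the standard library. The sketch assumes the convention $y[s] = \sum_t x[t]\, w[(t+s) \bmod n]$ for cyclic cross-correlation and uses a naive $O(n^2)$ DFT in place of an FFT; all function names are hypothetical.

```python
import cmath


def dft(v, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out


def cross_correlation_direct(x, w):
    """Cyclic cross-correlation computed straight from its definition."""
    n = len(x)
    return [sum(x[t] * w[(t + s) % n] for t in range(n)) for s in range(n)]


def cross_correlation_fourier(x, w):
    """Same result via conj(DFT(x)) * DFT(w), element-wise, then inverse DFT."""
    product = [a.conjugate() * b for a, b in zip(dft(x), dft(w))]
    return [c.real for c in dft(product, inverse=True)]


x = [1.0, 2.0, 3.0, 4.0]
w = [0.0, 1.0, 0.0, 0.0]
print(cross_correlation_direct(x, w))
print(cross_correlation_fourier(x, w))
```

The wrap-around in the index `(t + s) % n` is exactly the cyclic behaviour noted in the last bullet.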
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows standard
  operations to be replaced with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression:
  $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$
  with kernel matrix $K$, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and dual space representation
• For many kernels, circulant data $\Rightarrow$ circulant $K$ matrix:
  $K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
  $\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$
  $\Rightarrow$ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are
  $\mathcal{O}(n^2)$ or higher
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
  $k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$
• For the Gaussian kernel it yields
  $\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$
  (multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the last term)
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the
  location of the maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel
  features
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)
Training and detection (Matlab):
function alphaf = train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);
end
function y = detect(alphaf, x, z, sigma)
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));
end
function k = kernel_correlation(x1, x2, sigma)
  % sum over the channel dimension in the kernel computation
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end
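For readers without Matlab, the same three functions can be mirrored on a single-channel 1D signal in pure Python. This is a toy port under assumptions of this note, not a tracker: a naive $O(n^2)$ DFT stands in for `fft2`, the data are synthetic, and the values of `sigma` and `lam` are arbitrary.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive O(n^2) DFT; a real implementation would use an FFT."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x2 and every cyclic shift of x1."""
    n = len(x1)
    prod = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    c = [v.real for v in dft(prod, inverse=True)]      # cyclic cross-correlation
    norms = sum(a * a for a in x1) + sum(b * b for b in x2)
    return [math.exp(-abs(norms - 2.0 * ci) / (sigma ** 2 * n)) for ci in c]


def train(x, y, sigma, lam):
    """Kernel ridge regression solved element-wise in the Fourier domain."""
    kf = dft(kernel_correlation(x, x, sigma))
    return [yf / (k + lam) for yf, k in zip(dft(y), kf)]


def detect(alphaf, x, z, sigma):
    """Classifier response for every cyclic shift of the test window z."""
    kf = dft(kernel_correlation(z, x, sigma))
    return [v.real for v in dft([a * k for a, k in zip(alphaf, kf)], inverse=True)]


n = 16
x = [math.sin(0.7 * t) + 0.5 * math.cos(1.9 * t) for t in range(n)]   # model
y = [math.exp(-0.5 * (min(s, n - s) / 1.5) ** 2) for s in range(n)]   # labels
alphaf = train(x, y, sigma=0.5, lam=1e-3)

shift = 3
z = x[shift:] + x[:shift]            # test window: z[t] = x[(t + shift) mod n]
response = detect(alphaf, x, z, sigma=0.5)
print(max(range(n), key=response.__getitem__))
```

With the sign convention of this snippet the response peaks at index `shift` for `z[t] = x[(t + shift) mod n]`; a real tracker converts that index into a displacement of the bounding box.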
From KCF to Discriminative CF Trackers
Basic
• Henriques et al. - CSK
  - raw grayscale pixel values as features
• Henriques et al. - KCF
  - HoG multi-channel features
Further work
• Danelljan et al. - DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al. - SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize the scale space and normalize each scale to one size by bilinear
    interpolation → only one filter on the normalized size
Discriminative Correlation Filter Trackers
• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows the use of a much larger search region
  - more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT
  2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the
    third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? "The zero-order tracker"
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
[Figure: growth of the VOT community over time, axis ticks at 1500 and 3000, with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked]
• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages, + VOT-TIR paper (69 coauthors)
• ECCV 2016: 141 coauthors, 44 pages, + VOT-TIR paper (70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase in the number of tested trackers

Challenge | Perf. measures   | Dataset size   | Target box | Property  | Trackers tested
VOT2013   | ranks, A, R      | 16, s. manual  | manual     | per frame | 27
VOT2014   | ranks, A, R, EFO | 25, s. manual  | manual     | per frame | 38
VOT2015   | EAO, A, R, EFO   | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016   | EAO, A, R, EFO   | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
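The accuracy measure (A) in the table is based on bounding-box overlap between the tracker output and the ground truth, while robustness (R) counts tracking failures. The overlap part can be sketched as below; the `(x, y, width, height)` box layout and the function names are assumptions of this note, and the VOT toolkit's handling of failures, re-initialization and rotated boxes is not reproduced.

```python
def overlap(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def mean_overlap(predicted, ground_truth):
    """Average per-frame overlap over a sequence of boxes."""
    scores = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(scores) / len(scores)


ground_truth = [(0, 0, 2, 2), (1, 1, 2, 2)]
predicted = [(0, 0, 2, 2), (0, 0, 2, 2)]
print(mean_overlap(predicted, ground_truth))
```

A zero overlap in some frame is the kind of event a robustness measure would count as a failure.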
Class of trackers tested
• Single-object, single-camera
• Short-term:
  - trackers performing without re-detection
• Causality:
  - the tracker is not allowed to use any future frames
• No prior knowledge about the target:
  - only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
Sources:
  ALOV (315 seq.) [Smeulders et al. 2013]
  + PTR (~50 seq.) [Vojir et al. 2013]
  + OTB (~50 seq.) [Wu et al. 2013]
  + >30 new sequences from the VOT2015 committee
  = 443 sequences
Filtered out:
• Grayscale sequences
• <400 pixels targets
• Poorly-defined targets
• Artificially created sequences
(Examples shown: a poorly defined target; an artificially created sequence)
→ 356 sequences
VOT Automatic Dataset Construction Protocol: cluster + sample
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated
  automatically for the 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences
Construction (3/3): Sampling
• Requirement:
  - Diverse visual attributes
  - Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that samples
  challenging sequences while keeping the global
  attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement:
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences:
  Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences:
  Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty - better ground truth:
• Each frame manually per-pixel segmented
• Bounding boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (vertical axis: accuracy) with C-COT, TCNN, SSAT and MLDF labeled]
VOT 2016: Tracking speed
• Top performers are the slowest
  - Plausible cause: CNN
• Real-time bound: Staple+
  - Decent accuracy
  - Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ labeled]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter
  school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University Prague
established 1707
Area of expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3
good students for PhD
THANK YOU
Questions, please
My background
bull PhD student of Czech Technical university in Prague
Now fully working in Deep Learning
recent paper ldquoAll you need is a good initrdquo added to
Stanford CS231n course
bull exCTO of Clear Research (clearsx) 2014-2017
bull Co-founder of Szkocka Research Group Ukrainian open
research community for computer science
httpswwwfacebookcomgroups839064726190162
Currently supervising one project related to local
features learning
bull Reviewer for the most impactful computer vision
journals TPAMI and IJCV
2
Lecture is heavily based on
tutorial
Visual Tracking
by Jiri Matas
bdquohellip Although tracking itself is by and
large solved problemldquo
-- Jianbo Shi amp Carlo Tomasi
CVPR1994 --
Application domains of Visual Tracking
bull monitoring assistance surveillance
control defense
bull robotics autonomous car driving
rescue
bull measurements medicine sport
biology meteorology
bull human computer interaction
bull augmented reality
bull film production and postproduction
motion capture editing
bull management of video content indexing
search
bull action and activity recognition
bull image stabilization
bull mobile applications
bull camera ldquotrackingrdquo4150Slide credit Jiri Matas
Applications applications applications hellip
5150Slide credit Jiri Matas
Tracking Applications hellip
ndash Team sports game analysis player statistics video annotation hellip
6150Slide credit Jiri Matas
Sport examples
httpcvlabepflch~lepetithttpwwwdartfishcomenmedia-galleryvideosindexhtm
Slide Credit Patrick Perez 7150
Model-based Tracking People and Faces
httpcvlabepflchresearchcompletedrealtime_tracking httpwwwcsbrownedu~black3Dtrackinghtml
Slide Credit Patrick Perez 8150
Tracking is commonly used in practicehellip
9150
bull Tracking is popular research topic for decades
see CVPR ICCV ECCV hellip
bull But there is no online course devoted to trackinghellip
bull nor big coverage in computer vision courses
bull nor it is well covered in textbooks
Is it clear what tracking is
video credit
Helmut
Grabner
10150Slide credit Jiri Matas
Tracking Formulation - Literature
Surprisingly little is said about tracking in standard textbooks
Limited to optic flow plus some basic trackers eg Lucas-Kanade
Definition (0)
[Forsyth and Ponce Computer Vision A modern approach 2003]
ldquoTracking is the problem of generating an inference about the
motion of an object given a sequence of images
Good solutions of this problem have a variety of applicationshelliprdquo
11150Slide credit Jiri Matas
Formulation (1) Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes
bull The concept of an ldquoobjectrdquo in FampP definition disappeared
bull If an algorithm correctly established such correspondences
would that be a perfect tracker
bull tracking = motion estimation
12150Slide credit Jiri Matas
Tracking is Motion Estimation Optic Flow
13150Slide credit Jiri Matas
Tracking is Motion Estimation Optic Flow
14150
Tracking is Motion Estimation Optic Flow
Motion ldquopatternrdquo Camera tracking
Dense motion field
httpwwwcscmuedu~saadaProjectsCrowdSegmentation
httpwwwyoutubecomwatchv=ckVQrwYIjAs
Sparse motion field estimate
15150Slide credit Jiri Matas
Tracking is Motion Estimation Optic Flow
16150
Motion field
bull The motion field is the projection of the 3D scene
motion into the image
Slide credit James Hays
Tracking is Motion Estimation Optic Flow
17150Slide credit Jason Corso
Optic Flow
Standard formulation
bull At every pixel 2D displacement is estimated between consecutive frames
Missing
bull occlusion ndash disocclusion handling pixels visible in one image only
- in the standard formulation ldquodonrsquot knowrdquo is not an answer
bull considering the 3D nature of the world
bull large displacement handling - only recently addressed (EpicFlow 2015)
Practical issues hindering progress in optic flow
bull is the ground truth ever known
- learning and performance evaluation problematic (synthetic sequences )
bull requires generic regularization (smoothing)
bull failure (assumption validity) not easy to detect
In certain applications tracking is motion estimation on a part of the image
with specific constraints augmented reality sports analysis 18150
Formulation (1) Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes
bull The concept of an ldquoobjectrdquo in FampP definition disappeared
bull If an algorithm correctly established such correspondences
would that be a perfect tracker
bull tracking = motion estimation
Consider the Bolt sequence
19150
Definition (2) Tracking
Given an initial estimate of its position
locate X in a sequence of images
Where X may mean
bull A (rectangular) region
bull An ldquointerest pointrdquo and its neighbourhood
bull An ldquoobjectrdquo
This definition is adopted eg in a recent book by
Maggio and Cavallaro Video Tracking 2011
Smeulders T-PAMI13
Tracking is the analysis of video sequences for the
purpose of establishing the location of the target
over a sequence of frames (time) starting from
the bounding box given in the first frame
20150
Formulation (3) Tracking as Segmentation
J Fan et al Closed-Loop Adaptation for Robust Tracking ECCV 2010
21150
Tracking as model-based segmentation
22150
Tracking as segmentation
httpvisionucsdedu~kbransonresearchcvpr2005html
httpwww2immdtudk~aamtracking
bull heart
23150
A ldquostandardrdquo CV tracking method output
24150
Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required
Formulation (4) Tracking
Given an initial estimate of the pose and state of X
In all images in a sequence (in a causal manner)
1 estimate the pose and state of X
2 (optionally) update the model of X
bull Pose any geometric parameter (position scale hellip)
bull State appearance shapesegmentation visibility articulations
bull Model update essentially a semi-supervised learning problem
ndash a priori information (appearance shape dynamics hellip)
ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences
bull Causal for estimation at T use information from time t middot T
25150
Tracking for Black Mirror Blocking
26150
Tracking in 6D
27150
Tracking-Learning-Detection (TLD)
28150
A ldquomiraclerdquo Tracking a Transparent Object
video credit
Helmut
Grabner
H Grabner H Bischof On-line boosting and vision CVPR 2006
29150
Tracking the ldquoInvisiblerdquo
H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010
30150
video
Other Tracking Problems
helliphellip multiple object tracking hellip
32150
Multi-object Tracking
Other Tracking Problems
Cell division
httpwwwyoutubecomwatchv=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
httpwwwyoutubecomwatchv=YFKA647w4Jg
splitting and merging events hellip
34150
So I want to track
35150
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
What is here in libraries OpenCV
bull KLT tracker (1981)
bull CAMshift and Meanshift (1998)
bull TLD (2011)
bull MedianFLow (2010)
bull Boosting (2006)
bull MIL (2009)
and KCF (2012) in opencv_contrib38150
What is here in libraries BoofCV
bull SparseFlow (KLT tracker) (1991)
bull MeanShift (1998)
bull TLD (2011)
bull KCF (2012)
39150
httpsgithubcomlessthanoptimalBoofCV
Computer vision librar
Bad news OpenCV
40150httpwwwvotchallengenet
Reference implementation
41150
But authors often publish their own implementation on githubhellip
git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit
Good news
42150
Good news Not so
43150
Good news Not so
44150
Good newsEasy to contribute to open source
Just implement some modern tracking algorithm
45150
So we need to understand how tracking works
Classic
KLT tracker
46150
The KLT Tracker
47150Slide credit Kris Kitani
KLT Tracker
slide creditTomas Svoboda
48150
Importance in Computer Vision
bull Firstly published in 1981 as an image registration method [3]
bull Improved many times most importantly by Carlo Tomasi [54]
bull Free implementaton(s) available1
bull After more than two decades a project2 at CMU dedicated to this
bull single algorithm and results published in a premium journal [1]
bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html
Image alignment
slide creditTomas Svoboda 49150
Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image
Original Lucas-Kanade algorithm I
slide creditTomas Svoboda 50150
Original Lucas-Kanade algorithm II
slide creditTomas Svoboda 51150
Original Lucas-Kanade algorithm III
slide creditTomas Svoboda 52150
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
– fast
– easily extended to image-to-image transformations with multiple parameters
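The three model-learning options listed above differ only in which template is carried to the next frame. A minimal sketch follows, with templates flattened to lists of intensities; the function name, the mode strings and the default rate are assumptions made for this illustration.

```python
def update_template(initial, previous, current, mode, rate=0.1):
    """Return the template to use in the next frame.

    initial  -- patch from the first frame
    previous -- template used in the current frame
    current  -- patch found at the tracked position in the current frame
    """
    if mode == "no_update":
        return list(initial)            # never drifts, cannot adapt
    if mode == "last_frame":
        return list(current)            # adapts fully, drifts easily
    if mode == "convex":
        # convex combination: rate = 0 gives no update, rate = 1 gives last frame
        return [(1.0 - rate) * p + rate * c for p, c in zip(previous, current)]
    raise ValueError("unknown mode: " + mode)

initial = [0.0, 10.0]
previous = [0.0, 10.0]
current = [10.0, 20.0]
print(update_template(initial, previous, current, "convex", rate=0.5))
```

The convex combination is the usual compromise: a small rate keeps most of the old model, which slows drift while still following gradual appearance change.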
What about deep learning or CNNs for tracking?
AlexNet (original): Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, 2012
CaffeNet: Jia et al., “Caffe: Convolutional Architecture for Fast Feature Embedding”, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016
Recap: CaffeNet architecture
Recent trackers development
https://github.com/foolwood/benchmark_results
CNN KCF
Discriminative Tracking (Tracking by Detection)
• t = 0: samples with labels (+1 +1 +1 −1 −1 −1) train a classifier
• t > 0: … classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
“Cross-correlation is equivalent to an element-wise product in the Fourier domain”
$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$
• where
– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
– $\times$ is the element-wise product
– ${}^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
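The theorem can be checked numerically on a short signal. The sketch below uses a naive O(n^2) DFT written with the standard library so that it has no dependencies; a real tracker would use an FFT routine instead. The function names and the index convention y[k] = sum_t x[t] w[(t + k) mod n] are assumptions chosen for this example.

```python
import cmath

def dft(signal):
    """Naive discrete Fourier transform."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]

def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
        for t in range(n)
    ]

def cyclic_cross_correlation(x, w):
    """Direct definition: y[k] = sum_t x[t] * w[(t + k) mod n]."""
    n = len(x)
    return [sum(x[t] * w[(t + k) % n] for t in range(n)) for k in range(n)]

def cross_correlation_via_dft(x, w):
    """Same result from conj(x_hat) times w_hat, element-wise."""
    y_hat = [xf.conjugate() * wf for xf, wf in zip(dft(x), dft(w))]
    return [value.real for value in idft(y_hat)]

x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]
w = [0.0, 1.0, 4.0, 1.0, 0.0, -2.0]
direct = cyclic_cross_correlation(x, w)
fast = cross_correlation_via_dft(x, w)
max_error = max(abs(a - b) for a, b in zip(direct, fast))
print(max_error)
```

The modulo in the direct version is the wrap-around mentioned in the note above: the Fourier route computes exactly this cyclic correlation, not the one with zero padding.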
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and the dual space representation
$$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$$
• For many kernels, circulant data ⇒ circulant $K$ matrix:
$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
$$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$$
Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
$$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$$
• For the Gaussian kernel it yields
$$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}\bigl(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\bigr)\right)\right)$$
(multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term)
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\bigl(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}}\bigr)$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)
Training and detection (Matlab):
    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);  % ridge regression solved in the Fourier domain
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));  % response map over all cyclic shifts
    end

    function k = kernel_correlation(x1, x2, sigma)
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));  % sum over the channel dimension
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));  % Gaussian kernel
    end
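To see the training and detection steps work end to end, here is a one-dimensional, single-channel port of the same three functions. It is a sketch under stated assumptions, not the authors' code: a naive standard-library DFT replaces `fft2`, and the test signal, the Gaussian label width, `sigma`, `lam` and the shift of 3 samples are all hypothetical values chosen for the demonstration.

```python
import cmath
import math

def dft(signal):
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]

def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]

def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x1 and every cyclic shift of x2."""
    n = len(x1)
    product = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    c = [v.real for v in idft(product)]
    energy = sum(v * v for v in x1) + sum(v * v for v in x2)
    return [math.exp(-abs(energy - 2.0 * ci) / (sigma ** 2 * n)) for ci in c]

def train(x, y, sigma, lam):
    """Kernel ridge regression in the Fourier domain: alpha_hat = y_hat / (k_hat + lam)."""
    k_hat = dft(kernel_correlation(x, x, sigma))
    return [yf / (kf + lam) for yf, kf in zip(dft(y), k_hat)]

def detect(alpha_hat, x, z, sigma):
    """Classifier response for all cyclic shifts of the new window z."""
    k_hat = dft(kernel_correlation(z, x, sigma))
    return [v.real for v in idft([a * k for a, k in zip(alpha_hat, k_hat)])]

n = 16
x = [math.sin(0.7 * t) + 0.5 * math.cos(1.9 * t + 1.0) for t in range(n)]
# Regression target: Gaussian peak at index 0, measured with cyclic distance.
y = [math.exp(-0.5 * (min(t, n - t) / 1.5) ** 2) for t in range(n)]
shift = 3
z = [x[(t - shift) % n] for t in range(n)]  # the "target" moved right by 3 samples

alpha_hat = train(x, y, sigma=0.5, lam=1e-4)
response = detect(alpha_hat, x, z, sigma=0.5)
peak = max(range(n), key=lambda t: response[t])
recovered_shift = (-peak) % n  # index convention of this sketch
print(recovered_shift)
```

With the argument order used on the slide, a target moved right by s samples puts the response peak at index (n - s) mod n, which is why the last step negates the peak index. Other implementations conjugate the other factor and read the shift directly from the peak.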
From KCF to Discriminative CF trackers
Basic
• Henriques et al. – CSK
– raw grayscale pixel values as features
• Henriques et al. – KCF
– HoG multi-channel features
Further work
• Danelljan et al. – DSST
– PCA-HoG + grayscale pixel features
– filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
– HoG, color-naming and grayscale pixel features
– quantize scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size
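Both DSST and SAMF search over a small, discrete set of scales around the current target size. The sketch below shows only that search step: the number of scales, the step between them and the per-scale response values are hypothetical, and the functions are illustrative helpers rather than code from either tracker.

```python
def scale_factors(num_scales, step):
    """Geometric set of scale factors centred on 1.0 (num_scales should be odd)."""
    half = (num_scales - 1) // 2
    return [step ** k for k in range(-half, half + 1)]

def best_scale(responses, factors):
    """Pick the scale factor whose filter response is highest."""
    best = max(range(len(factors)), key=lambda i: responses[i])
    return factors[best]

factors = scale_factors(5, 1.05)
# Hypothetical peak responses, one per scale, e.g. from a correlation filter.
responses = [0.1, 0.2, 0.3, 0.8, 0.2]
print(best_scale(responses, factors))
```

The chosen factor multiplies the current target size before the next frame. In the SAMF-style variant described above, each scaled window is first resized to the normalized template size, so a single filter can score all scales.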
Discriminative Correlation Filters Trackers
• Danelljan et al. – SRDCF
– spatial regularization in the learning process
→ limits boundary effect
→ penalizes filter coefficients depending on their spatial location
– allows use of a much larger search region
– more discriminative to background (more training data)
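One way to picture the spatial penalty is as a weight map that is small over the target and grows towards the border of the search region. The sketch below assumes a quadratic growth with distance from the centre; the constants `mu` and `eta`, the region size and the function name are illustrative assumptions, not values taken from the slides.

```python
def spatial_weights(height, width, target_h, target_w, mu=0.1, eta=3.0):
    """Penalty weight for each filter coefficient, growing with distance from the centre."""
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    return [
        [mu + eta * (((r - cy) / target_h) ** 2 + ((c - cx) / target_w) ** 2)
         for c in range(width)]
        for r in range(height)
    ]

# 9x9 search region around a 3x3 target.
weights = spatial_weights(9, 9, target_h=3, target_w=3)
print(weights[4][4], weights[0][0])
```

Coefficients far from the centre are penalized heavily, so the learned filter concentrates on the target even though the training window also covers a lot of background.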
CNN-based Correlation Trackers
• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
– features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
– for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. – MDNet, T-CNN
– CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? “The zero-order tracker”
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
[Figure: timeline of VOT submission deadlines (VOT2013, VOT2014, VOT2015, VOT2016); axis ticks at 1500 and 3000]
• ICCV2013: 51 coauthors, 14 pages
• ECCV2014: 57 coauthors, 27 pages
• ICCV2015: 128 coauthors, 24 pages
• ECCV2016: 141 coauthors, 44 pages
• + VOT-TIR papers (69 and 70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

Challenge | Perf. measures   | Dataset size   | Target box | Property  | Trackers tested
VOT2013   | ranks, A, R      | 16, s manual   | manual     | per frame | 27
VOT2014   | ranks, A, R, EFO | 25, s manual   | manual     | per frame | 38
VOT2015   | EAO, A, R, EFO   | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016   | EAO, A, R, EFO   | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
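The A and R columns are read here as accuracy (bounding-box overlap while tracking) and robustness (number of failures), which is an assumption about the abbreviations in the table. The sketch below computes both for a toy sequence; it is a simplified illustration with hypothetical boxes, and it leaves out the re-initialization that a full evaluation performs after each failure.

```python
def overlap(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over frames with non-zero overlap, and the count of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures

ground_truth = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
print(accuracy, failures)
```

Reporting the two numbers separately matters: a tracker can score high overlap on the frames it tracks while failing often, and a single averaged number would hide that.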
Class of trackers tested
• Single-object, single-camera
• Short-term
– Trackers performing without re-detection
• Causality
– Tracker is not allowed to use any future frames
• No prior knowledge about the target
– Only a single training example – BBox in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
VOT Automatic Dataset Construction Protocol: cluster + sample
Candidate pool (443 sequences):
• ALOV (315 seq.) [Smeulders et al., 2013]
• PTR (~50 seq.) [Vojir et al., 2013]
• OTB (~50 seq.) [Wu et al., 2013]
• >30 new sequences from the VOT2015 committee
Filtered out:
• Grayscale sequences
• Targets of <400 pixels
• Poorly-defined targets
• Artificially created sequences
(Examples shown: a poorly defined target; an artificially created sequence)
After filtering: 356 sequences
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Pipeline: feature encoding (11-dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences
Construction (3/3): Sampling
• Requirement:
– Diverse visual attributes
– Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
1. Segment the target (semi-automatic)
2. Automatically fit a bounding box by optimizing a cost function
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty – better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (accuracy axis) with C-COT, TCNN, SSAT and MLDF labelled]
VOT 2016: Tracking speed
• Top performers are slowest
• Plausible cause: CNN
• Real-time bound: Staple+
• Decent accuracy
• Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: tracking speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ labelled]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: Submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
– target learning (modelling, “template update”)
– integration of detection and temporal smoothness assumptions
– representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University Prague
established 1707
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Application domains of Visual Tracking
• monitoring, assistance, surveillance, control, defense
• robotics, autonomous car driving, rescue
• measurements: medicine, sport, biology, meteorology
• human-computer interaction
• augmented reality
• film production and postproduction: motion capture, editing
• management of video content: indexing, search
• action and activity recognition
• image stabilization
• mobile applications
• camera "tracking"
Slide credit: Jiri Matas
Applications, applications, applications …
Slide credit: Jiri Matas
Tracking Applications …
- Team sports: game analysis, player statistics, video annotation, …
Slide credit: Jiri Matas
Sport examples
http://cvlab.epfl.ch/~lepetit
http://www.dartfish.com/en/media-gallery/videos/index.htm
Slide credit: Patrick Perez
Model-based Tracking: People and Faces
http://cvlab.epfl.ch/research/completed/realtime_tracking
http://www.cs.brown.edu/~black/3Dtracking.html
Slide credit: Patrick Perez
Tracking is commonly used in practice…
• Tracking has been a popular research topic for decades: see CVPR, ICCV, ECCV, …
• But there is no online course devoted to tracking…
• nor much coverage in computer vision courses
• nor is it well covered in textbooks
Is it clear what tracking is?
Video credit: Helmut Grabner
Slide credit: Jiri Matas
Tracking: Formulation - Literature
Surprisingly little is said about tracking in standard textbooks.
Coverage is limited to optic flow plus some basic trackers, e.g. Lucas-Kanade.
Definition (0)
[Forsyth and Ponce, Computer Vision: A Modern Approach, 2003]
"Tracking is the problem of generating an inference about the
motion of an object given a sequence of images.
Good solutions of this problem have a variety of applications…"
Slide credit: Jiri Matas
Formulation (1): Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes
• The concept of an "object" in the F&P definition disappeared
• If an algorithm correctly established such correspondences,
  would that be a perfect tracker?
• tracking = motion estimation
Slide credit: Jiri Matas
Tracking is Motion Estimation / Optic Flow
Slide credit: Jiri Matas
Motion "pattern", camera tracking
Dense motion field
http://www.cs.cmu.edu/~saada/Projects/CrowdSegmentation
http://www.youtube.com/watch?v=ckVQrwYIjAs
Sparse motion field estimate
Slide credit: Jiri Matas
Motion field
• The motion field is the projection of the 3D scene motion into the image
Slide credit: James Hays
Slide credit: Jason Corso
Optic Flow
Standard formulation
• At every pixel, a 2D displacement is estimated between consecutive frames
Missing
• occlusion / disocclusion handling: pixels visible in one image only
  - in the standard formulation, "don't know" is not an answer
• considering the 3D nature of the world
• large displacement handling: only recently addressed (EpicFlow, 2015)
Practical issues hindering progress in optic flow
• is the ground truth ever known?
  - learning and performance evaluation are problematic (synthetic sequences?)
• requires generic regularization (smoothing)
• failure (assumption validity) is not easy to detect
In certain applications, tracking is motion estimation on a part of the image
with specific constraints: augmented reality, sports analysis
Consider the Bolt sequence
Definition (2): Tracking
Given an initial estimate of its position,
locate X in a sequence of images,
where X may mean:
• A (rectangular) region
• An "interest point" and its neighbourhood
• An "object"
This definition is adopted e.g. in a recent book by
Maggio and Cavallaro, Video Tracking, 2011
Smeulders, T-PAMI'13:
Tracking is the analysis of video sequences for the
purpose of establishing the location of the target
over a sequence of frames (time), starting from
the bounding box given in the first frame
Formulation (3): Tracking as Segmentation
J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010
Tracking as model-based segmentation
Tracking as segmentation
http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
http://www2.imm.dtu.dk/~aam/tracking
• heart
A "standard" CV tracking method output
Approximate motion estimation, approximate segmentation.
Neither good optic flow nor precise segmentation is required.
Formulation (4): Tracking
Given an initial estimate of the pose and state of X,
in all images in a sequence (in a causal manner):
1. estimate the pose and state of X
2. (optionally) update the model of X
• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
  - a priori information (appearance, shape, dynamics, …)
  - labeled data ("track this") + unlabeled data = the sequences
• Causal: for estimation at T, use information from time t ≤ T
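The two-step loop of Formulation (4) can be sketched on a toy problem. This is only an illustration under strong assumptions: the "image" is a 1D list of intensities, the pose is an integer position, and the model is a raw intensity template. The names `estimate_pose`, `update_model` and `track` are hypothetical and do not come from any library.

```python
def ssd(a, b):
    """Sum of squared differences between two equally long patches."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def estimate_pose(frame, template):
    """Step 1: exhaustive search for the offset with the smallest SSD."""
    n = len(template)
    candidates = range(len(frame) - n + 1)
    return min(candidates, key=lambda i: ssd(frame[i:i + n], template))


def update_model(template, patch, rate=0.1):
    """Step 2 (optional): convex combination of old template and new patch."""
    return [(1 - rate) * t + rate * p for t, p in zip(template, patch)]


def track(frames, init_pos, size, adapt=True):
    """Causal tracking: frame k is processed using frames 0..k only."""
    template = list(frames[0][init_pos:init_pos + size])
    poses = [init_pos]
    for frame in frames[1:]:
        pos = estimate_pose(frame, template)
        poses.append(pos)
        if adapt:
            template = update_model(template, frame[pos:pos + size])
    return poses


frames = [
    [0, 0, 9, 7, 0, 0, 0, 0],
    [0, 0, 0, 9, 7, 0, 0, 0],
    [0, 0, 0, 0, 0, 9, 7, 0],
]
print(track(frames, init_pos=2, size=2))
```

Setting `adapt=False` corresponds to a "no update" model; the `rate` parameter is an arbitrary illustrative value.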
Tracking for Black Mirror Blocking
Tracking in 6D
Tracking-Learning-Detection (TLD)
A "miracle": Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof, On-line boosting and vision, CVPR 2006
Tracking the "Invisible"
H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010
Other Tracking Problems
… multiple object tracking …
Multi-object Tracking
Other Tracking Problems
Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg
splitting and merging events …
So I want to track
Just link to some computer vision lib
from cv2 import tracker
or
#include <opencv2/core.hpp>
or
import org.opencv.core.Core;
import org.opencv.core.Mat;
What is here in libraries: OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib
What is here in libraries: BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV
Computer vision librar
Bad news: OpenCV
http://www.votchallenge.net
Reference implementation
But authors often publish their own implementation on GitHub…
git clone https://github.com/martin-danelljan/Continuous-ConvOp.git
Good news
Good news? Not so
Good news: easy to contribute to open source
Just implement some modern tracking algorithm
So we need to understand how tracking works
Classic
KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU was dedicated to this
  single algorithm, and the results were published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
Slide credit: Tomas Svoboda
The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]^T. I(x) could also be a small subwindow within an image.
Original Lucas-Kanade algorithm I
Original Lucas-Kanade algorithm II
Original Lucas-Kanade algorithm III
Original Lucas-Kanade algorithm IV
Original Lucas-Kanade summary
Slide credit: Tomas Svoboda
KLT Tracker
For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
Slide credit: Patrick Perez
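The three cases above can be checked numerically. The sketch below assumes the usual conditioning measure associated with the Tomasi and Shi reference, the smaller eigenvalue of the 2x2 matrix of summed gradient products; the function name `min_eigenvalue` and the toy 5x5 patches are illustrative only.

```python
import math


def min_eigenvalue(patch):
    """Smaller eigenvalue of the 2x2 gradient (structure) matrix of a patch.

    `patch` is a list of rows of grayscale values; gradients are central
    differences evaluated on the interior pixels only.
    """
    h, w = len(patch), len(patch[0])
    gxx = gxy = gyy = 0.0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            ix = (patch[r][c + 1] - patch[r][c - 1]) / 2.0
            iy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
    trace = gxx + gyy
    det = gxx * gyy - gxy * gxy
    disc = math.sqrt(max(trace * trace / 4.0 - det, 0.0))
    return trace / 2.0 - disc


uniform = [[5] * 5 for _ in range(5)]                      # no information
edge = [[0, 0, 10, 10, 10] for _ in range(5)]              # aperture problem
corner = [[10 if (r >= 2 and c >= 2) else 0 for c in range(5)]
          for r in range(5)]                               # well conditioned

for name, patch in (("uniform", uniform), ("edge", edge), ("corner", corner)):
    print(name, min_eigenvalue(patch))
```

Only the corner patch gives a clearly positive value; for the uniform patch and the straight edge the matrix is singular, so the displacement cannot be recovered in at least one direction.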
Aperture Problem: Example
Image from Gary Bradski slides
If we look through a small hole…
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
- Translation is usually sufficient for small fragments, but
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
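A hedged sketch of the two checks: the frame-to-frame SSD and the SSD against the initial patch. It assumes patches are flat lists of intensities and that the current patch has already been warped back into the coordinate frame of the initial patch (the affine estimation itself is not shown). `keep_tracklet` and the threshold `max_ssd` are hypothetical names and values.

```python
def ssd(a, b):
    """Sum of squared intensity differences between two patches."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def keep_tracklet(current, previous, initial, max_ssd):
    """Return False when the tracklet should be killed.

    short-term check: residual against the patch from the previous frame
    long-term check:  residual against the initial patch (guards against drift)
    """
    short_term_ok = ssd(current, previous) <= max_ssd
    long_term_ok = ssd(current, initial) <= max_ssd
    return short_term_ok and long_term_ok


# Small appearance change: both residuals stay below the threshold.
print(keep_tracklet([10, 20, 30], [10, 21, 30], [11, 20, 29], max_ssd=4))
# Slow drift: consistent with the previous frame, but far from the initial patch.
print(keep_tracklet([10, 20, 30], [10, 21, 30], [30, 20, 10], max_ssd=4))
```

The second call illustrates why the options are complementary: the short-term residual alone would not reveal the drift.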
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters
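To make the cost function concrete, here is a minimal 1D sketch of translation-only Lucas-Kanade. It minimizes the SSD between a template and a window of the image with Gauss-Newton steps (the update named on the multi-resolution slide, rather than plain gradient descent). The signals, the linear interpolation in `sample` and the function name `lucas_kanade_1d` are assumptions made for illustration.

```python
import math


def sample(signal, x):
    """Linear interpolation of a 1D signal at a real-valued position."""
    x = min(max(x, 0.0), len(signal) - 1.0)
    i = min(int(x), len(signal) - 2)
    f = x - i
    return (1 - f) * signal[i] + f * signal[i + 1]


def lucas_kanade_1d(template, image, start, p=0.0, iters=20):
    """Find p such that image[start + x + p] matches template[x] in the SSD sense."""
    for _ in range(iters):
        num = den = 0.0
        for x, t in enumerate(template):
            pos = start + x + p
            value = sample(image, pos)
            # central difference as the image gradient at the warped position
            grad = sample(image, pos + 0.5) - sample(image, pos - 0.5)
            num += grad * (t - value)
            den += grad * grad
        if den == 0.0:
            break  # uniform patch: no information
        step = num / den
        p += step
        if abs(step) < 1e-6:
            break
    return p


# A smooth bump centred at 20 in the current frame ...
image = [math.exp(-((x - 20.0) / 4.0) ** 2) for x in range(40)]
# ... and a template cut at offset 10 from a frame where it was centred at 18.5.
template = [math.exp(-((10 + x - 18.5) / 4.0) ** 2) for x in range(20)]

print(lucas_kanade_1d(template, image, start=10))
```

The true displacement in this toy setup is 1.5 samples; the estimate should land close to it, up to interpolation error.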
What about deep learning or CNN for tracking?
AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016
Recap: CaffeNet architecture
Recent trackers development
https://github.com/foolwood/benchmark_results
CNN, KCF
Discriminative Tracking (Tracking by Detection)
t = 0
samples
labels: +1 +1 +1 -1 -1 -1
Classifier
t > 0
… Classify subwindows to find target
Connection to Correlation
The Convolution Theorem
"Cross-correlation is equivalent to an
element-wise product in Fourier domain"
\[ \mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}} \]
• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - ${}^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
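The identity can be verified numerically on a short real signal. This sketch assumes the cyclic cross-correlation convention y[i] = sum_j x[j] w[(j + i) mod n], which is the one that matches the conjugate on the x term; the naive `dft` helper is written out only to keep the example dependency-free.

```python
import cmath


def dft(v, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
               for k in range(n))
           for j in range(n)]
    return [z / n for z in out] if inverse else out


def cyclic_cross_correlation(x, w):
    """Direct evaluation: y[i] = sum_j x[j] * w[(j + i) mod n]."""
    n = len(x)
    return [sum(x[j] * w[(j + i) % n] for j in range(n)) for i in range(n)]


def correlation_via_dft(x, w):
    """Same result through the Fourier domain: y_hat = conj(x_hat) * w_hat."""
    y_hat = [a.conjugate() * b for a, b in zip(dft(x), dft(w))]
    return [z.real for z in dft(y_hat, inverse=True)]


x = [1.0, 2.0, 3.0, 4.0]
w = [0.0, 1.0, 0.0, 0.0]
print(cyclic_cross_correlation(x, w))
print(correlation_via_dft(x, w))
```

Both calls should print the same values up to rounding. Replacing the naive transform with an FFT gives the O(n log n) cost that the following slides rely on.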
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and the dual space representation
\[ \boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y} \]
• For many kernels, circulant data $\Rightarrow$ circulant $K$ matrix:
\[ K = C(\mathbf{k}^{\mathbf{xx}}) \]
  where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
\[ \hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda} \]
Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
\[ k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x}) \]
• For the Gaussian kernel it yields
\[ \mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right) \]
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features
  - multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)
Training and detection (Matlab)
function alphaf = train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);
end
function y = detect(alphaf, x, z, sigma)
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));
end
function k = kernel_correlation(x1, x2, sigma)
  % sum over the channel dimension (3) in the kernel computation
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end
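For experimentation without Matlab, the three functions can be mirrored for a single-channel 1D signal in plain Python. This is a sketch, not the authors' implementation: the naive `dft` stands in for `fft2`, there is no channel sum, and the test signal, label width, `sigma` and `lam` are arbitrary assumed values.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive O(n^2) DFT; adequate for a toy signal."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n)
               for k in range(n))
           for j in range(n)]
    return [z / n for z in out] if inverse else out


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x1 and all cyclic shifts of x2."""
    n = len(x1)
    c = dft([a.conjugate() * b for a, b in zip(dft(x1), dft(x2))], inverse=True)
    energy = sum(a * a for a in x1) + sum(b * b for b in x2)
    return [math.exp(-abs(energy - 2 * ci.real) / (sigma ** 2 * n)) for ci in c]


def train(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat^xx + lambda)."""
    kf = dft(kernel_correlation(x, x, sigma))
    return [yf / (k + lam) for yf, k in zip(dft(y), kf)]


def detect(alphaf, x, z, sigma):
    """Response map over all cyclic shifts of the test signal z."""
    kf = dft(kernel_correlation(z, x, sigma))
    return [r.real for r in dft([a * k for a, k in zip(alphaf, kf)], inverse=True)]


n = 32
x = [math.sin(0.7 * i) + 0.5 * math.cos(1.9 * i + 1.0) for i in range(n)]
# Regression target: a Gaussian bump centred on shift 0 (cyclic distance).
y = [math.exp(-min(i, n - i) ** 2 / (2 * 2.0 ** 2)) for i in range(n)]
alphaf = train(x, y, sigma=0.5, lam=1e-4)

shift = 5
z = [x[(i - shift) % n] for i in range(n)]  # x moved right by `shift` samples
response = detect(alphaf, x, z, sigma=0.5)
peak = max(range(n), key=response.__getitem__)
print(peak, (-shift) % n)
```

With the argument order `kernel_correlation(z, x, sigma)` used in the listing, a target moved by +s samples produces its response peak at index (-s) mod n, so the sign convention has to be taken into account when converting the peak location into a displacement.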
From KCF to Discriminative CF trackers
Basic
• Henriques et al.: CSK
  - raw grayscale pixel values as features
• Henriques et al.: KCF
  - HoG multi-channel features
Further work
• Danelljan et al.: DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al.: SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al.: SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows a much larger search region to be used
  - more discriminative against the background (more training data)
CNN-based Correlation Trackers
• Danelljan et al.: Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al.: MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? "The zero-order tracker"
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
[Figure: VOT community timeline (axis ticks 1500 and 3000) with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked]
• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• + VOT-TIR papers (69 and 70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

          Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013   ranks, A, R        16, s. manual    manual       per frame   27
VOT2014   ranks, A, R, EFO   25, s. manual    manual       per frame   38
VOT2015   EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016   EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR
Class of trackers tested
• Single-object, single-camera
• Short-term
  - trackers performing without re-detection
• Causality
  - the tracker is not allowed to use any future frames
• No prior knowledge about the target
  - only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
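Since the object state is a bounding box, the per-frame quality of a tracker is usually scored by region overlap. The sketch below assumes that A in the table above denotes the average overlap over tracked frames and R a count of failures (frames with zero overlap); the re-initialization protocol of the VOT toolkit is deliberately left out, and the function names are illustrative.

```python
def overlap(a, b):
    """Intersection over union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def accuracy_and_failures(predicted, ground_truth):
    """Average overlap on frames where the target was hit, plus the miss count."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


ground_truth = [(0, 0, 2, 2), (1, 1, 2, 2), (2, 2, 2, 2)]
predicted = [(0, 0, 2, 2), (0, 0, 2, 2), (10, 10, 2, 2)]
print(accuracy_and_failures(predicted, ground_truth))
```

In this toy run the first frame overlaps perfectly, the second only partially, and the third is a complete miss that counts as one failure.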
Construction (1/3): Sequence candidates
• Sources: ALOV (315 seq.) [Smeulders et al. 2013] + PTR (~50 seq.) [Vojir et al. 2013] + OTB (~50 seq.) [Wu et al. 2013] + >30 new sequences from the VOT2015 committee = 443 sequences
• Filtered out:
  - grayscale sequences
  - targets of <400 pixels
  - poorly defined targets
  - artificially created sequences
  [Figure: example of a poorly defined target; example of an artificially created sequence]
• Result: 356 sequences
• VOT Automatic Dataset Construction Protocol: cluster + sample
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Pipeline: feature encoding (11-dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences
Construction (3/3): Sampling
• Requirement:
  - diverse visual attributes
  - challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that samples challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
[Figures: overlap curves and AR-raw plot (vertical axis: accuracy); ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple]
VOT 2016: Tracking speed
• Top performers are the slowest
  - plausible cause: CNN
• Real-time bound: Staple+
  - decent accuracy
  - decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: tracking speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague
established 1707
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Applications applications applications hellip
5150Slide credit Jiri Matas
Tracking Applications hellip
ndash Team sports game analysis player statistics video annotation hellip
6150Slide credit Jiri Matas
Sport examples
httpcvlabepflch~lepetithttpwwwdartfishcomenmedia-galleryvideosindexhtm
Slide Credit Patrick Perez 7150
Model-based Tracking People and Faces
httpcvlabepflchresearchcompletedrealtime_tracking httpwwwcsbrownedu~black3Dtrackinghtml
Slide Credit Patrick Perez 8150
Tracking is commonly used in practicehellip
9150
bull Tracking is popular research topic for decades
see CVPR ICCV ECCV hellip
bull But there is no online course devoted to trackinghellip
bull nor big coverage in computer vision courses
bull nor it is well covered in textbooks
Is it clear what tracking is
video credit
Helmut
Grabner
10150Slide credit Jiri Matas
Tracking Formulation - Literature
Surprisingly little is said about tracking in standard textbooks
Limited to optic flow plus some basic trackers eg Lucas-Kanade
Definition (0)
[Forsyth and Ponce Computer Vision A modern approach 2003]
ldquoTracking is the problem of generating an inference about the
motion of an object given a sequence of images
Good solutions of this problem have a variety of applicationshelliprdquo
11150Slide credit Jiri Matas
Formulation (1) Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes
bull The concept of an ldquoobjectrdquo in FampP definition disappeared
bull If an algorithm correctly established such correspondences
would that be a perfect tracker
bull tracking = motion estimation
12150Slide credit Jiri Matas
Tracking is Motion Estimation Optic Flow
13150Slide credit Jiri Matas
Tracking is Motion Estimation Optic Flow
14150
Tracking is Motion Estimation Optic Flow
Motion ldquopatternrdquo Camera tracking
Dense motion field
httpwwwcscmuedu~saadaProjectsCrowdSegmentation
httpwwwyoutubecomwatchv=ckVQrwYIjAs
Sparse motion field estimate
15150Slide credit Jiri Matas
Tracking is Motion Estimation Optic Flow
16150
Motion field
bull The motion field is the projection of the 3D scene
motion into the image
Slide credit James Hays
Tracking is Motion Estimation Optic Flow
17150Slide credit Jason Corso
Optic Flow
Standard formulation
bull At every pixel 2D displacement is estimated between consecutive frames
Missing
bull occlusion ndash disocclusion handling pixels visible in one image only
- in the standard formulation ldquodonrsquot knowrdquo is not an answer
bull considering the 3D nature of the world
bull large displacement handling - only recently addressed (EpicFlow 2015)
Practical issues hindering progress in optic flow
bull is the ground truth ever known
- learning and performance evaluation problematic (synthetic sequences )
bull requires generic regularization (smoothing)
bull failure (assumption validity) not easy to detect
In certain applications tracking is motion estimation on a part of the image
with specific constraints augmented reality sports analysis 18150
Formulation (1) Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes
bull The concept of an ldquoobjectrdquo in FampP definition disappeared
bull If an algorithm correctly established such correspondences
would that be a perfect tracker
bull tracking = motion estimation
Consider the Bolt sequence
19150
Definition (2) Tracking
Given an initial estimate of its position
locate X in a sequence of images
Where X may mean
bull A (rectangular) region
bull An ldquointerest pointrdquo and its neighbourhood
bull An ldquoobjectrdquo
This definition is adopted eg in a recent book by
Maggio and Cavallaro Video Tracking 2011
Smeulders T-PAMI13
Tracking is the analysis of video sequences for the
purpose of establishing the location of the target
over a sequence of frames (time) starting from
the bounding box given in the first frame
20150
Formulation (3) Tracking as Segmentation
J Fan et al Closed-Loop Adaptation for Robust Tracking ECCV 2010
21150
Tracking as model-based segmentation
22150
Tracking as segmentation
httpvisionucsdedu~kbransonresearchcvpr2005html
httpwww2immdtudk~aamtracking
bull heart
23150
A ldquostandardrdquo CV tracking method output
24150
Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required
Formulation (4) Tracking
Given an initial estimate of the pose and state of X
In all images in a sequence (in a causal manner)
1 estimate the pose and state of X
2 (optionally) update the model of X
bull Pose any geometric parameter (position scale hellip)
bull State appearance shapesegmentation visibility articulations
bull Model update essentially a semi-supervised learning problem
ndash a priori information (appearance shape dynamics hellip)
ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences
bull Causal for estimation at T use information from time t middot T
25150
Tracking for Black Mirror Blocking
26150
Tracking in 6D
27150
Tracking-Learning-Detection (TLD)
28150
A ldquomiraclerdquo Tracking a Transparent Object
video credit
Helmut
Grabner
H Grabner H Bischof On-line boosting and vision CVPR 2006
29150
Tracking the ldquoInvisiblerdquo
H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010
30150
video
Other Tracking Problems
helliphellip multiple object tracking hellip
32150
Multi-object Tracking
Other Tracking Problems
Cell division
httpwwwyoutubecomwatchv=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
httpwwwyoutubecomwatchv=YFKA647w4Jg
splitting and merging events hellip
34150
So I want to track
35150
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
What is here in libraries OpenCV
bull KLT tracker (1981)
bull CAMshift and Meanshift (1998)
bull TLD (2011)
bull MedianFLow (2010)
bull Boosting (2006)
bull MIL (2009)
and KCF (2012) in opencv_contrib38150
What is here in libraries BoofCV
bull SparseFlow (KLT tracker) (1991)
bull MeanShift (1998)
bull TLD (2011)
bull KCF (2012)
39150
httpsgithubcomlessthanoptimalBoofCV
Computer vision librar
Bad news OpenCV
40150httpwwwvotchallengenet
Reference implementation
41150
But authors often publish their own implementation on githubhellip
git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit
Good news
42150
Good news Not so
43150
Good news Not so
44150
Good newsEasy to contribute to open source
Just implement some modern tracking algorithm
45150
So we need to understand how tracking works
Classic
KLT tracker
46150
The KLT Tracker
47150Slide credit Kris Kitani
KLT Tracker
slide creditTomas Svoboda
48150
Importance in Computer Vision
bull Firstly published in 1981 as an image registration method [3]
bull Improved many times most importantly by Carlo Tomasi [54]
bull Free implementaton(s) available1
bull After more than two decades a project2 at CMU dedicated to this
bull single algorithm and results published in a premium journal [1]
bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html
Image alignment
slide creditTomas Svoboda 49150
Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image
Original Lucas-Kanade algorithm I
slide creditTomas Svoboda 50150
Original Lucas-Kanade algorithm II
slide creditTomas Svoboda 51150
Original Lucas-Kanade algorithm III
slide creditTomas Svoboda 52150
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
bull cost function sum of squared intensity differences
between template and window
bull optimization technique gradient descent
bull model learning no update last frame convex
combination
bull attractive properties
ndashfast
ndasheasily extended to image-to-image transformations with
multiple parameters
59150
What about deep
learning or
CNN for tracking
60150
61150
AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo
Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016
Recap CaffeNet architecture
63
Recent trackers development
74150httpsgithubcomfoolwoodbenchmark_results
CNN KCF
Discriminative Tracking (T by Detection)
t=0
samples
labels+1 +1 +1 minus1 minus1 minus1
Classifier
tgt0
hellipClassify subwindows to find target
75
Connection to Correlation
The Convolution Theorem
ldquoCross-correlation is equivalent to an
element-wise product in Fourier domainrdquo
bull where
ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858
(likewise for ො119857 and ෝ119856)
ndashtimes is element-wise product
ndash lowast is complex-conjugate (ie negate imaginary part)
119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺
bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)
76
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied, e.g., to Kernel Ridge Regression with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and a dual-space representation
• For many kernels: circulant data ⇒ circulant $K$ matrix
  $K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
  $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y} \;\Rightarrow\; \hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$
• Fast solution in $\mathcal{O}(n \log n)$; typical kernel algorithms are $\mathcal{O}(n^2)$ or higher
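A small numerical check may make the diagonalization step more concrete. The sketch below assumes a linear kernel (so that $\mathbf{k}^{\mathbf{xx}}$ is the plain cyclic auto-correlation of x) and 1-D data. It solves for $\boldsymbol{\alpha}$ in the Fourier domain and then verifies the result against the explicit system $(K + \lambda I)\boldsymbol{\alpha} = \mathbf{y}$. The naive DFT and all function names are illustrative assumptions, not part of any KCF implementation.

```python
import cmath
import math

def dft(v, inverse=False):
    """Naive O(n^2) discrete Fourier transform (the inverse includes the 1/n factor)."""
    n = len(v)
    sign = 1.0 if inverse else -1.0
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [c / n for c in out] if inverse else out

def linear_autocorrelation(x):
    """k[m] = <x, x cyclically shifted by m>: the first row of the circulant kernel matrix."""
    n = len(x)
    return [sum(x[j] * x[(j + m) % n] for j in range(n)) for m in range(n)]

def train(x, y, lam):
    """alpha_hat = y_hat / (k_hat + lambda), transformed back to the signal domain."""
    k_hat = dft(linear_autocorrelation(x))
    y_hat = dft(y)
    alpha_hat = [yh / (kh + lam) for yh, kh in zip(y_hat, k_hat)]
    return [a.real for a in dft(alpha_hat, inverse=True)]

def residual(x, y, lam, alpha):
    """Largest entry of |(K + lambda I) alpha - y| with the circulant K built explicitly."""
    n = len(x)
    k = linear_autocorrelation(x)
    worst = 0.0
    for i in range(n):
        row = sum(k[(j - i) % n] * alpha[j] for j in range(n)) + lam * alpha[i]
        worst = max(worst, abs(row - y[i]))
    return worst

n = 16
x = [math.sin(0.7 * i) + 0.1 * i for i in range(n)]                   # training signal
y = [math.exp(-0.5 * (min(i, n - i) / 1.5) ** 2) for i in range(n)]   # regression target
alpha = train(x, y, lam=0.1)
print("max residual:", residual(x, y, 0.1, alpha))
```

The explicit check costs O(n^2) per solve candidate, while the Fourier route needs only element-wise divisions once the transforms are available, which is where the O(n log n) figure comes from when an FFT is used.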
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
  $k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$
• For the Gaussian kernel it yields
  $\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$
  Multiple channels can be concatenated into the vector x and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term
• Evaluation on subwindows of image z with classifier $\boldsymbol{\alpha}$ and model x:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model x by linear interpolation from the location of the maximum response f(z)
• The kernel allows integration of more complex and multi-channel features
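The training, evaluation and update steps listed above can be sketched end to end on a 1-D signal. This is a simplified illustration, not the authors' implementation: it uses a naive DFT instead of an FFT, a single feature channel, and it divides the kernel exponent by the number of elements (a normalization taken from the Matlab snippet later in the deck, not from the formula). The names `gaussian_correlation`, `train`, `detect` and `interpolate` are chosen for this sketch.

```python
import cmath
import math

def dft(v, inverse=False):
    """Naive O(n^2) discrete Fourier transform (the inverse includes the 1/n factor)."""
    n = len(v)
    sign = 1.0 if inverse else -1.0
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [c / n for c in out] if inverse else out

def gaussian_correlation(x1, x2, sigma):
    """Gaussian kernel evaluated between x1 and every cyclic shift of x2."""
    n = len(x1)
    cross_hat = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    c = [v.real for v in dft(cross_hat, inverse=True)]
    norms = sum(v * v for v in x1) + sum(v * v for v in x2)
    # Assumption: exponent normalized by n, as in the Matlab code (numel(d)).
    return [math.exp(-abs(norms - 2.0 * ci) / (sigma ** 2 * n)) for ci in c]

def train(x, y, sigma, lam):
    """Return alpha in the Fourier domain: y_hat / (k_hat + lambda)."""
    k_hat = dft(gaussian_correlation(x, x, sigma))
    return [yh / (kh + lam) for yh, kh in zip(dft(y), k_hat)]

def detect(alpha_hat, x, z, sigma):
    """Response of the classifier on all cyclic shifts of the new patch z."""
    k_hat = dft(gaussian_correlation(z, x, sigma))
    resp_hat = [a * k for a, k in zip(alpha_hat, k_hat)]
    return [v.real for v in dft(resp_hat, inverse=True)]

def interpolate(old, new, rate):
    """Linear interpolation used to update the model and the classifier."""
    return [(1.0 - rate) * o + rate * v for o, v in zip(old, new)]

n, shift = 32, 5
x = [math.exp(-0.5 * ((i - 10) / 2.0) ** 2) for i in range(n)]        # model patch
y = [math.exp(-0.5 * (min(i, n - i) / 2.0) ** 2) for i in range(n)]   # target peaked at 0
z = [x[(i - shift) % n] for i in range(n)]                            # patch moved by `shift`

alpha_hat = train(x, y, sigma=0.5, lam=1e-4)
response = detect(alpha_hat, x, z, sigma=0.5)
peak = max(range(n), key=lambda i: response[i])
# With the argument order used in detect(), the peak sits at (-shift) mod n.
estimated_shift = (-peak) % n

# Model update after detection (the classifier would be retrained on z and blended the same way).
model_x = interpolate(x, z, rate=0.25)
alpha_hat = interpolate(alpha_hat, train(z, y, sigma=0.5, lam=1e-4), rate=0.25)
print("estimated shift:", estimated_shift)
```

The sign of the recovered displacement depends on the DFT and argument-order conventions, so it is worth checking on a known shift, as done here, before relying on it.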
Kernelized Correlation Filters: KCF Tracker
• Very few hyperparameters
• Code fits on one slide of the presentation
• Uses HoG features (32 channels)
• ~300 FPS
• Open source (Matlab/Python/Java/C)

Training and detection (Matlab):

    function alphaf = train(x, y, sigma, lambda)
      % Kernel auto-correlation of the training patch, then ridge regression in the Fourier domain.
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      % Response map over all cyclic shifts of the new patch z.
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      % Sum over the channel dimension (3) in the kernel computation.
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
From KCF to Discriminative CF trackers
Basic
• Henriques et al.: CSK
  - raw grayscale pixel values as features
• Henriques et al.: KCF
  - HoG multi-channel features
Further work
• Danelljan et al.: DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al.: SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al.: SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows a much larger search region to be used
  - more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al.: Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al.: MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? "The zero-order tracker"
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
[Figure: timeline of the VOT challenges with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked]
• ICCV2013: 51 coauthors, 14 pages
• ECCV2014: 57 coauthors, 27 pages
• ICCV2015: 128 coauthors, 24 pages
• ECCV2016: 141 coauthors, 44 pages
• Plus two VOT-TIR papers (69 and 70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

Challenge | Perf. measures   | Dataset size   | Target box | Property  | Trackers tested
VOT2013   | ranks, A, R      | 16, s. manual  | manual     | per frame | 27
VOT2014   | ranks, A, R, EFO | 25, s. manual  | manual     | per frame | 38
VOT2015   | EAO, A, R, EFO   | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016   | EAO, A, R, EFO   | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
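For readers unfamiliar with the A and R columns: in the VOT methodology, accuracy (A) is based on the overlap between the predicted and ground-truth regions, and robustness (R) on the number of tracking failures. The sketch below is a simplified illustration under stated assumptions: boxes are axis-aligned `(x, y, w, h)` tuples, a failure is a frame with zero overlap, and the re-initialization protocol of the real VOT toolkit is left out. The function names are hypothetical.

```python
def overlap(a, b):
    """Intersection over union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Average overlap over frames where the target is hit, and the count of missed frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures

ground_truth = [(0, 0, 2, 2), (0, 0, 2, 2), (0, 0, 2, 2)]
predicted = [(0, 0, 2, 2), (1, 1, 2, 2), (5, 5, 2, 2)]
acc, fails = accuracy_and_failures(predicted, ground_truth)
print(acc, fails)
```

Keeping the two numbers separate matters because a tracker can score high overlap on the frames it tracks while failing often, or the reverse.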
Class of trackers tested
• Single-object, single-camera
• Short-term
  - trackers performing without re-detection
• Causality
  - the tracker is not allowed to use any future frames
• No prior knowledge about the target
  - only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
• Candidate pool: ALOV (315 seq) [Smeulders et al. 2013] + PTR (~50 seq) [Vojir et al. 2013] + OTB (~50 seq) [Wu et al. 2013] + >30 new sequences from the VOT2015 committee = 443 sequences
• Filtered out:
  - grayscale sequences
  - targets smaller than 400 pixels
  - poorly defined targets
  - artificially created sequences
[Figure: example of a poorly defined target; example of an artificially created sequence]
• 356 sequences remain and enter the VOT Automatic Dataset Construction Protocol (cluster + sample)
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Feature encoding: 11-dimensional vector per sequence
• Cluster similar sequences with Affinity Propagation [Frey, Dueck 2007]
• Cluster into K = 28 groups (automatic selection of K)
Construction (3/3): Sampling
• Requirements:
  - diverse visual attributes
  - challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• Bounding boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
• Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (accuracy axis) with C-COT, TCNN, SSAT and MLDF marked]
VOT 2016: Tracking speed
• Top performers are the slowest
  - plausible cause: CNN
• Real-time bound: Staple+
  - decent accuracy
  - decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers (code/binaries)
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics, Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)
• Area of expertise: Computer Vision, Image Processing, Pattern Recognition
• People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
• Impact & Excellence: significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please?
Sport examples
httpcvlabepflch~lepetithttpwwwdartfishcomenmedia-galleryvideosindexhtm
Slide Credit Patrick Perez 7150
Model-based Tracking People and Faces
httpcvlabepflchresearchcompletedrealtime_tracking httpwwwcsbrownedu~black3Dtrackinghtml
Slide Credit Patrick Perez 8150
Tracking is commonly used in practicehellip
9150
bull Tracking is popular research topic for decades
see CVPR ICCV ECCV hellip
bull But there is no online course devoted to trackinghellip
bull nor big coverage in computer vision courses
bull nor it is well covered in textbooks
Is it clear what tracking is
video credit
Helmut
Grabner
10150Slide credit Jiri Matas
Tracking Formulation - Literature
Surprisingly little is said about tracking in standard textbooks
Limited to optic flow plus some basic trackers eg Lucas-Kanade
Definition (0)
[Forsyth and Ponce Computer Vision A modern approach 2003]
ldquoTracking is the problem of generating an inference about the
motion of an object given a sequence of images
Good solutions of this problem have a variety of applicationshelliprdquo
11150Slide credit Jiri Matas
Formulation (1) Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes
bull The concept of an ldquoobjectrdquo in FampP definition disappeared
bull If an algorithm correctly established such correspondences
would that be a perfect tracker
bull tracking = motion estimation
12150Slide credit Jiri Matas
Tracking is Motion Estimation Optic Flow
13150Slide credit Jiri Matas
Tracking is Motion Estimation Optic Flow
14150
Tracking is Motion Estimation Optic Flow
Motion ldquopatternrdquo Camera tracking
Dense motion field
httpwwwcscmuedu~saadaProjectsCrowdSegmentation
httpwwwyoutubecomwatchv=ckVQrwYIjAs
Sparse motion field estimate
15150Slide credit Jiri Matas
Tracking is Motion Estimation Optic Flow
16150
Motion field
bull The motion field is the projection of the 3D scene
motion into the image
Slide credit James Hays
Tracking is Motion Estimation Optic Flow
17150Slide credit Jason Corso
Optic Flow
Standard formulation:
• At every pixel, a 2D displacement is estimated between consecutive frames
Missing:
• occlusion – disocclusion handling: pixels visible in one image only
  - in the standard formulation, "don't know" is not an answer
• considering the 3D nature of the world
• large displacement handling: only recently addressed (EpicFlow, 2015)
Practical issues hindering progress in optic flow:
• is the ground truth ever known?
  - learning and performance evaluation are problematic (synthetic sequences?)
• requires generic regularization (smoothing)
• failure (assumption validity) is not easy to detect
In certain applications, tracking is motion estimation on a part of the image
with specific constraints: augmented reality, sports analysis.
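The standard formulation is commonly written through the brightness constancy assumption. The slide does not spell it out, so the notation below (displacement (u, v), partial derivatives I_x, I_y, I_t) is the usual textbook one and should be read as an assumption rather than the lecture's own:

```latex
% Brightness constancy between consecutive frames t and t+1
I(x + u,\; y + v,\; t + 1) = I(x, y, t)
% First-order linearization, valid only for small displacements
I_x u + I_y v + I_t = 0
```

This gives one equation per pixel for two unknowns (u, v), which is one way to see why generic regularization (smoothing) is required and why large displacements need separate handling.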
Formulation (1): Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence.
Notes:
• The concept of an "object" in the F&P definition has disappeared
• If an algorithm correctly established such correspondences,
would that be a perfect tracker?
• tracking = motion estimation
Consider the Bolt sequence
Definition (2): Tracking
Given an initial estimate of its position,
locate X in a sequence of images,
where X may mean:
• A (rectangular) region
• An "interest point" and its neighbourhood
• An "object"
This definition is adopted e.g. in a recent book by
Maggio and Cavallaro, Video Tracking, 2011.
Smeulders et al., T-PAMI 2013:
"Tracking is the analysis of video sequences for the
purpose of establishing the location of the target
over a sequence of frames (time) starting from
the bounding box given in the first frame."
Formulation (3): Tracking as Segmentation
J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010
Tracking as model-based segmentation
Tracking as segmentation
http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
http://www2.imm.dtu.dk/~aam/tracking
• heart
A "standard" CV tracking method output
Approximate motion estimation, approximate segmentation.
Neither good optic flow nor precise segmentation is required.
Formulation (4): Tracking
Given an initial estimate of the pose and state of X,
in all images in a sequence (in a causal manner):
1. estimate the pose and state of X
2. (optionally) update the model of X
• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
– a priori information (appearance, shape, dynamics, …)
– labeled data ("track this") + unlabeled data = the sequences
• Causal: for estimation at T, use information from time t ≤ T
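Formulation (4) can be sketched as a generic loop. This is a minimal illustration only: the function names `track`, `estimate`, `update` and the toy 1-D matcher `nearest_match` are hypothetical and not taken from the lecture or from any library.

```python
from typing import Callable, Iterable, List, Optional, TypeVar

Frame = TypeVar("Frame")
Pose = TypeVar("Pose")
Model = TypeVar("Model")


def track(
    frames: Iterable[Frame],
    init_pose: Pose,
    init_model: Model,
    estimate: Callable[[Frame, Pose, Model], Pose],
    update: Optional[Callable[[Frame, Pose, Model], Model]] = None,
) -> List[Pose]:
    """Causal tracking loop: frame T is processed using only frames t <= T."""
    pose, model = init_pose, init_model
    trajectory = []
    for frame in frames:
        pose = estimate(frame, pose, model)      # step 1: pose and state of X
        if update is not None:
            model = update(frame, pose, model)   # step 2: optional model update
        trajectory.append(pose)
    return trajectory


# Toy 1-D example (an assumption for illustration): a "frame" is a list of
# intensities, the pose is an index, and the model is the target intensity.
def nearest_match(frame, pose, model, radius=2):
    lo, hi = max(0, pose - radius), min(len(frame), pose + radius + 1)
    return min(range(lo, hi), key=lambda i: abs(frame[i] - model))


frames = [
    [0, 9, 0, 0, 0, 0],
    [0, 0, 9, 0, 0, 0],
    [0, 0, 0, 0, 9, 0],
]
print(track(frames, init_pose=1, init_model=9, estimate=nearest_match))
```

Passing no `update` function corresponds to the "no update" model-learning choice; a real tracker would plug in its own estimator and update rule.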
Tracking for Black Mirror Blocking
Tracking in 6D
Tracking-Learning-Detection (TLD)
A "miracle": Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof, On-line boosting and vision, CVPR 2006
Tracking the "Invisible"
H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010
Other Tracking Problems
… multiple object tracking …
Multi-object Tracking
Other Tracking Problems
Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg
splitting and merging events …
So I want to track
Just link to some computer vision lib
import cv2
or
#include <opencv2/core.hpp>
or
import org.opencv.core.Core;
import org.opencv.core.Mat;
What is there in libraries? OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib
What is there in libraries? BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV
Computer vision library
Bad news: OpenCV
http://www.votchallenge.net
Reference implementation
But authors often publish their own implementation on GitHub…
git clone https://github.com/martin-danelljan/Continuous-ConvOp.git
Good news
Good news? Not so
Good news: easy to contribute to open source
Just implement some modern tracking algorithm
So we need to understand how tracking works
Classic: KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU was dedicated to this
single algorithm, and the results were published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
Slide credit: Tomas Svoboda
The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.
Original Lucas-Kanade algorithm I-IV
Slide credit: Tomas Svoboda
Original Lucas-Kanade summary
Slide credit: Tomas Svoboda
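A minimal sketch of the alignment idea for the simplest warp, a pure translation p = (dx, dy), solved by repeated Gauss-Newton steps. It assumes NumPy and grayscale images of equal size; the helper names `bilinear`, `lk_translation` and `blob` are hypothetical, and the code is an illustration rather than the implementation used in the lecture.

```python
import numpy as np


def bilinear(img, x, y):
    """Sample img at real-valued coordinates (x = column, y = row), clamped to the border."""
    h, w = img.shape
    x = np.clip(x, 0, w - 1)
    y = np.clip(y, 0, h - 1)
    x0 = np.minimum(np.floor(x).astype(int), w - 2)
    y0 = np.minimum(np.floor(y).astype(int), h - 2)
    ax, ay = x - x0, y - y0
    return ((1 - ax) * (1 - ay) * img[y0, x0] + ax * (1 - ay) * img[y0, x0 + 1]
            + (1 - ax) * ay * img[y0 + 1, x0] + ax * ay * img[y0 + 1, x0 + 1])


def lk_translation(template, image, n_iters=30, tol=1e-4):
    """Estimate p = (dx, dy) such that image(x + p) is close to template(x)."""
    T = template.astype(float)
    I = image.astype(float)
    h, w = T.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(float)
    gy, gx = np.gradient(I)                      # derivatives along rows (y) and columns (x)
    p = np.zeros(2)
    for _ in range(n_iters):
        # Warp the image and its gradients by the current translation.
        Iw = bilinear(I, xs + p[0], ys + p[1])
        J = np.stack([bilinear(gx, xs + p[0], ys + p[1]).ravel(),
                      bilinear(gy, xs + p[0], ys + p[1]).ravel()], axis=1)
        err = (T - Iw).ravel()                   # residual whose squared sum is the SSD cost
        dp, *_ = np.linalg.lstsq(J, err, rcond=None)   # least-squares parameter update
        p += dp
        if np.linalg.norm(dp) < tol:
            break
    return p


def blob(cx, cy, size=40, sigma=5.0):
    """Synthetic test image: a Gaussian blob centred at (cx, cy)."""
    ys, xs = np.mgrid[0:size, 0:size].astype(float)
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))


p = lk_translation(blob(20, 20), blob(21.5, 19.0))
print(p)   # should be close to the true shift (1.5, -1.0)
```

The linearization only holds for small shifts, which is the motivation for the multi-resolution variant.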
KLT Tracker
For good conditioning, the patch must be
textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional
information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
Slide credit: Patrick Perez
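The three cases can be checked numerically with the smaller eigenvalue of the 2x2 gradient matrix of the patch, which is the usual way of measuring this conditioning. The sketch below assumes NumPy; the function name `min_eigenvalue` and the synthetic 9x9 patches are illustrative choices, not part of the lecture.

```python
import numpy as np


def min_eigenvalue(patch):
    """Smaller eigenvalue of the 2x2 gradient (structure) matrix of a patch."""
    gy, gx = np.gradient(patch.astype(float))
    G = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return np.linalg.eigvalsh(G)[0]   # eigvalsh returns eigenvalues in ascending order


uniform = np.ones((9, 9))                                   # no texture at all
edge = np.tile((np.arange(9) >= 4).astype(float), (9, 1))   # vertical edge only
corner = np.zeros((9, 9))
corner[4:, 4:] = 1.0                                        # two edge directions meet

for name, patch in [("uniform", uniform), ("edge", edge), ("corner", corner)]:
    print(name, min_eigenvalue(patch))
```

A value near zero for the uniform and edge patches means the displacement cannot be recovered in at least one direction (the aperture problem); only the corner patch constrains both.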
Aperture Problem Example
Image from Gary Bradski's slides
If we look through a small hole…
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
– Arbitrary displacement
• Multi-resolution approach: Gauss-Newton-like approximation down the image
pyramid
Slide credit: James Hays
Monitoring quality
– Translation is usually sufficient for small fragments, but:
• Perspective transforms and occlusions cause drift and loss
– Two complementary options:
• Kill tracklets when the minimum SSD is too large
• Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
Characteristics of KLT
• cost function: sum of squared intensity differences
between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex
combination
• attractive properties:
– fast
– easily extended to image-to-image transformations with
multiple parameters
What about deep learning or CNNs for tracking?
AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016
Recap: CaffeNet architecture
Recent trackers development
https://github.com/foolwood/benchmark_results
CNN, KCF
Discriminative Tracking (Tracking by Detection)
t = 0: samples with labels +1 +1 +1 −1 −1 −1 → Classifier
t > 0: … classify subwindows to find the target
Connection to Correlation
The Convolution Theorem:
"Cross-correlation is equivalent to an
element-wise product in the Fourier domain"
$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$
• where
– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$
(likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
– $\times$ is the element-wise product
– $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
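The statement can be checked numerically in a few lines. This is a sketch assuming NumPy and real-valued 1-D signals; the direct sum uses the cyclic convention y[k] = Σ_n x[n] w[(n + k) mod N], which is the one that matches conjugating the transform of x.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w = rng.standard_normal(8)

# Direct cyclic cross-correlation: y[k] = sum_n x[n] * w[(n + k) mod N]
y_direct = np.array([np.dot(x, np.roll(w, -k)) for k in range(len(x))])

# Fourier domain: element-wise product of conj(x_hat) and w_hat
y_fft = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))

print(np.allclose(y_direct, y_fft))   # prints True
```

The direct sum costs O(n²) operations while the FFT route costs O(n log n), which is the source of the speed of correlation-filter trackers.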
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows standard
operations to be replaced with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression
with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and the dual space representation:
$$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$$
• For many kernels, circulant data ⇒ circulant $K$ matrix:
$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
$$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$$
⇒ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are
$\mathcal{O}(n^2)$ or higher.
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
$$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$$
• For the Gaussian kernel it yields
$$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$$
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the
location of the maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel
features: multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)
Training and detection (Matlab):
    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end
    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end
    function k = kernel_correlation(x1, x2, sigma)
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));   % sum over the channel dimension
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
Sum over the channel dimension (the third argument of sum) in the kernel computation
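For readers without Matlab, the same three functions can be sketched in Python with NumPy. This is a single-channel port under stated assumptions, not the authors' released code: the helper `gaussian_labels`, the random test patch, and the parameter values sigma = 0.5 and lam = 1e-4 are illustrative choices.

```python
import numpy as np


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x1 and every cyclic shift of x2 (single channel)."""
    c = np.real(np.fft.ifft2(np.conj(np.fft.fft2(x1)) * np.fft.fft2(x2)))
    d = np.sum(x1 ** 2) + np.sum(x2 ** 2) - 2.0 * c
    return np.exp(-np.abs(d) / (sigma ** 2 * d.size))


def train(x, y, sigma, lam):
    """Kernel ridge regression solved element-wise in the Fourier domain."""
    k = kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)


def detect(alphaf, x, z, sigma):
    """Response map over all cyclic shifts of the new patch z."""
    k = kernel_correlation(z, x, sigma)
    return np.real(np.fft.ifft2(alphaf * np.fft.fft2(k)))


def gaussian_labels(shape, s=2.0):
    """Regression target: Gaussian peak at index (0, 0), wrapping cyclically."""
    rows = np.fft.fftfreq(shape[0]) * shape[0]   # 0, 1, ..., -2, -1
    cols = np.fft.fftfreq(shape[1]) * shape[1]
    return np.exp(-(rows[:, None] ** 2 + cols[None, :] ** 2) / (2 * s ** 2))


rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))                # stand-in for a feature patch
alphaf = train(x, gaussian_labels(x.shape), sigma=0.5, lam=1e-4)

z = np.roll(x, (3, 5), axis=(0, 1))              # target moved by 3 rows, 5 columns
response = detect(alphaf, x, z, sigma=0.5)
peak = np.unravel_index(np.argmax(response), response.shape)
print(peak)
```

With the conjugate placed on the first argument, as on the slide, the response peaks at the cyclic shift that maps z back onto x, that is (-3, -5) modulo the patch size; a tracker converts that index into a displacement of the bounding box.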
From KCF to Discriminative CF trackers
Basic
• Henriques et al. – CSK
– raw grayscale pixel values as features
• Henriques et al. – KCF
– HoG multi-channel features
Further work
• Danelljan et al. – DSST
– PCA-HoG + grayscale pixel features
– filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
– HoG, color-naming and grayscale pixel features
– quantize the scale space and normalize each scale to one size by bilinear
interpolation → only one filter on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. – SRDCF
– spatial regularization in the learning process
→ limits the boundary effect
→ penalizes filter coefficients depending on their spatial location
– allows the use of a much larger search region
– more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT
2016)
• Ma et al.
– features: VGG-Net pretrained on the ImageNet dataset, extracted from
the third, fourth and fifth convolution layers
– for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. – MDNet, T-CNN
– CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? "The zero-order tracker"
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
[Figure: timeline marking the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines (vertical axis ticks at 1500 and 3000), annotated with the results papers:]
• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages + VOT-TIR paper (69 coauthors)
• ECCV 2016: 141 coauthors, 44 pages + VOT-TIR paper (70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
Challenge  Perf. measures    Dataset size    Target box  Property   Trackers tested
VOT2013    ranks, A, R       16, s. manual   manual      per frame  27
VOT2014    ranks, A, R, EFO  25, s. manual   manual      per frame  38
VOT2015    EAO, A, R, EFO    60, fully auto  manual      per frame  62 VOT, 24 VOT-TIR
VOT2016    EAO, A, R, EFO    60, fully auto  auto        per frame  70 VOT, 24 VOT-TIR
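The A and R measures stand for accuracy and robustness. In VOT, accuracy is based on the overlap between the predicted and ground-truth bounding boxes, and robustness on the number of tracking failures. The sketch below is a simplified illustration in plain Python: the (x, y, width, height) box format and the function names are assumptions, and the re-initialization after failure that the VOT protocol performs is deliberately left out.

```python
def overlap(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over tracked frames, and the count of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


print(overlap((0, 0, 10, 10), (5, 0, 10, 10)))
```

Reporting the two numbers separately, rather than a single score, keeps a tracker that is precise but fails often distinguishable from one that is sloppy but never loses the target.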
Class of trackers tested
• Single-object, single-camera
• Short-term
– Trackers performing without re-detection
• Causality
– Tracker is not allowed to use any future frames
• No prior knowledge about the target
– Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
ALOV (315 seq.) [Smeulders et al. 2013]
+ PTR (~50 seq.) [Vojir et al. 2013]
+ OTB (~50 seq.) [Wu et al. 2013]
+ >30 new sequences from the VOT2015 committee
= 443 sequences
Filtered out:
• Grayscale sequences
• Targets of <400 pixels
• Poorly defined targets
• Artificially created sequences
(Examples shown: a poorly defined target; an artificially created sequence)
→ 356 sequences
→ VOT Automatic Dataset Construction Protocol: cluster + sample
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated
automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences
Construction (3/3): Sampling
• Requirements:
– Diverse visual attributes
– Challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that sampled
challenging sequences while keeping the global
attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement:
1. Segment the target (semi-automatic)
2. Automatically fit a bounding box by optimizing a cost function
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences:
Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences:
Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
(1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (accuracy axis) with C-COT, TCNN, SSAT and MLDF labelled]
VOT 2016: Tracking speed
• Top performers are the slowest
– Plausible cause: CNN
• Real-time bound: Staple+
– Decent accuracy
– Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
– target learning (modelling, "template update")
– integration of detection and temporal smoothness assumptions
– representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter
school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague
established 1707
Area of expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant % of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3
good students for PhD
THANK YOU
Questions, please
Model-based Tracking People and Faces
httpcvlabepflchresearchcompletedrealtime_tracking httpwwwcsbrownedu~black3Dtrackinghtml
Slide Credit Patrick Perez 8150
Tracking is commonly used in practicehellip
9150
bull Tracking is popular research topic for decades
see CVPR ICCV ECCV hellip
bull But there is no online course devoted to trackinghellip
bull nor big coverage in computer vision courses
bull nor it is well covered in textbooks
Is it clear what tracking is
video credit
Helmut
Grabner
10150Slide credit Jiri Matas
Tracking Formulation - Literature
Surprisingly little is said about tracking in standard textbooks
Limited to optic flow plus some basic trackers eg Lucas-Kanade
Definition (0)
[Forsyth and Ponce Computer Vision A modern approach 2003]
ldquoTracking is the problem of generating an inference about the
motion of an object given a sequence of images
Good solutions of this problem have a variety of applicationshelliprdquo
11150Slide credit Jiri Matas
Formulation (1) Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes
bull The concept of an ldquoobjectrdquo in FampP definition disappeared
bull If an algorithm correctly established such correspondences
would that be a perfect tracker
bull tracking = motion estimation
12150Slide credit Jiri Matas
Tracking is Motion Estimation Optic Flow
13150Slide credit Jiri Matas
Tracking is Motion Estimation Optic Flow
14150
Tracking is Motion Estimation Optic Flow
Motion ldquopatternrdquo Camera tracking
Dense motion field
httpwwwcscmuedu~saadaProjectsCrowdSegmentation
httpwwwyoutubecomwatchv=ckVQrwYIjAs
Sparse motion field estimate
15150Slide credit Jiri Matas
Tracking is Motion Estimation Optic Flow
16150
Motion field
bull The motion field is the projection of the 3D scene
motion into the image
Slide credit James Hays
Tracking is Motion Estimation Optic Flow
17150Slide credit Jason Corso
Optic Flow
Standard formulation
bull At every pixel 2D displacement is estimated between consecutive frames
Missing
bull occlusion ndash disocclusion handling pixels visible in one image only
- in the standard formulation ldquodonrsquot knowrdquo is not an answer
bull considering the 3D nature of the world
bull large displacement handling - only recently addressed (EpicFlow 2015)
Practical issues hindering progress in optic flow
bull is the ground truth ever known
- learning and performance evaluation problematic (synthetic sequences )
bull requires generic regularization (smoothing)
bull failure (assumption validity) not easy to detect
In certain applications tracking is motion estimation on a part of the image
with specific constraints augmented reality sports analysis 18150
Formulation (1) Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes
bull The concept of an ldquoobjectrdquo in FampP definition disappeared
bull If an algorithm correctly established such correspondences
would that be a perfect tracker
bull tracking = motion estimation
Consider the Bolt sequence
19150
Definition (2) Tracking
Given an initial estimate of its position
locate X in a sequence of images
Where X may mean
bull A (rectangular) region
bull An ldquointerest pointrdquo and its neighbourhood
bull An ldquoobjectrdquo
This definition is adopted eg in a recent book by
Maggio and Cavallaro Video Tracking 2011
Smeulders T-PAMI13
Tracking is the analysis of video sequences for the
purpose of establishing the location of the target
over a sequence of frames (time) starting from
the bounding box given in the first frame
20150
Formulation (3) Tracking as Segmentation
J Fan et al Closed-Loop Adaptation for Robust Tracking ECCV 2010
21150
Tracking as model-based segmentation
22150
Tracking as segmentation
httpvisionucsdedu~kbransonresearchcvpr2005html
httpwww2immdtudk~aamtracking
bull heart
23150
A ldquostandardrdquo CV tracking method output
24150
Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required
Formulation (4) Tracking
Given an initial estimate of the pose and state of X
In all images in a sequence (in a causal manner)
1 estimate the pose and state of X
2 (optionally) update the model of X
bull Pose any geometric parameter (position scale hellip)
bull State appearance shapesegmentation visibility articulations
bull Model update essentially a semi-supervised learning problem
ndash a priori information (appearance shape dynamics hellip)
ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences
bull Causal for estimation at T use information from time t middot T
25150
Tracking for Black Mirror Blocking
26150
Tracking in 6D
27150
Tracking-Learning-Detection (TLD)
28150
A ldquomiraclerdquo Tracking a Transparent Object
video credit
Helmut
Grabner
H Grabner H Bischof On-line boosting and vision CVPR 2006
29150
Tracking the ldquoInvisiblerdquo
H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010
30150
video
Other Tracking Problems
helliphellip multiple object tracking hellip
32150
Multi-object Tracking
Other Tracking Problems
Cell division
httpwwwyoutubecomwatchv=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
httpwwwyoutubecomwatchv=YFKA647w4Jg
splitting and merging events hellip
34150
So I want to track
35150
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
What is here in libraries OpenCV
bull KLT tracker (1981)
bull CAMshift and Meanshift (1998)
bull TLD (2011)
bull MedianFLow (2010)
bull Boosting (2006)
bull MIL (2009)
and KCF (2012) in opencv_contrib38150
What is here in libraries BoofCV
bull SparseFlow (KLT tracker) (1991)
bull MeanShift (1998)
bull TLD (2011)
bull KCF (2012)
39150
httpsgithubcomlessthanoptimalBoofCV
Computer vision librar
Bad news OpenCV
40150httpwwwvotchallengenet
Reference implementation
41150
But authors often publish their own implementation on githubhellip
git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit
Good news
42150
Good news Not so
43150
Good news Not so
44150
Good newsEasy to contribute to open source
Just implement some modern tracking algorithm
45150
So we need to understand how tracking works
Classic
KLT tracker
46150
The KLT Tracker
47150Slide credit Kris Kitani
KLT Tracker
slide creditTomas Svoboda
48150
Importance in Computer Vision
bull Firstly published in 1981 as an image registration method [3]
bull Improved many times most importantly by Carlo Tomasi [54]
bull Free implementaton(s) available1
bull After more than two decades a project2 at CMU dedicated to this
bull single algorithm and results published in a premium journal [1]
bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html
Image alignment
slide creditTomas Svoboda 49150
Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image
Original Lucas-Kanade algorithm I
slide creditTomas Svoboda 50150
Original Lucas-Kanade algorithm II
slide creditTomas Svoboda 51150
Original Lucas-Kanade algorithm III
slide creditTomas Svoboda 52150
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
bull cost function sum of squared intensity differences
between template and window
bull optimization technique gradient descent
bull model learning no update last frame convex
combination
bull attractive properties
ndashfast
ndasheasily extended to image-to-image transformations with
multiple parameters
59150
What about deep
learning or
CNN for tracking
60150
61150
AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo
Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016
Recap CaffeNet architecture
63
Recent trackers development
74150httpsgithubcomfoolwoodbenchmark_results
CNN KCF
Discriminative Tracking (T by Detection)
t=0
samples
labels+1 +1 +1 minus1 minus1 minus1
Classifier
tgt0
hellipClassify subwindows to find target
75
Connection to Correlation
The Convolution Theorem
ldquoCross-correlation is equivalent to an
element-wise product in Fourier domainrdquo
bull where
ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858
(likewise for ො119857 and ෝ119856)
ndashtimes is element-wise product
ndash lowast is complex-conjugate (ie negate imaginary part)
119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺
bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)
76
Kernelized Correlation Filters
bull Circulant matrices are a very general tool which allows to replace
standard operations with fast Fourier operations
bull The same idea can by applied eg to the Kernel Ridge Regression
with K kernel matrix Kij = (xi xj) and dual space representation
bull For many kernels circulant data circulant K matrix
bull Diagonalizing with the DFT for learning the classifier yields
120630 = 119870 + 120582119868 minus1119858
119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)
ෝ120630 =ො119858
መ119844119857119857+ 120582
Fast solution in 119978 119899 log 119899 Typical kernel algorithms are
119978 1198992 or higher
rArr
77
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
bull very few
hyperparameters
bull code fits on one slide
of the presentation
bull Use HoG features
(32 channels)
bull ~300 FPS
bull Open-Source
(MatlabPythonJavaC)
function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)
end
function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))
end
function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))
end
Training and detection (Matlab)
Sum over channel dimensionin kernel computation
79
From KCF to Discriminative CF trackers
Basic
bull Henriques et al ndash CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
bull Danelljan et al ndashSRDCF
ndash spatial regularization in the learning process
rarr limits boundary effect
rarr penalize filter coefficients depending on their spatial location
ndash allows to use much larger search region
ndash more discriminative to background (more training data)
CNN-based Correlation Trackers
bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT
2016)
bull Ma et al
ndash features VGG-Net pretrained on ImageNet dataset extracted from
third fourth and fifth convolution layer
ndash for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
bull Nam et al ndash MDNet T-CNN
ndash CNN classification (3 convolution layers and 2 fully connected layers)
81
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Slide credit: Matej Kristan (matej.kristan@fri.uni-lj.si), DPAEV Workshop, ECCV 2016
VOT community evolution
[Figure: timeline of the VOT results papers, with the VOT2013 to VOT2016 submission deadlines marked]
• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• + VOT-TIR papers (69 and 70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
        | Perf. measures   | Dataset size   | Target box | Property  | Trackers tested
VOT2013 | ranks, A, R      | 16, s. manual  | manual     | per frame | 27
VOT2014 | ranks, A, R, EFO | 25, s. manual  | manual     | per frame | 38
VOT2015 | EAO, A, R, EFO   | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016 | EAO, A, R, EFO   | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
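The A and R columns stand for accuracy and robustness. As a rough illustration, accuracy is the average bounding-box overlap while the target is tracked, and robustness counts failures (frames where the overlap drops to zero). The sketch below is a simplification under those assumptions: the function names are made up for this example, boxes are assumed to be (x, y, w, h), and the re-initialisation that the real toolkit performs after a failure is left out.

```python
def iou(a, b):
    """Overlap (intersection over union) of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Average overlap over frames with non-zero overlap, and the failure count."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


# Toy sequence: perfect hit, small drift, then the target is lost.
gt = [(10, 10, 20, 20), (12, 10, 20, 20), (14, 10, 20, 20)]
pred = [(10, 10, 20, 20), (10, 10, 20, 20), (60, 60, 20, 20)]
acc, fails = accuracy_and_failures(pred, gt)
```

Here the third frame counts as one failure and the accuracy is averaged over the first two frames only.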
Class of trackers tested
• Single-object, single-camera
• Short-term
  – Trackers performing without re-detection
• Causality
  – Tracker is not allowed to use any future frames
• No prior knowledge about the target
  – Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
Candidate pool (443 sequences):
• ALOV (315 seq.) [Smeulders et al., 2013]
• PTR (~50 seq.) [Vojir et al., 2013]
• OTB (~50 seq.) [Wu et al., 2013]
• >30 new sequences from the VOT2015 committee
Filtered out:
• Grayscale sequences
• Targets of <400 pixels
• Poorly-defined targets
• Artificially created sequences
[Figure: example of a poorly defined target; example of an artificially created sequence]
Result: 356 sequences, passed to the VOT Automatic Dataset Construction Protocol (cluster + sample)
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for the 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Pipeline: feature encoding (11-dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences
Construction (3/3): Sampling
• Requirements:
  – Diverse visual attributes
  – Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that samples challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• Bounding boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (accuracy axis) with C-COT, TCNN, SSAT and MLDF labelled]
VOT 2016: Tracking speed
• Top performers are the slowest
  – Plausible cause: CNN
• Real-time bound: Staple+
  – Decent accuracy
  – Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ labelled]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  – target learning (modelling, “template update”)
  – integration of detection and temporal smoothness assumptions
  – representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics, Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)
Area of expertise: Computer Vision, Image Processing, Pattern Recognition
People:
• 12 academics (3 full, 2 associate, 7 assistant professors)
• 15 researchers
• 15 PhD students
• 15 MSc and BSc students
Impact & Excellence:
• significant share of funding (>75%) from the EU and private companies
• collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell)
• numerous science prizes
• hundreds of ISI citations to our publications per year
• competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please?
Tracking is commonly used in practice…
• Tracking has been a popular research topic for decades:
  see CVPR, ICCV, ECCV, …
• But there is no online course devoted to tracking…
• nor much coverage in computer vision courses
• nor is it well covered in textbooks
Is it clear what tracking is?
Video credit: Helmut Grabner
Slide credit: Jiri Matas
Tracking: Formulation - Literature
Surprisingly little is said about tracking in standard textbooks.
Limited to optic flow, plus some basic trackers, e.g. Lucas-Kanade.
Definition (0):
[Forsyth and Ponce, Computer Vision: A Modern Approach, 2003]
“Tracking is the problem of generating an inference about the motion of an object given a sequence of images. Good solutions of this problem have a variety of applications…”
Slide credit: Jiri Matas
Formulation (1): Tracking
Establishing point-to-point correspondences in consecutive frames of an image sequence
Notes:
• The concept of an “object” in the F&P definition disappeared.
• If an algorithm correctly established such correspondences, would that be a perfect tracker?
• tracking = motion estimation?
Slide credit: Jiri Matas
Tracking is Motion Estimation / Optic Flow?
Motion “pattern”, camera tracking
Dense motion field
http://www.cs.cmu.edu/~saada/Projects/CrowdSegmentation
http://www.youtube.com/watch?v=ckVQrwYIjAs
Sparse motion field estimate
Slide credit: Jiri Matas
Motion field
• The motion field is the projection of the 3D scene motion into the image
Slide credit: James Hays, Jason Corso
Optic Flow
Standard formulation:
• At every pixel, a 2D displacement is estimated between consecutive frames
Missing:
• occlusion – disocclusion handling: pixels visible in one image only
  – in the standard formulation, “don’t know” is not an answer
• considering the 3D nature of the world
• large displacement handling – only recently addressed (EpicFlow, 2015)
Practical issues hindering progress in optic flow:
• is the ground truth ever known?
  – learning and performance evaluation problematic (synthetic sequences?)
• requires generic regularization (smoothing)
• failure (assumption validity) not easy to detect
In certain applications, tracking is motion estimation on a part of the image with specific constraints: augmented reality, sports analysis
Formulation (1): Tracking (revisited)
• tracking = motion estimation?
Consider the Bolt sequence
Definition (2): Tracking
Given an initial estimate of its position, locate X in a sequence of images,
where X may mean:
• A (rectangular) region
• An “interest point” and its neighbourhood
• An “object”
This definition is adopted e.g. in a recent book by Maggio and Cavallaro, Video Tracking, 2011
Smeulders, T-PAMI13:
Tracking is the analysis of video sequences for the purpose of establishing the location of the target over a sequence of frames (time), starting from the bounding box given in the first frame.
Formulation (3): Tracking as Segmentation
J. Fan et al.: Closed-Loop Adaptation for Robust Tracking, ECCV 2010
Tracking as model-based segmentation
Tracking as segmentation
http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
http://www2.imm.dtu.dk/~aam/tracking
• heart
A “standard” CV tracking method output
Approximate motion estimation, approximate segmentation. Neither good optic flow nor precise segmentation is required.
Formulation (4): Tracking
Given an initial estimate of the pose and state of X,
in all images in a sequence (in a causal manner):
1. estimate the pose and state of X
2. (optionally) update the model of X
• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
  – a priori information (appearance, shape, dynamics, …)
  – labeled data (“track this”) + unlabeled data = the sequences
• Causal: for estimation at T, use information from time t ≤ T
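The two-step loop above can be sketched on a toy problem. This is only an illustration under assumptions made for the example: frames are 1-D lists of intensities, the pose is a single integer position, the model is a template matched by sum of squared differences, and the names `track`, `radius` and `rate` are hypothetical.

```python
def ssd(a, b):
    """Sum of squared differences between two equally long patches."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def track(frames, init_pos, width, radius=3, rate=0.1):
    """Causal 1-D tracker: frame t is processed using frames 0..t only."""
    template = list(frames[0][init_pos:init_pos + width])
    pos = init_pos
    positions = [pos]
    for frame in frames[1:]:
        # 1. estimate the pose: best SSD match near the previous position
        lo = max(0, pos - radius)
        hi = min(len(frame) - width, pos + radius)
        pos = min(range(lo, hi + 1),
                  key=lambda p: ssd(template, frame[p:p + width]))
        # 2. (optionally) update the model: convex combination with the new patch
        patch = frame[pos:pos + width]
        template = [(1 - rate) * t + rate * v for t, v in zip(template, patch)]
        positions.append(pos)
    return positions


def make_frame(length, start, brightness):
    """Synthetic frame: a small textured blob on a dark background."""
    frame = [0.0] * length
    for i, v in enumerate([1.0, 3.0, 2.0, 4.0]):
        frame[start + i] = v * brightness
    return frame


# The blob moves 2 pixels per frame and slowly gets brighter.
frames = [make_frame(30, 5 + 2 * t, 1.0 + 0.05 * t) for t in range(8)]
positions = track(frames, init_pos=5, width=4)
```

Setting `rate=0` gives the "no update" variant; a larger `rate` adapts faster but drifts more easily.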
Tracking for Black Mirror Blocking
Tracking in 6D
Tracking-Learning-Detection (TLD)
A “miracle”: Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof: On-line boosting and vision, CVPR 2006
Tracking the “Invisible”
H. Grabner, J. Matas, L. Van Gool, P. Cattin: Tracking the invisible: learning where the object might be, CVPR 2010
Other Tracking Problems
… multiple object tracking …
Multi-object Tracking
Other Tracking Problems
Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg
splitting and merging events …
So I want to track
Just link to some computer vision lib
from cv2 import tracker
or
#include <opencv2/core.hpp>
or
import org.opencv.core.Core;
import org.opencv.core.Mat;
What is here in libraries? OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
• and KCF (2012) in opencv_contrib
What is here in libraries? BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV
Computer vision librar
Bad news: OpenCV
http://www.votchallenge.net
Reference implementation
But authors often publish their own implementation on GitHub…
git clone https://github.com/martin-danelljan/Continuous-ConvOp.git
Good news
Good news? Not so
Good news: easy to contribute to open source
Just implement some modern tracking algorithm
So we need to understand how tracking works
Classic: KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [54]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.
Original Lucas-Kanade algorithm I-IV
Original Lucas-Kanade summary
Slide credit: Tomas Svoboda
KLT Tracker
For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]
Slide credit: Patrick Perez
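The three cases above can be told apart by the eigenvalues of the 2x2 matrix of summed gradient products over the patch (the matrix that Lucas-Kanade has to invert). A minimal sketch, assuming central-difference gradients and tiny synthetic patches; the function name is made up for this example.

```python
import math


def structure_tensor_eigenvalues(patch):
    """Eigenvalues (min, max) of the 2x2 gradient matrix summed over a patch."""
    h, w = len(patch), len(patch[0])
    gxx = gxy = gyy = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (patch[y][x + 1] - patch[y][x - 1]) / 2.0
            gy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0
            gxx += gx * gx
            gxy += gx * gy
            gyy += gy * gy
    # Closed-form eigenvalues of the symmetric matrix [[gxx, gxy], [gxy, gyy]]
    mean = (gxx + gyy) / 2.0
    root = math.sqrt(((gxx - gyy) / 2.0) ** 2 + gxy ** 2)
    return mean - root, mean + root


n = 9
uniform = [[1.0] * n for _ in range(n)]
edge = [[1.0 if x >= n // 2 else 0.0 for x in range(n)] for _ in range(n)]
corner = [[1.0 if (x >= n // 2 and y >= n // 2) else 0.0 for x in range(n)]
          for y in range(n)]

eig_uniform = structure_tensor_eigenvalues(uniform)  # both eigenvalues zero
eig_edge = structure_tensor_eigenvalues(edge)        # one zero: aperture problem
eig_corner = structure_tensor_eigenvalues(corner)    # both clearly positive
```

Selecting patches whose smaller eigenvalue exceeds a threshold is the idea behind the "good features to track" criterion of Tomasi and Shi.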
Aperture Problem: Example
Image from Gary Bradski's slides
If we look through a small hole…
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
– Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
– Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
– Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  – fast
  – easily extended to image-to-image transformations with multiple parameters
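To make the cost function concrete, here is a small sketch of translation-only alignment that minimises the sum of squared differences with Gauss-Newton steps. It is an illustration under assumptions, not a reference KLT implementation: the "image" is an analytic function so it can be sampled at real-valued coordinates without interpolation, gradients are taken numerically, and all names are made up for the example.

```python
import math


def scene(x, y):
    """Smooth synthetic image that can be sampled at any real coordinate."""
    return (math.sin(0.3 * x) + math.cos(0.4 * y)
            + 0.5 * math.sin(0.2 * x + 0.25 * y))


def lucas_kanade_translation(template, image, coords, iterations=10, h=1e-3):
    """Find p = (px, py) minimising sum over coords of (template(x) - image(x + p))^2."""
    px = py = 0.0
    for _ in range(iterations):
        a11 = a12 = a22 = b1 = b2 = 0.0
        for x, y in coords:
            wx, wy = x + px, y + py
            err = template(x, y) - image(wx, wy)
            # numerical image gradient at the warped position
            gx = (image(wx + h, wy) - image(wx - h, wy)) / (2 * h)
            gy = (image(wx, wy + h) - image(wx, wy - h)) / (2 * h)
            a11 += gx * gx
            a12 += gx * gy
            a22 += gy * gy
            b1 += gx * err
            b2 += gy * err
        # solve the 2x2 normal equations [a11 a12; a12 a22] dp = [b1; b2]
        det = a11 * a22 - a12 * a12
        px += (a22 * b1 - a12 * b2) / det
        py += (a11 * b2 - a12 * b1) / det
    return px, py


shift = (0.7, -0.4)


def shifted(x, y):
    """The input image: the template content moved by `shift`."""
    return scene(x - shift[0], y - shift[1])


coords = [(x, y) for x in range(15) for y in range(15)]
px, py = lucas_kanade_translation(scene, shifted, coords)
```

The determinant `det` is where conditioning matters: on a uniform patch or a straight edge it is (near) zero and the update is undefined.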
What about deep learning or CNN for tracking?
AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016
Recap: CaffeNet architecture
Recent trackers development
https://github.com/foolwood/benchmark_results
CNN, KCF
Discriminative Tracking (T. by Detection)
• t = 0: samples with labels (+1 +1 +1 −1 −1 −1) → classifier
• t > 0: … classify subwindows to find target
Connection to Correlation
The Convolution Theorem
“Cross-correlation is equivalent to an element-wise product in the Fourier domain”
$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$
• where
  – $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  – $\times$ is the element-wise product
  – $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
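The identity can be checked numerically on a short 1-D signal. The sketch below uses a naive O(n²) DFT written out by hand so that it needs only the standard library; in practice an FFT routine would be used instead. The signals and function names are made up for the example.

```python
import cmath


def dft(x, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * f * k / n)
               for k in range(n))
           for f in range(n)]
    return [v / n for v in out] if inverse else out


def cross_correlation_direct(x, w):
    """Cyclic cross-correlation: y[k] = sum_i x[i] * w[(i + k) mod n]."""
    n = len(x)
    return [sum(x[i] * w[(i + k) % n] for i in range(n)) for k in range(n)]


def cross_correlation_fourier(x, w):
    """Same result via conj(DFT(x)) * DFT(w), element-wise, then inverse DFT."""
    xf, wf = dft(x), dft(w)
    yf = [a.conjugate() * b for a, b in zip(xf, wf)]
    return [v.real for v in dft(yf, inverse=True)]


x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]
w = [0.5, -1.0, 2.0, 0.0, 1.0, 4.0]
direct = cross_correlation_direct(x, w)
fourier = cross_correlation_fourier(x, w)
```

Both lists agree up to floating-point error; the `% n` in the direct version is the cyclic wrap-around mentioned above.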
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression:
  $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$
  with kernel matrix $K$, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and dual space representation $\boldsymbol{\alpha}$
• For many kernels, circulant data ⇒ circulant $K$ matrix:
  $K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
  $\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$
  Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
  $k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$
• For the Gaussian kernel it yields
  $\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$
  (multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\mathcal{F}^{-1}$ term)
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features
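The formulas above can be tried out on a 1-D, single-channel toy signal. This is a rough sketch under assumptions, not the authors' code: it uses a hand-written naive DFT to stay within the standard library, divides the kernel exponent by the number of elements as a normalisation, passes the arguments in the order of the formula for f(z) (model x first, test signal z second), and picks the signal, `sigma` and `lam` arbitrarily.

```python
import cmath
import math


def dft(x, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[k] * cmath.exp(sign * 2j * cmath.pi * f * k / n)
               for k in range(n))
           for f in range(n)]
    return [v / n for v in out] if inverse else out


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x1 and every cyclic shift of x2."""
    n = len(x1)
    prod = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    c = [v.real for v in dft(prod, inverse=True)]
    norms = sum(v * v for v in x1) + sum(v * v for v in x2)
    return [math.exp(-abs(norms - 2.0 * ci) / (sigma ** 2 * n)) for ci in c]


def train(x, y, sigma, lam):
    """Dual coefficients in the Fourier domain: y_hat / (k_hat_xx + lambda)."""
    kf = dft(kernel_correlation(x, x, sigma))
    return [yf / (k + lam) for yf, k in zip(dft(y), kf)]


def detect(alphaf, x, z, sigma):
    """Response for all cyclic shifts of z: F^-1(k_hat_xz * alpha_hat)."""
    kf = dft(kernel_correlation(x, z, sigma))
    return [v.real for v in dft([a * k for a, k in zip(alphaf, kf)], inverse=True)]


n, shift = 32, 5
x = [math.sin(0.7 * i) + 0.5 * math.cos(1.9 * i) + (i * 7 % 5) / 5.0
     for i in range(n)]
# Desired response: a Gaussian peak at shift 0 (wrapping around cyclically)
y = [math.exp(-0.5 * (min(i, n - i) / 2.0) ** 2) for i in range(n)]
z = [x[(i - shift) % n] for i in range(n)]  # the same signal moved by 5 samples

alphaf = train(x, y, sigma=0.5, lam=1e-4)
response = detect(alphaf, x, z, sigma=0.5)
peak = max(range(n), key=lambda i: response[i])
```

The position of the response peak recovers the cyclic shift between the model and the test signal. Swapping the two arguments of `kernel_correlation` in `detect` mirrors the response, so the peak would appear at `n - shift` instead.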
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)
function alphaf = train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);
end
function y = detect(alphaf, x, z, sigma)
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));
end
function k = kernel_correlation(x1, x2, sigma)
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end
Training and detection (Matlab)
Sum over channel dimension in kernel computation
Is it clear what tracking is
video credit
Helmut
Grabner
10150Slide credit Jiri Matas
Tracking Formulation - Literature
Surprisingly little is said about tracking in standard textbooks
Limited to optic flow plus some basic trackers eg Lucas-Kanade
Definition (0)
[Forsyth and Ponce Computer Vision A modern approach 2003]
ldquoTracking is the problem of generating an inference about the
motion of an object given a sequence of images
Good solutions of this problem have a variety of applicationshelliprdquo
11150Slide credit Jiri Matas
Formulation (1) Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes
bull The concept of an ldquoobjectrdquo in FampP definition disappeared
bull If an algorithm correctly established such correspondences
would that be a perfect tracker
bull tracking = motion estimation
12150Slide credit Jiri Matas
Tracking is Motion Estimation Optic Flow
13150Slide credit Jiri Matas
Tracking is Motion Estimation Optic Flow
14150
Tracking is Motion Estimation Optic Flow
Motion ldquopatternrdquo Camera tracking
Dense motion field
httpwwwcscmuedu~saadaProjectsCrowdSegmentation
httpwwwyoutubecomwatchv=ckVQrwYIjAs
Sparse motion field estimate
15150Slide credit Jiri Matas
Tracking is Motion Estimation Optic Flow
16150
Motion field
bull The motion field is the projection of the 3D scene
motion into the image
Slide credit James Hays
Tracking is Motion Estimation Optic Flow
17150Slide credit Jason Corso
Optic Flow
Standard formulation
bull At every pixel 2D displacement is estimated between consecutive frames
Missing
bull occlusion ndash disocclusion handling pixels visible in one image only
- in the standard formulation ldquodonrsquot knowrdquo is not an answer
bull considering the 3D nature of the world
bull large displacement handling - only recently addressed (EpicFlow 2015)
Practical issues hindering progress in optic flow
bull is the ground truth ever known
- learning and performance evaluation problematic (synthetic sequences )
bull requires generic regularization (smoothing)
bull failure (assumption validity) not easy to detect
In certain applications tracking is motion estimation on a part of the image
with specific constraints augmented reality sports analysis 18150
Formulation (1) Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes
bull The concept of an ldquoobjectrdquo in FampP definition disappeared
bull If an algorithm correctly established such correspondences
would that be a perfect tracker
bull tracking = motion estimation
Consider the Bolt sequence
19150
Definition (2) Tracking
Given an initial estimate of its position
locate X in a sequence of images
Where X may mean
bull A (rectangular) region
bull An ldquointerest pointrdquo and its neighbourhood
bull An ldquoobjectrdquo
This definition is adopted eg in a recent book by
Maggio and Cavallaro Video Tracking 2011
Smeulders T-PAMI13
Tracking is the analysis of video sequences for the
purpose of establishing the location of the target
over a sequence of frames (time) starting from
the bounding box given in the first frame
20150
Formulation (3) Tracking as Segmentation
J Fan et al Closed-Loop Adaptation for Robust Tracking ECCV 2010
21150
Tracking as model-based segmentation
22150
Tracking as segmentation
httpvisionucsdedu~kbransonresearchcvpr2005html
httpwww2immdtudk~aamtracking
bull heart
23150
A ldquostandardrdquo CV tracking method output
24150
Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required
Formulation (4) Tracking
Given an initial estimate of the pose and state of X
In all images in a sequence (in a causal manner)
1 estimate the pose and state of X
2 (optionally) update the model of X
bull Pose any geometric parameter (position scale hellip)
bull State appearance shapesegmentation visibility articulations
bull Model update essentially a semi-supervised learning problem
ndash a priori information (appearance shape dynamics hellip)
ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences
bull Causal for estimation at T use information from time t middot T
25150
Tracking for Black Mirror Blocking
26150
Tracking in 6D
27150
Tracking-Learning-Detection (TLD)
28150
A ldquomiraclerdquo Tracking a Transparent Object
video credit
Helmut
Grabner
H Grabner H Bischof On-line boosting and vision CVPR 2006
29150
Tracking the ldquoInvisiblerdquo
H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010
30150
video
Other Tracking Problems
helliphellip multiple object tracking hellip
32150
Multi-object Tracking
Other Tracking Problems
Cell division
httpwwwyoutubecomwatchv=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
httpwwwyoutubecomwatchv=YFKA647w4Jg
splitting and merging events hellip
34150
So I want to track
35150
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
What is here in libraries OpenCV
bull KLT tracker (1981)
bull CAMshift and Meanshift (1998)
bull TLD (2011)
bull MedianFLow (2010)
bull Boosting (2006)
bull MIL (2009)
and KCF (2012) in opencv_contrib38150
What is here in libraries BoofCV
bull SparseFlow (KLT tracker) (1991)
bull MeanShift (1998)
bull TLD (2011)
bull KCF (2012)
39150
httpsgithubcomlessthanoptimalBoofCV
Computer vision librar
Bad news OpenCV
40150httpwwwvotchallengenet
Reference implementation
41150
But authors often publish their own implementation on githubhellip
git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit
Good news
42150
Good news Not so
43150
Good news Not so
44150
Good newsEasy to contribute to open source
Just implement some modern tracking algorithm
45150
So we need to understand how tracking works
Classic: KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU was dedicated to this single algorithm, and the results were published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
Slide credit: Tomas Svoboda
The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]^T. I(x) may also be a small subwindow within an image.
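The alignment goal can be written as a least-squares problem. The following is a sketch of the usual Lucas-Kanade objective; the warp W(x; p) and its parameter vector p are notation assumed here, not taken from the slide text:

```latex
\min_{\mathbf{p}} \sum_{\mathbf{x}} \bigl[ I\bigl(W(\mathbf{x};\mathbf{p})\bigr) - T(\mathbf{x}) \bigr]^{2},
\qquad
W(\mathbf{x};\mathbf{p}) = \mathbf{x} + \mathbf{p} \quad \text{(pure translation)}
```

For a pure translation, p has two components, which is the case the KLT tracker handles by default.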
Original Lucas-Kanade algorithm I-IV and summary
Slide credit: Tomas Svoboda
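Since the derivation slides survive only as titles, here is a minimal sketch of the translation-only case, assuming a Gauss-Newton iteration on the sum of squared differences. The function `scene`, the window size and the shift are hypothetical test data; images are modelled as continuous functions so that no interpolation code is needed.

```python
import math

def scene(x, y):
    # Hypothetical smooth, textured intensity pattern standing in for a real image.
    return math.sin(0.5 * x) * math.cos(0.3 * y) + 0.5 * math.sin(0.2 * x + 0.4 * y)

def lucas_kanade_translation(template, image, points, iters=30, h=1e-3):
    """Gauss-Newton estimate of p minimising sum [image(x + p) - template(x)]^2."""
    px, py = 0.0, 0.0
    for _ in range(iters):
        a = b = c = rx = ry = 0.0
        for x, y in points:
            wx, wy = x + px, y + py
            # Image gradient at the warped position (central differences).
            gx = (image(wx + h, wy) - image(wx - h, wy)) / (2 * h)
            gy = (image(wx, wy + h) - image(wx, wy - h)) / (2 * h)
            err = template(x, y) - image(wx, wy)
            # Accumulate the 2x2 normal equations H * dp = r.
            a += gx * gx
            b += gx * gy
            c += gy * gy
            rx += gx * err
            ry += gy * err
        det = a * c - b * b
        if abs(det) < 1e-12:
            raise ValueError("patch lacks structure (aperture problem)")
        dx = (c * rx - b * ry) / det
        dy = (a * ry - b * rx) / det
        px, py = px + dx, py + dy
        if math.hypot(dx, dy) < 1e-10:
            break
    return px, py

true_shift = (0.4, -0.3)

def shifted(x, y):
    # The same scene with its content moved by true_shift.
    return scene(x - true_shift[0], y - true_shift[1])

window = [(x, y) for x in range(15) for y in range(15)]
estimate = lucas_kanade_translation(scene, shifted, window)
print(estimate)
```

The 2x2 system solved in each iteration is the matrix whose conditioning decides whether a patch can be tracked at all.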
KLT Tracker
For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
Slide credit: Patrick Perez
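The three cases can be checked numerically. This is a sketch under the assumption that "conditioning" is judged by the eigenvalues of the 2x2 gradient matrix summed over the patch; the 8x8 patches below are hypothetical toy data.

```python
import math

def structure_tensor_eigenvalues(patch):
    """Return (small, large) eigenvalues of sum [[gx*gx, gx*gy], [gx*gy, gy*gy]]."""
    rows, cols = len(patch), len(patch[0])
    a = b = c = 0.0
    for y in range(1, rows - 1):
        for x in range(1, cols - 1):
            gx = (patch[y][x + 1] - patch[y][x - 1]) / 2.0
            gy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0
            a += gx * gx
            b += gx * gy
            c += gy * gy
    mean = (a + c) / 2.0
    spread = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return mean - spread, mean + spread

n = 8
uniform = [[1.0] * n for _ in range(n)]
edge = [[1.0 if x >= n // 2 else 0.0 for x in range(n)] for _ in range(n)]
corner = [[1.0 if x >= n // 2 and y >= n // 2 else 0.0 for x in range(n)]
          for y in range(n)]

for name, patch in (("uniform", uniform), ("edge", edge), ("corner", corner)):
    print(name, structure_tensor_eigenvalues(patch))
```

Both eigenvalues vanish for the uniform patch, only the smaller one vanishes for the edge, and both are positive for the corner, which is the only patch of the three where a 2D displacement can be recovered.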
Aperture Problem Example
Image from Gary Bradski slides
If we look through a small hole...
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
- Arbitrary displacement
• Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
- Translation is usually sufficient for small fragments, but:
• Perspective transforms and occlusions cause drift and loss
- Two complementary options:
• Kill tracklets when the minimum SSD is too large
• Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
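A minimal sketch of the two checks, assuming patches are small lists of intensity rows and a single hypothetical threshold `max_ssd`. The affine warp of the initial patch is left out here, so the second check is only a simplified stand-in for the option named above.

```python
def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(patch_a, patch_b)
               for a, b in zip(row_a, row_b))

def keep_tracklet(initial_patch, previous_patch, current_patch, max_ssd):
    """Keep a tracklet only if it matches both the previous and the initial patch."""
    return (ssd(previous_patch, current_patch) <= max_ssd
            and ssd(initial_patch, current_patch) <= max_ssd)

initial = [[10, 20], [30, 40]]
previous = [[11, 21], [31, 41]]
drifted = [[14, 24], [34, 44]]

print(keep_tracklet(initial, initial, previous, max_ssd=50))   # small change: keep
print(keep_tracklet(initial, previous, drifted, max_ssd=50))   # drifted from start: kill
```

Comparing against the initial patch catches slow drift that frame-to-frame comparison alone would accept.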
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
- fast
- easily extended to image-to-image transformations with multiple parameters
What about deep learning or CNNs for tracking?
Recap: CaffeNet architecture
AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012
CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016
Recent trackers development
https://github.com/foolwood/benchmark_results
[Figure: overview of recent trackers; labels: CNN, KCF]
Discriminative Tracking (Tracking by Detection)
t = 0: samples with labels +1 +1 +1 -1 -1 -1 train a classifier
t > 0: classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
"Cross-correlation is equivalent to an element-wise product in the Fourier domain":
$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$
• where
- $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
- $\times$ is the element-wise product
- $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
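The equivalence is easy to verify on a short signal. This is a sketch with a naive O(n^2) DFT so that it needs only the standard library; the two test vectors are arbitrary, and the cyclic indexing in `circular_cross_correlation` is the wrap-around mentioned in the note.

```python
import cmath

def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * t * k / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out

def circular_cross_correlation(x, w):
    """Direct definition: y[k] = sum_t x[t] * w[(t + k) mod n]."""
    n = len(x)
    return [sum(x[t] * w[(t + k) % n] for t in range(n)) for k in range(n)]

def fourier_cross_correlation(x, w):
    """Same result via y_hat = conj(x_hat) * w_hat (element-wise)."""
    y_hat = [a.conjugate() * b for a, b in zip(dft(x), dft(w))]
    return [c.real for c in dft(y_hat, inverse=True)]

x = [1.0, 2.0, 0.0, -1.0, 0.5, 3.0]
w = [0.5, 0.0, 1.0, 3.0, -2.0, 1.5]
direct = circular_cross_correlation(x, w)
fast = fourier_cross_correlation(x, w)
print(max(abs(a - b) for a, b in zip(direct, fast)))
```

With a real FFT in place of the naive transform, the Fourier route costs O(n log n) instead of O(n^2).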
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied, e.g., to Kernel Ridge Regression with the kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and the dual-space representation
$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$
• For many kernels: circulant data $\Rightarrow$ circulant $K$ matrix,
$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$
Fast solution in $\mathcal{O}(n \log n)$; typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
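The Fourier-domain solution can be checked against the linear system it replaces. This is a sketch assuming a symmetric first row `k` (as a kernel auto-correlation is) and a naive DFT; the numbers in `k`, `y` and `lam` are made up for illustration.

```python
import cmath

def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * t * m / n) for t in range(n))
           for m in range(n)]
    return [c / n for c in out] if inverse else out

def solve_circulant_krr(k, y, lam):
    """Solve (C(k) + lam * I) alpha = y via alpha_hat = y_hat / (k_hat + lam)."""
    alpha_hat = [b / (a + lam) for a, b in zip(dft(k), dft(y))]
    return [c.real for c in dft(alpha_hat, inverse=True)]

k = [1.0, 0.5, 0.2, 0.1, 0.2, 0.5]   # first row of K, symmetric: k[t] == k[n - t]
y = [1.0, 0.6, 0.1, 0.0, 0.1, 0.6]   # regression targets
lam = 0.1
alpha = solve_circulant_krr(k, y, lam)

# Residual of the original system, with K[i][j] = k[(j - i) mod n].
n = len(k)
residual = max(abs(sum(k[(j - i) % n] * alpha[j] for j in range(n))
                   + lam * alpha[i] - y[i]) for i in range(n))
print(residual)
```

No matrix is ever formed or inverted in `solve_circulant_krr`; only the first row of K is needed.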
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$
• For the Gaussian kernel it yields
$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^{2}}\left(\|\mathbf{x}\|^{2} + \|\mathbf{x}'\|^{2} - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$
(multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term)
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)
Training and detection (Matlab):
function alphaf = train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);             % kernel auto-correlation of the model patch
  alphaf = fft2(y) ./ (fft2(k) + lambda);          % ridge regression solved in the Fourier domain
end
function y = detect(alphaf, x, z, sigma)
  k = kernel_correlation(z, x, sigma);             % kernel correlation of the new patch z with the model x
  y = real(ifft2(alphaf .* fft2(k)));              % response for all cyclic shifts
end
function k = kernel_correlation(x1, x2, sigma)
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));   % sum over the channel dimension
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;         % squared distances for all cyclic shifts
  k = exp(-1 / sigma^2 * abs(d) / numel(d));       % Gaussian kernel
end
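To see train and detect working end to end, here is a one-dimensional, single-channel sketch that mirrors the Matlab functions with a naive DFT. The signal `x`, the Gaussian-shaped labels `y` (peak at index 0) and the parameter values are assumptions chosen for illustration, not values from the slides.

```python
import cmath
import math

def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * t * m / n) for t in range(n))
           for m in range(n)]
    return [c / n for c in out] if inverse else out

def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x1 and every cyclic shift of x2."""
    n = len(x1)
    c = dft([a.conjugate() * b for a, b in zip(dft(x1), dft(x2))], inverse=True)
    norms = sum(v * v for v in x1) + sum(v * v for v in x2)
    return [math.exp(-abs(norms - 2 * ci.real) / (sigma ** 2 * n)) for ci in c]

def train(x, y, sigma, lam):
    k_hat = dft(kernel_correlation(x, x, sigma))
    return [yf / (kf + lam) for yf, kf in zip(dft(y), k_hat)]

def detect(alphaf, x, z, sigma):
    k_hat = dft(kernel_correlation(z, x, sigma))
    return [r.real for r in dft([a * k for a, k in zip(alphaf, k_hat)], inverse=True)]

n = 16
x = [math.sin(0.9 * t) + 0.3 * math.cos(2.3 * t) + 0.05 * t for t in range(n)]
y = [math.exp(-0.5 * min(t, n - t) ** 2) for t in range(n)]   # desired response
alphaf = train(x, y, sigma=0.5, lam=1e-4)

shift = 3
z = [x[(t - shift) % n] for t in range(n)]    # content moved right by `shift`
response = detect(alphaf, x, z, sigma=0.5)
peak = max(range(n), key=response.__getitem__)
print(peak)
```

With the argument order `kernel_correlation(z, x, ...)` used on the slide, the peak index equals minus the shift modulo n; swapping the two arguments flips the sign convention.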
From KCF to Discriminative CF trackers
Basic
• Henriques et al.: CSK
- raw grayscale pixel values as features
• Henriques et al.: KCF
- HoG multi-channel features
Further work
• Danelljan et al.: DSST
- PCA-HoG + grayscale pixel features
- filters for translation and for scale (in the scale-space pyramid)
• Li et al.: SAMF
- HoG, color-naming and grayscale pixel features
- quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al.: SRDCF
- spatial regularization in the learning process
→ limits boundary effects
→ penalizes filter coefficients depending on their spatial location
- allows the use of a much larger search region
- more discriminative against background (more training data)
CNN-based Correlation Trackers
• Danelljan et al.: Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
- features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
- for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al.: MDNet, T-CNN
- CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? "The zero-order tracker"
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
[Figure: timeline marked with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines; axis ticks 1500 and 3000]
• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
+ VOT-TIR paper (69 coauthors)
+ VOT-TIR paper (70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
          Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013   ranks, A, R        16, s manual     manual       per frame   27
VOT2014   ranks, A, R, EFO   25, s manual     manual       per frame   38
VOT2015   EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016   EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR
Class of trackers tested
• Single-object, single-camera
• Short-term
- trackers performing without re-detection
• Causality
- the tracker is not allowed to use any future frames
• No prior knowledge about the target
- only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
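Because the object state is a bounding box, comparing a tracker's output with the ground truth reduces to comparing two boxes. The slides do not spell out the comparison; a common choice, assumed here, is intersection over union, with boxes given as hypothetical (x, y, width, height) tuples.

```python
def overlap(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

ground_truth = (10, 10, 20, 20)
prediction = (20, 10, 20, 20)    # same size, shifted right by half a box width
print(overlap(ground_truth, prediction))
```

A value of zero on a frame would typically be counted as a tracking failure, while the average over successfully tracked frames describes accuracy.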
Construction (1/3): Sequence candidates
Sources:
• ALOV (315 seq.) [Smeulders et al. 2013]
• PTR (~50 seq.) [Vojir et al. 2013]
• OTB (~50 seq.) [Wu et al. 2013]
• >30 new sequences from the VOT2015 committee
Total: 443 sequences
Filtered out:
• Grayscale sequences
• Targets smaller than 400 pixels
• Poorly defined targets (example on slide)
• Artificially created sequences (example on slide)
Remaining: 356 sequences, passed to the VOT Automatic Dataset Construction Protocol (cluster + sample)
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Feature encoding: 11 dimensions per sequence
• Cluster similar sequences into K = 28 groups with Affinity Propagation [Frey, Dueck 2007] (automatic selection of K)
Construction (3/3): Sampling
• Requirements:
- diverse visual attributes
- challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that samples challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
1. Segment the target (semi-automatic)
2. Automatically fit a bounding box by optimizing a cost function
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• Bounding boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, correlation filters
• Engineering: use of basic features
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
• Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (accuracy axis) with C-COT, TCNN, SSAT and MLDF labelled]
VOT 2016: Tracking speed
• Top performers are the slowest
- plausible cause: CNN
• Real-time bound: Staple+
- decent accuracy
- decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ labelled]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
- target learning (modelling, "template update")
- integration of detection and temporal smoothness assumptions
- representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University Prague
established 1707
Area of expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & excellence: a significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Tracking Formulation - Literature
Surprisingly little is said about tracking in standard textbooks; coverage is limited to optic flow plus some basic trackers, e.g. Lucas-Kanade.
Definition (0)
[Forsyth and Ponce, Computer Vision: A Modern Approach, 2003]
"Tracking is the problem of generating an inference about the motion of an object given a sequence of images. Good solutions of this problem have a variety of applications..."
Slide credit: Jiri Matas
Formulation (1): Tracking
Establishing point-to-point correspondences in consecutive frames of an image sequence
Notes:
• The concept of an "object" in the F&P definition has disappeared
• If an algorithm correctly established such correspondences, would that be a perfect tracker?
• tracking = motion estimation
Slide credit: Jiri Matas
Tracking is Motion Estimation / Optic Flow
Slide credit: Jiri Matas
Motion "pattern"; camera tracking
Dense motion field
http://www.cs.cmu.edu/~saada/Projects/CrowdSegmentation
http://www.youtube.com/watch?v=ckVQrwYIjAs
Sparse motion field estimate
Motion field
• The motion field is the projection of the 3D scene motion into the image
Slide credit: James Hays
Slide credit: Jason Corso
Optic Flow
Standard formulation:
• At every pixel, a 2D displacement is estimated between consecutive frames
Missing:
• occlusion/disocclusion handling: pixels visible in one image only
- in the standard formulation, "don't know" is not an answer
• considering the 3D nature of the world
• large displacement handling: only recently addressed (EpicFlow 2015)
Practical issues hindering progress in optic flow:
• is the ground truth ever known?
- learning and performance evaluation are problematic (synthetic sequences?)
• requires generic regularization (smoothing)
• failure (assumption validity) is not easy to detect
In certain applications, tracking is motion estimation on a part of the image with specific constraints: augmented reality, sports analysis
Formulation (1): Tracking
Establishing point-to-point correspondences in consecutive frames of an image sequence
Notes:
• The concept of an "object" in the F&P definition has disappeared
• If an algorithm correctly established such correspondences, would that be a perfect tracker?
• tracking = motion estimation
Consider the Bolt sequence
Definition (2): Tracking
Given an initial estimate of its position, locate X in a sequence of images, where X may mean:
• A (rectangular) region
• An "interest point" and its neighbourhood
• An "object"
This definition is adopted, e.g., in a recent book by Maggio and Cavallaro, Video Tracking, 2011.
Smeulders, T-PAMI13:
Tracking is the analysis of video sequences for the purpose of establishing the location of the target over a sequence of frames (time), starting from the bounding box given in the first frame.
Formulation (3): Tracking as Segmentation
J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010
Tracking as model-based segmentation
Tracking as segmentation
http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
http://www2.imm.dtu.dk/~aam/tracking
• heart
A "standard" CV tracking method output
Approximate motion estimation, approximate segmentation. Neither good optic flow nor precise segmentation is required.
Formulation (4): Tracking
Given an initial estimate of the pose and state of X, in all images in a sequence (in a causal manner):
1. estimate the pose and state of X
2. (optionally) update the model of X
• Pose: any geometric parameter (position, scale, ...)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
- a priori information (appearance, shape, dynamics, ...)
- labeled data ("track this") + unlabeled data = the sequences
• Causal: for estimation at time T, use only information from times t ≤ T
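The two-step formulation maps directly onto a loop. This is a toy sketch, assuming 1-D "frames" (lists of intensities), an integer position as the pose and a single intensity as the model; `estimate_nearest` and `update_convex` are hypothetical stand-ins for a real estimator and model update.

```python
def track(frames, initial_pose, initial_model, estimate, update=None):
    """Causal tracking loop: frame t is processed using only frames 0..t."""
    pose, model = initial_pose, initial_model
    poses = [pose]
    for frame in frames[1:]:
        pose = estimate(frame, pose, model)       # step 1: estimate pose/state
        if update is not None:
            model = update(frame, pose, model)    # step 2: optional model update
        poses.append(pose)
    return poses

def estimate_nearest(frame, pose, model, radius=1):
    """Pick the position near the previous pose whose intensity best matches the model."""
    lo, hi = max(0, pose - radius), min(len(frame) - 1, pose + radius)
    return min(range(lo, hi + 1), key=lambda i: abs(frame[i] - model))

def update_convex(frame, pose, model, rate=0.5):
    """Convex combination of the old model and the newly observed appearance."""
    return (1 - rate) * model + rate * frame[pose]

frames = [
    [0, 9, 0, 0, 0],      # first frame: target at index 1 ("track this")
    [0, 0, 8.5, 0, 0],    # target moves right and fades slightly
    [0, 0, 0, 8, 0],
]
poses = track(frames, initial_pose=1, initial_model=9.0,
              estimate=estimate_nearest, update=update_convex)
print(poses)
```

Passing `update=None` gives the "no update" variant; the labeled first frame supplies the initial model and every later frame is unlabeled data.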
25150
Tracking for Black Mirror Blocking
26150
Tracking in 6D
27150
Tracking-Learning-Detection (TLD)
28150
A ldquomiraclerdquo Tracking a Transparent Object
video credit
Helmut
Grabner
H Grabner H Bischof On-line boosting and vision CVPR 2006
29150
Tracking the ldquoInvisiblerdquo
H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010
30150
video
Other Tracking Problems
helliphellip multiple object tracking hellip
32150
Multi-object Tracking
Other Tracking Problems
Cell division
httpwwwyoutubecomwatchv=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
httpwwwyoutubecomwatchv=YFKA647w4Jg
splitting and merging events hellip
34150
So I want to track
35150
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
What is here in libraries OpenCV
bull KLT tracker (1981)
bull CAMshift and Meanshift (1998)
bull TLD (2011)
bull MedianFLow (2010)
bull Boosting (2006)
bull MIL (2009)
and KCF (2012) in opencv_contrib38150
What is here in libraries BoofCV
bull SparseFlow (KLT tracker) (1991)
bull MeanShift (1998)
bull TLD (2011)
bull KCF (2012)
39150
httpsgithubcomlessthanoptimalBoofCV
Computer vision librar
Bad news OpenCV
40150httpwwwvotchallengenet
Reference implementation
41150
But authors often publish their own implementation on githubhellip
git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit
Good news
42150
Good news Not so
43150
Good news Not so
44150
Good newsEasy to contribute to open source
Just implement some modern tracking algorithm
45150
So we need to understand how tracking works
Classic
KLT tracker
46150
The KLT Tracker
47150Slide credit Kris Kitani
KLT Tracker
slide creditTomas Svoboda
48150
Importance in Computer Vision
bull Firstly published in 1981 as an image registration method [3]
bull Improved many times most importantly by Carlo Tomasi [54]
bull Free implementaton(s) available1
bull After more than two decades a project2 at CMU dedicated to this
bull single algorithm and results published in a premium journal [1]
bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html
Image alignment
slide creditTomas Svoboda 49150
Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image
Original Lucas-Kanade algorithm I
slide creditTomas Svoboda 50150
Original Lucas-Kanade algorithm II
slide creditTomas Svoboda 51150
Original Lucas-Kanade algorithm III
slide creditTomas Svoboda 52150
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
bull cost function sum of squared intensity differences
between template and window
bull optimization technique gradient descent
bull model learning no update last frame convex
combination
bull attractive properties
ndashfast
ndasheasily extended to image-to-image transformations with
multiple parameters
59150
What about deep
learning or
CNN for tracking
60150
61150
AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo
Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016
Recap CaffeNet architecture
63
Recent trackers development
74150httpsgithubcomfoolwoodbenchmark_results
CNN KCF
Discriminative Tracking (T by Detection)
t=0
samples
labels+1 +1 +1 minus1 minus1 minus1
Classifier
tgt0
hellipClassify subwindows to find target
75
Connection to Correlation
The Convolution Theorem
ldquoCross-correlation is equivalent to an
element-wise product in Fourier domainrdquo
bull where
ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858
(likewise for ො119857 and ෝ119856)
ndashtimes is element-wise product
ndash lowast is complex-conjugate (ie negate imaginary part)
119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺
bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)
76
Kernelized Correlation Filters
bull Circulant matrices are a very general tool which allows to replace
standard operations with fast Fourier operations
bull The same idea can by applied eg to the Kernel Ridge Regression
with K kernel matrix Kij = (xi xj) and dual space representation
bull For many kernels circulant data circulant K matrix
bull Diagonalizing with the DFT for learning the classifier yields
120630 = 119870 + 120582119868 minus1119858
119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)
ෝ120630 =ො119858
መ119844119857119857+ 120582
Fast solution in 119978 119899 log 119899 Typical kernel algorithms are
119978 1198992 or higher
rArr
77
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
bull very few
hyperparameters
bull code fits on one slide
of the presentation
bull Use HoG features
(32 channels)
bull ~300 FPS
bull Open-Source
(MatlabPythonJavaC)
function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)
end
function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))
end
function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))
end
Training and detection (Matlab)
Sum over channel dimensionin kernel computation
79
From KCF to Discriminative CF trackers
Basic
bull Henriques et al ndash CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
bull Danelljan et al ndashSRDCF
ndash spatial regularization in the learning process
rarr limits boundary effect
rarr penalize filter coefficients depending on their spatial location
ndash allows to use much larger search region
ndash more discriminative to background (more training data)
CNN-based Correlation Trackers
bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT
2016)
bull Ma et al
ndash features VGG-Net pretrained on ImageNet dataset extracted from
third fourth and fifth convolution layer
ndash for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
bull Nam et al ndash MDNet T-CNN
ndash CNN classification (3 convolution layers and 2 fully connected layers)
81
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT challenge evolution
bull Gradual increase of dataset size
bull Gradual refinement of dataset construction
bull Gradual refinement of performance measures
bull Gradual increase of tested trackers
Perf Measures Dataset size Target box Property Trackers tested
VOT2013 ranks A R 16 s manual manual per frame 27
VOT2014 ranks A R EFO 25 s manual manual per frame 38
VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR
VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR
8739
Class of trackers tested
bull Single-object single-camera
bull Short-term
ndashTrackers performing without re-detection
bull Causality
ndashTracker is not allowed to use any future frames
bull No prior knowledge about the target
ndashOnly a single training example ndash BBox in the first frame
bull Object state encoded by a bounding box
88150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (13) Sequence candidates
ALOV (315 seq)[Smeulders et al2013]
Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences
Example Poorly defined target Example Artificially created
356 sequences
PTR (~50 seq)[Vojir et al2013]
+OTB (~50 seq)
[Wu et al2013]
+
gt30 new sequencesfrom VOT2015
committee
+
443sequences
VOT Automatic Dataset Construction Protocol
cluster + sample
8939
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (23) Clustering
bull Approximately annotate targets
bull 11 global attributes estimated
automatically for 356 sequences (eg blur camera motion object motion)
bull Cluster into K = 28 groups (automatic selection of K)
Feature encoding
11 dim
Affinity Propagation [Frey Dueck 2007]
Cluster similar sequences
9039
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (33) Sampling
bull Requirement
bull Diverse visual attributes
bull Challenging subset
bull Global visual attributes computed
bull Tracking difficulty attribute Applied FoT ASMS KCF trackers
bull Developed a sampling strategy that sampled
challenging sequences while keeping the global
attributes diverse
9139
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT201516 dataset 60 sequences
9239
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Object annotation
bull Automatic bounding box placement
1 Segment the target (semi-automatic)
2 Automatically fit a bounding box by optimizing a cost function
9339
Kristan et al VOT2016 results
Sequence ranking
bull Among the most challenging sequences
bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)
9442
Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)
Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT Results Realtime
bull Flow-based Mean Shift-based Correlation filters
bull Engineering use of basic features
2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)
2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)
ASMSBDF
FoT
2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)
PLTFoT
CCMS
9639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Results
bull C-COT slightly ahead of TCNN
bull Most accurate SSAT
bull Most robust C-COT and MLDF
Overlap curves
9742
(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple
C-COT
TCNNSSAT
MLDF
AR-raw plot
Acc
ura
cy
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Tracking speed
bull Top-performers slowest bull Plausible cause CNN
bull Real-time bound Staple+
bull Decent accuracy
bull Decent robustness
Note the speed in some Matlab trackers has been significantly underestimated by the toolkit since it was measuring also the Matlabrestart time The EFOs of Matlab trackers are in fact higher than stated in this figure
9842
C-COT TCNNSSAT MLDF
Staple+
Staple+
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT public resources
bull Resources publicly available VOT page
bull Raw results of all tested trackers
bull Relevant methodology papers
bull 2016 Submitted trackers codebinaries
bull All fully annotated datasets (2013-2016)
bull Documentation tutorials forum
9939
Summary
bull ldquoVisual Trackingrdquo may refer to quite different problems
bull The area is just starting to be affected by CNNs
bull Robustness at all levels is the road to reliable performance
bull Key components of trackers
ndash target learning (modelling ldquotemplate updaterdquo)
ndash integration of detection and temporal smoothness assumptions
ndash representation of the image and target
bull Be careful when evaluating tracking results
100150
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague
established 1707
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Formulation (1) Tracking
Establishing point-to-point correspondences in consecutive frames of an image sequence
Notes
• The concept of an “object” in the F&P definition disappeared
• If an algorithm correctly established such correspondences, would that be a perfect tracker?
• tracking = motion estimation
Slide credit: Jiri Matas
Tracking is Motion Estimation / Optic Flow
Slide credit: Jiri Matas
Motion “pattern”, camera tracking
Dense motion field
http://www.cs.cmu.edu/~saada/Projects/CrowdSegmentation
http://www.youtube.com/watch?v=ckVQrwYIjAs
Sparse motion field estimate
Motion field
• The motion field is the projection of the 3D scene motion into the image
Slide credits: James Hays, Jason Corso
Optic Flow
Standard formulation
• At every pixel, a 2D displacement is estimated between consecutive frames
Missing
• occlusion – disocclusion handling: pixels visible in one image only
- in the standard formulation, “don’t know” is not an answer
• considering the 3D nature of the world
• large displacement handling - only recently addressed (EpicFlow 2015)
Practical issues hindering progress in optic flow
• is the ground truth ever known?
- learning and performance evaluation problematic (synthetic sequences?)
• requires generic regularization (smoothing)
• failure (assumption validity) not easy to detect
In certain applications, tracking is motion estimation on a part of the image with specific constraints: augmented reality, sports analysis
Formulation (1) Tracking
Consider the Bolt sequence
Definition (2) Tracking
Given an initial estimate of its position, locate X in a sequence of images
Where X may mean:
• A (rectangular) region
• An “interest point” and its neighbourhood
• An “object”
This definition is adopted e.g. in a recent book by Maggio and Cavallaro, Video Tracking, 2011
Smeulders, T-PAMI’13:
Tracking is the analysis of video sequences for the purpose of establishing the location of the target over a sequence of frames (time), starting from the bounding box given in the first frame
Formulation (3) Tracking as Segmentation
J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010
Tracking as model-based segmentation
Tracking as segmentation
http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
http://www2.imm.dtu.dk/~aam/tracking
• heart
A “standard” CV tracking method output
Approximate motion estimation, approximate segmentation. Neither good optic flow nor precise segmentation is required.
Formulation (4) Tracking
Given an initial estimate of the pose and state of X:
In all images in a sequence (in a causal manner)
1. estimate the pose and state of X
2. (optionally) update the model of X
• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
– a priori information (appearance, shape, dynamics, …)
– labeled data (“track this”) + unlabeled data = the sequences
• Causal: for estimation at T, use information from time t ≤ T
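The two steps of Formulation (4) can be sketched as a causal loop. This is only a toy illustration: frames are 1-D lists, the pose is an index, the model is a single intensity value, and the names run_tracker and nearest_match are hypothetical, not taken from any library.

```python
def run_tracker(frames, init_pose, model, estimate, update=None):
    """Causal loop: the estimate for frame t only sees frames 0..t."""
    pose = init_pose
    trajectory = [pose]
    for frame in frames[1:]:
        pose = estimate(frame, pose, model)      # step 1: estimate pose/state
        if update is not None:
            model = update(model, frame, pose)   # step 2: optional model update
        trajectory.append(pose)
    return trajectory


def nearest_match(frame, prev_pose, model, radius=2):
    """Toy estimator: best-matching pixel within `radius` of the previous pose."""
    lo = max(0, prev_pose - radius)
    hi = min(len(frame), prev_pose + radius + 1)
    return min(range(lo, hi), key=lambda i: abs(frame[i] - model))


# A bright "target" (value 9) moving one pixel to the right per frame.
frames = [
    [0, 9, 0, 0, 0, 0],
    [0, 0, 9, 0, 0, 0],
    [0, 0, 0, 9, 0, 0],
]
trajectory = run_tracker(frames, init_pose=1, model=9.0, estimate=nearest_match)
print(trajectory)
```

Passing an update callable turns the fixed template into an adaptive one, which is where the semi-supervised learning problem mentioned above comes in.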
Tracking for Black Mirror Blocking
Tracking in 6D
Tracking-Learning-Detection (TLD)
A “miracle”: Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof, On-line boosting and vision, CVPR 2006
Tracking the “Invisible”
H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010
Other Tracking Problems
… multiple object tracking …
Multi-object Tracking
Other Tracking Problems
Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg
splitting and merging events …
So I want to track
Just link to some computer vision lib
import cv2
or
#include <opencv2/core.hpp>
or
import org.opencv.core.Core;
import org.opencv.core.Mat;
What is here in libraries: OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib
What is here in libraries: BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV
Computer vision library
Bad news: OpenCV
http://www.votchallenge.net
Reference implementation
But authors often publish their own implementation on GitHub…
git clone https://github.com/martin-danelljan/Continuous-ConvOp.git
Good news
Good news? Not so
Good news: easy to contribute to open source
Just implement some modern tracking algorithm
So we need to understand how tracking works
Classic: KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [54]
• Free implementation(s) available¹
• After more than two decades, a project² at CMU dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms
¹ http://www.ces.clemson.edu/~stb/klt
² http://www.ri.cmu.edu/projects/project_515.html
Image alignment
Goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.
Original Lucas-Kanade algorithm I-IV
Original Lucas-Kanade summary
Slide credit: Tomas Svoboda
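The formulas from the Lucas-Kanade slides did not survive in this text, so the following is only a minimal sketch of the idea for the simplest warp, a pure translation p. It assumes the usual sum-of-squared-differences objective, sum over x of [I(x + p) - T(x)]^2, and Gauss-Newton updates. To stay dependency-free the "images" are Python functions of continuous coordinates and gradients are central differences; gaussian_blob and lucas_kanade_translation are illustrative names.

```python
import math


def gaussian_blob(cx, cy, sigma=3.0):
    """Synthetic image: a smooth blob centred at (cx, cy)."""
    def image(x, y):
        return math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
    return image


def lucas_kanade_translation(template, image, xs, ys, iterations=30, h=0.5):
    """Estimate p = (px, py) such that image(x + p) matches template(x)."""
    px, py = 0.0, 0.0
    for _ in range(iterations):
        hxx = hxy = hyy = bx = by = 0.0
        for y in ys:
            for x in xs:
                wx, wy = x + px, y + py                      # warped coordinates
                gx = (image(wx + h, wy) - image(wx - h, wy)) / (2 * h)
                gy = (image(wx, wy + h) - image(wx, wy - h)) / (2 * h)
                err = template(x, y) - image(wx, wy)         # residual
                hxx += gx * gx
                hxy += gx * gy
                hyy += gy * gy
                bx += gx * err
                by += gy * err
        det = hxx * hyy - hxy * hxy
        if abs(det) < 1e-12:      # ill-conditioned patch: no reliable update
            break
        px += (hyy * bx - hxy * by) / det   # solve the 2x2 normal equations
        py += (hxx * by - hxy * bx) / det
    return px, py


template = gaussian_blob(10.0, 10.0)
shifted = gaussian_blob(11.5, 9.2)          # same blob moved by (1.5, -0.8)
grid = range(4, 17)
px, py = lucas_kanade_translation(template, shifted, grid, grid)
print(round(px, 3), round(py, 3))
```

The 2x2 matrix accumulated in hxx, hxy, hyy is the one whose conditioning decides whether a patch can be tracked at all.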
KLT Tracker
For good conditioning, the patch must be textured/structured enough
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]
Slide credit: Patrick Perez
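One common way to make "textured enough" concrete is the smaller eigenvalue of the 2x2 gradient matrix summed over the patch. The sketch below assumes that criterion; min_eigenvalue is an illustrative helper, and the three 6x6 patches are made up to mirror the three bullets.

```python
import math


def min_eigenvalue(patch):
    """Smaller eigenvalue of the gradient (structure) matrix of a patch."""
    sxx = sxy = syy = 0.0
    for r in range(1, len(patch) - 1):
        for c in range(1, len(patch[0]) - 1):
            gx = (patch[r][c + 1] - patch[r][c - 1]) / 2.0   # horizontal gradient
            gy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0   # vertical gradient
            sxx += gx * gx
            sxy += gx * gy
            syy += gy * gy
    half_trace = (sxx + syy) / 2.0
    root = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    return half_trace - root


uniform = [[5.0] * 6 for _ in range(6)]                      # no information
edge = [[0.0] * 3 + [1.0] * 3 for _ in range(6)]             # aperture problem
corner = [[1.0 if r >= 3 and c >= 3 else 0.0 for c in range(6)]
          for r in range(6)]                                 # well conditioned

for name, patch in [("uniform", uniform), ("edge", edge), ("corner", corner)]:
    print(name, min_eigenvalue(patch))
```

Only the corner patch has two non-zero eigenvalues, so only there is the 2D displacement fully determined.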
Aperture Problem Example
Image from Gary Bradski slides
If we look through a small hole…
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
– Arbitrary displacement
• Multi-resolution approach: Gauss-Newton like approximation down image pyramid
Slide credit: James Hays
Monitoring quality
– Translation is usually sufficient for small fragments, but
• Perspective transforms and occlusions cause drift and loss
– Two complementary options
• Kill tracklets when minimum SSD too large
• Compare as well with initial patch under affine transform (warp) assumption
Slide credit: Patrick Perez
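The first option can be sketched as a simple threshold test. The threshold value, the dictionary layout and the names ssd and surviving_tracklets are assumptions for illustration; the affine warp of the initial patch mentioned in the second option is left out.

```python
def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(patch_a, patch_b)
               for a, b in zip(row_a, row_b))


def surviving_tracklets(tracklets, max_ssd):
    """Keep only tracklets whose current patch still matches the reference."""
    return {name for name, (reference, current) in tracklets.items()
            if ssd(reference, current) <= max_ssd}


reference = [[1, 2], [3, 4]]
tracklets = {
    "good": (reference, [[1, 2], [3, 5]]),       # SSD = 1
    "drifted": (reference, [[9, 9], [9, 9]]),    # SSD = 174
}
alive = surviving_tracklets(tracklets, max_ssd=4.0)
print(alive)
```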
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties
– fast
– easily extended to image-to-image transformations with multiple parameters
What about deep learning or CNN for tracking?
AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016
Recap: CaffeNet architecture
Recent trackers development
https://github.com/foolwood/benchmark_results
CNN, KCF
Discriminative Tracking (T. by Detection)
t = 0: samples with labels +1 +1 +1 −1 −1 −1 → Classifier
t > 0: … classify subwindows to find target
Connection to Correlation
The Convolution Theorem
“Cross-correlation is equivalent to an element-wise product in Fourier domain”
$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$
• where
– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
– $\times$ is element-wise product
– $^{*}$ is complex-conjugate (i.e. negate imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
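The statement can be checked numerically on a short 1-D signal. This sketch uses a naive O(n^2) DFT to stay within the standard library, and it assumes the correlation convention y[s] = sum over m of x[m] w[(m + s) mod n] for real x; the function names are illustrative.

```python
import cmath


def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a short sequence."""
    n = len(v)
    sign = 1.0 if inverse else -1.0
    out = [sum(v[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
               for m in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out


def cyclic_cross_correlation(x, w):
    """Direct evaluation: y[s] = sum_m x[m] * w[(m + s) mod n]."""
    n = len(x)
    return [sum(x[m] * w[(m + s) % n] for m in range(n)) for s in range(n)]


def cross_correlation_via_dft(x, w):
    """Same result through conj(x_hat) * w_hat followed by the inverse DFT."""
    y_hat = [a.conjugate() * b for a, b in zip(dft(x), dft(w))]
    return [c.real for c in dft(y_hat, inverse=True)]


x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.0, 0.0, 1.0]
w = [0.0, 1.0, 0.0, 2.0, 0.0, 0.0, 1.0, 0.0]
direct = cyclic_cross_correlation(x, w)
via_dft = cross_correlation_via_dft(x, w)
print(max(abs(a - b) for a, b in zip(direct, via_dft)))
```

With a real FFT in place of the naive DFT, the Fourier route costs O(n log n) instead of O(n^2), which is the whole point of the correlation-filter trackers that follow.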
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows replacing standard operations with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression:
$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$
with $K$ the kernel matrix, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and dual space representation
• For many kernels, circulant data ⇒ circulant $K$ matrix:
$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$
⇒ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$
• For the Gaussian kernel it yields
$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$
multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}}\right)$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features
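The linear-interpolation update mentioned above amounts to a convex combination of the old and the newly estimated quantity. A minimal sketch, with an assumed interpolation factor called rate (the slides do not give a value):

```python
def interpolate(old, new, rate):
    """Element-wise (1 - rate) * old + rate * new."""
    return [(1.0 - rate) * o + rate * n for o, n in zip(old, new)]


model_x = [1.0, 2.0, 3.0]      # current appearance model
new_x = [3.0, 2.0, 1.0]        # features extracted at the new target location
updated = interpolate(model_x, new_x, rate=0.25)
print(updated)
```

The same function can be applied to the classifier coefficients; a small rate keeps the model stable, a large one adapts quickly but drifts more easily.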
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• Use HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)
Training and detection (Matlab)
function alphaf = train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);   % ridge regression, solved per frequency
end
function y = detect(alphaf, x, z, sigma)
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));       % response map over all cyclic shifts
end
function k = kernel_correlation(x1, x2, sigma)
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));   % sum over channel dimension
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end
Sum over channel dimension in kernel computation
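To see training and detection at work without Matlab, here is a 1-D, single-channel analogue in plain Python. It is a sketch rather than the authors' implementation: the DFT is a naive O(n^2) loop, and the signal, sigma, lam and the label width are made-up values. It keeps the argument order kernel_correlation(z, x) from the slide, under which a target shifted by d samples produces a response peak at index (n - d) mod n.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive DFT / inverse DFT, adequate for short demo signals."""
    n = len(v)
    sign = 1.0 if inverse else -1.0
    out = [sum(v[m] * cmath.exp(sign * 2j * cmath.pi * k * m / n)
               for m in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel evaluated for all cyclic shifts at once."""
    n = len(x1)
    prod = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    c = [v.real for v in dft(prod, inverse=True)]
    norms = sum(a * a for a in x1) + sum(b * b for b in x2)
    return [math.exp(-abs(norms - 2.0 * ci) / (sigma ** 2 * n)) for ci in c]


def train(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat_xx + lambda)."""
    kf = dft(kernel_correlation(x, x, sigma))
    return [yf / (k + lam) for yf, k in zip(dft(y), kf)]


def detect(alphaf, x, z, sigma):
    """Response for every cyclic shift of the test signal z."""
    kf = dft(kernel_correlation(z, x, sigma))
    resp = dft([a * k for a, k in zip(alphaf, kf)], inverse=True)
    return [v.real for v in resp]


n = 16
x = [0, 0, 1, 3, 7, 3, 1, 0, 0, 2, 0, 0, 0, 1, 0, 0]        # training signal
y = [math.exp(-0.5 * min(s, n - s) ** 2) for s in range(n)]  # label peaked at 0
alphaf = train(x, y, sigma=1.0, lam=1e-3)

shift = 3
z = [x[(m - shift) % n] for m in range(n)]                   # target moved by 3
response = detect(alphaf, x, z, sigma=1.0)
peak = max(range(n), key=lambda s: response[s])
print(peak)
```

Swapping the two arguments of kernel_correlation in detect would move the peak to index d instead; either convention works as long as the peak is interpreted consistently.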
From KCF to Discriminative CF trackers
Basic
• Henriques et al. – CSK
– raw grayscale pixel values as features
• Henriques et al. – KCF
– HoG multi-channel features
Further work
• Danelljan et al. – DSST
– PCA-HoG + grayscale pixels features
– filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
– HoG, color-naming and grayscale pixels features
– quantize scale space and normalize each scale to one size by bilinear interpolation → only one filter on normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. – SRDCF
– spatial regularization in the learning process
→ limits boundary effect
→ penalizes filter coefficients depending on their spatial location
– allows use of a much larger search region
– more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
– features: VGG-Net pretrained on ImageNet dataset, extracted from third, fourth and fifth convolution layer
– for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. – MDNet, T-CNN
– CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? “The zero-order tracker”
VOT community evolution
ICCV2013: 51 coauthors, 14 pgs
ECCV2014: 57 coauthors, 27 pgs
ICCV2015: 128 coauthors, 24 pgs
ECCV2016: 141 coauthors, 44 pgs
+ VOT-TIR paper (69 coauth.)
+ VOT-TIR paper (70 coauth.)
Figure: timeline with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
Challenge | Perf. measures | Dataset size | Target box | Property | Trackers tested
VOT2013 | ranks, A, R | 16, s. manual | manual | per frame | 27
VOT2014 | ranks, A, R, EFO | 25, s. manual | manual | per frame | 38
VOT2015 | EAO, A, R, EFO | 60, fully auto | manual | per frame | 62 VOT, 24 VOT-TIR
VOT2016 | EAO, A, R, EFO | 60, fully auto | auto | per frame | 70 VOT, 24 VOT-TIR
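For orientation, A and R in the table stand for accuracy and robustness. The sketch below assumes the usual reading: accuracy as the average bounding-box overlap while the target is tracked, robustness as the number of failures (zero overlap). It is a simplification that ignores the re-initialisation performed by the VOT toolkit after a failure; the function names are illustrative.

```python
def overlap(box_a, box_b):
    """Intersection-over-union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Average overlap over tracked frames, and count of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
print(accuracy, failures)
```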
Class of trackers tested
• Single-object, single-camera
• Short-term
– Trackers performing without re-detection
• Causality
– Tracker is not allowed to use any future frames
• No prior knowledge about the target
– Only a single training example – BBox in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
ALOV (315 seq.) [Smeulders et al., 2013]
+ PTR (~50 seq.) [Vojir et al., 2013]
+ OTB (~50 seq.) [Wu et al., 2013]
+ >30 new sequences from the VOT2015 committee
= 443 sequences
Filtered out:
• Grayscale sequences
• <400 pixels targets
• Poorly-defined targets
• Artificially created sequences
(Examples shown: a poorly defined target; an artificially created sequence)
356 sequences remain
VOT Automatic Dataset Construction Protocol: cluster + sample
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences
Construction (3/3): Sampling
• Requirement
• Diverse visual attributes
• Challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
1. Segment the target (semi-automatic)
2. Automatically fit a bounding box by optimizing a cost function
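The slide does not spell out the cost function, so the following only illustrates step 2 under an assumed cost: a weighted sum of target pixels left outside the box and background pixels enclosed by it, minimised by brute force over axis-aligned boxes. The weight alpha, the toy mask and the name fit_box are hypothetical.

```python
def fit_box(mask, alpha=0.5):
    """Return (row0, row1, col0, col1), inclusive, minimising the assumed cost."""
    rows, cols = len(mask), len(mask[0])
    total_fg = sum(sum(row) for row in mask)
    best_cost, best_box = None, None
    for r0 in range(rows):
        for r1 in range(r0, rows):
            for c0 in range(cols):
                for c1 in range(c0, cols):
                    fg_inside = sum(mask[r][c]
                                    for r in range(r0, r1 + 1)
                                    for c in range(c0, c1 + 1))
                    area = (r1 - r0 + 1) * (c1 - c0 + 1)
                    cost = (alpha * (total_fg - fg_inside)        # target outside
                            + (1 - alpha) * (area - fg_inside))   # background inside
                    if best_cost is None or cost < best_cost:
                        best_cost, best_box = cost, (r0, r1, c0, c1)
    return best_box


# 6x7 segmentation mask: a 3x4 target plus one stray foreground pixel.
mask = [[0] * 7 for _ in range(6)]
for r in range(1, 4):
    for c in range(2, 6):
        mask[r][c] = 1
mask[5][0] = 1
box = fit_box(mask)
print(box)
```

The stray pixel is left outside because enclosing it would pull far more background into the box than it saves, which is the kind of trade-off such a cost is meant to express.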
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences:
Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences:
Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Tracking is Motion Estimation Optic Flow
13150Slide credit Jiri Matas
Tracking is Motion Estimation Optic Flow
14150
Tracking is Motion Estimation Optic Flow
Motion ldquopatternrdquo Camera tracking
Dense motion field
httpwwwcscmuedu~saadaProjectsCrowdSegmentation
httpwwwyoutubecomwatchv=ckVQrwYIjAs
Sparse motion field estimate
15150Slide credit Jiri Matas
Tracking is Motion Estimation Optic Flow
16150
Motion field
bull The motion field is the projection of the 3D scene
motion into the image
Slide credit James Hays
Tracking is Motion Estimation Optic Flow
17150Slide credit Jason Corso
Optic Flow
Standard formulation
bull At every pixel 2D displacement is estimated between consecutive frames
Missing
bull occlusion ndash disocclusion handling pixels visible in one image only
- in the standard formulation ldquodonrsquot knowrdquo is not an answer
bull considering the 3D nature of the world
bull large displacement handling - only recently addressed (EpicFlow 2015)
Practical issues hindering progress in optic flow
bull is the ground truth ever known
- learning and performance evaluation problematic (synthetic sequences )
bull requires generic regularization (smoothing)
bull failure (assumption validity) not easy to detect
In certain applications tracking is motion estimation on a part of the image
with specific constraints augmented reality sports analysis 18150
Formulation (1) Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes
bull The concept of an ldquoobjectrdquo in FampP definition disappeared
bull If an algorithm correctly established such correspondences
would that be a perfect tracker
bull tracking = motion estimation
Consider the Bolt sequence
19150
Definition (2) Tracking
Given an initial estimate of its position
locate X in a sequence of images
Where X may mean
bull A (rectangular) region
bull An ldquointerest pointrdquo and its neighbourhood
bull An ldquoobjectrdquo
This definition is adopted eg in a recent book by
Maggio and Cavallaro Video Tracking 2011
Smeulders T-PAMI13
Tracking is the analysis of video sequences for the
purpose of establishing the location of the target
over a sequence of frames (time) starting from
the bounding box given in the first frame
20150
Formulation (3) Tracking as Segmentation
J Fan et al Closed-Loop Adaptation for Robust Tracking ECCV 2010
21150
Tracking as model-based segmentation
22150
Tracking as segmentation
httpvisionucsdedu~kbransonresearchcvpr2005html
httpwww2immdtudk~aamtracking
bull heart
23150
A ldquostandardrdquo CV tracking method output
24150
Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required
Formulation (4) Tracking
Given an initial estimate of the pose and state of X
In all images in a sequence (in a causal manner)
1 estimate the pose and state of X
2 (optionally) update the model of X
bull Pose any geometric parameter (position scale hellip)
bull State appearance shapesegmentation visibility articulations
bull Model update essentially a semi-supervised learning problem
ndash a priori information (appearance shape dynamics hellip)
ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences
bull Causal for estimation at T use information from time t middot T
25150
Tracking for Black Mirror Blocking
26150
Tracking in 6D
27150
Tracking-Learning-Detection (TLD)
28150
A ldquomiraclerdquo Tracking a Transparent Object
video credit
Helmut
Grabner
H Grabner H Bischof On-line boosting and vision CVPR 2006
29150
Tracking the ldquoInvisiblerdquo
H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010
30150
video
Other Tracking Problems
helliphellip multiple object tracking hellip
32150
Multi-object Tracking
Other Tracking Problems
Cell division
httpwwwyoutubecomwatchv=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
httpwwwyoutubecomwatchv=YFKA647w4Jg
splitting and merging events hellip
34150
So I want to track
35150
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
What is here in libraries OpenCV
bull KLT tracker (1981)
bull CAMshift and Meanshift (1998)
bull TLD (2011)
bull MedianFLow (2010)
bull Boosting (2006)
bull MIL (2009)
and KCF (2012) in opencv_contrib38150
What is here in libraries BoofCV
bull SparseFlow (KLT tracker) (1991)
bull MeanShift (1998)
bull TLD (2011)
bull KCF (2012)
39150
httpsgithubcomlessthanoptimalBoofCV
Computer vision librar
Bad news OpenCV
40150httpwwwvotchallengenet
Reference implementation
41150
But authors often publish their own implementation on githubhellip
git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit
Good news
42150
Good news Not so
43150
Good news Not so
44150
Good newsEasy to contribute to open source
Just implement some modern tracking algorithm
45150
So we need to understand how tracking works
Classic
KLT tracker
46150
The KLT Tracker
47150Slide credit Kris Kitani
KLT Tracker
slide creditTomas Svoboda
48150
Importance in Computer Vision
bull Firstly published in 1981 as an image registration method [3]
bull Improved many times most importantly by Carlo Tomasi [54]
bull Free implementaton(s) available1
bull After more than two decades a project2 at CMU dedicated to this
bull single algorithm and results published in a premium journal [1]
bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html
Image alignment
slide creditTomas Svoboda 49150
Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image
Original Lucas-Kanade algorithm I
slide creditTomas Svoboda 50150
Original Lucas-Kanade algorithm II
slide creditTomas Svoboda 51150
Original Lucas-Kanade algorithm III
slide creditTomas Svoboda 52150
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
bull cost function sum of squared intensity differences
between template and window
bull optimization technique gradient descent
bull model learning no update last frame convex
combination
bull attractive properties
ndashfast
ndasheasily extended to image-to-image transformations with
multiple parameters
59150
What about deep
learning or
CNN for tracking
60150
61150
AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo
Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016
Recap CaffeNet architecture
63
Recent trackers development
74150httpsgithubcomfoolwoodbenchmark_results
CNN KCF
Discriminative Tracking (T by Detection)
t=0
samples
labels+1 +1 +1 minus1 minus1 minus1
Classifier
tgt0
hellipClassify subwindows to find target
75
Connection to Correlation
The Convolution Theorem
ldquoCross-correlation is equivalent to an
element-wise product in Fourier domainrdquo
bull where
ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858
(likewise for ො119857 and ෝ119856)
ndashtimes is element-wise product
ndash lowast is complex-conjugate (ie negate imaginary part)
119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺
bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)
76
Kernelized Correlation Filters
bull Circulant matrices are a very general tool which allows to replace
standard operations with fast Fourier operations
bull The same idea can by applied eg to the Kernel Ridge Regression
with K kernel matrix Kij = (xi xj) and dual space representation
bull For many kernels circulant data circulant K matrix
bull Diagonalizing with the DFT for learning the classifier yields
120630 = 119870 + 120582119868 minus1119858
119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)
ෝ120630 =ො119858
መ119844119857119857+ 120582
Fast solution in 119978 119899 log 119899 Typical kernel algorithms are
119978 1198992 or higher
rArr
77
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
bull very few
hyperparameters
bull code fits on one slide
of the presentation
bull Use HoG features
(32 channels)
bull ~300 FPS
bull Open-Source
(MatlabPythonJavaC)
function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)
end
function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))
end
function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))
end
Training and detection (Matlab)
Sum over channel dimensionin kernel computation
79
From KCF to Discriminative CF trackers
Basic
bull Henriques et al ndash CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows the use of a much larger search region
  - more discriminative against background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the
    third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? “The zero-order tracker”
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
[Figure: community growth across the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines; axis ticks at 1500 and 3000]
• ICCV2013: 51 coauthors, 14 pages
• ECCV2014: 57 coauthors, 27 pages
• ICCV2015: 128 coauthors, 24 pages
• ECCV2016: 141 coauthors, 44 pages
• plus two VOT-TIR papers (69 and 70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

         | Perf. measures   | Dataset size   | Target box | Property  | Trackers tested
VOT2013  | ranks, A, R      | 16, s. manual  | manual     | per frame | 27
VOT2014  | ranks, A, R, EFO | 25, s. manual  | manual     | per frame | 38
VOT2015  | EAO, A, R, EFO   | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016  | EAO, A, R, EFO   | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
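The table abbreviates the performance measures without defining them. Assuming A denotes accuracy measured as the overlap between the predicted and the ground-truth box (intersection over union), the per-frame quantity can be sketched as below. The `(x, y, w, h)` box layout and both function names are assumptions of this example, not part of the VOT toolkit.

```python
def overlap(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the intersection rectangle (zero if the boxes are disjoint).
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def mean_overlap(predicted, ground_truth):
    """Average per-frame overlap over a sequence of box pairs."""
    scores = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(scores) / len(scores) if scores else 0.0
```

Two identical boxes give an overlap of 1, disjoint boxes give 0, and a box shifted by half its width against its ground truth gives 1/3.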
Class of trackers tested
• Single-object, single-camera
• Short-term
  - trackers performing without re-detection
• Causality
  - the tracker is not allowed to use any future frames
• No prior knowledge about the target
  - only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
Candidate pool: 443 sequences
• ALOV (315 seq.) [Smeulders et al. 2013]
• PTR (~50 seq.) [Vojir et al. 2013]
• OTB (~50 seq.) [Wu et al. 2013]
• >30 new sequences from the VOT2015 committee
Filtered out:
• grayscale sequences
• targets smaller than 400 pixels
• poorly defined targets
• artificially created sequences
[Examples shown: a poorly defined target; an artificially created sequence]
Remaining: 356 sequences, passed to the VOT Automatic Dataset Construction Protocol (cluster + sample)
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Pipeline: feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences
Construction (3/3): Sampling
• Requirements
  - diverse visual attributes
  - challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that samples challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (accuracy axis), with C-COT, TCNN, SSAT and MLDF labelled]
VOT 2016 Tracking speed
• Top performers are the slowest
  - plausible cause: CNN
• Real-time bound: Staple+
  - decent accuracy
  - decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: tracking speed, with C-COT, TCNN, SSAT, MLDF and Staple+ labelled]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers
  - target learning (modelling, “template update”)
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University Prague
established 1707
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: a significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Tracking is Motion Estimation / Optic Flow
Motion “pattern”, camera tracking
Dense motion field
http://www.cs.cmu.edu/~saada/Projects/CrowdSegmentation
http://www.youtube.com/watch?v=ckVQrwYIjAs
Sparse motion field estimate
Slide credit: Jiri Matas
Motion field
• The motion field is the projection of the 3D scene
motion into the image
Slide credit: James Hays
Slide credit: Jason Corso
Optic Flow
Standard formulation
• At every pixel, a 2D displacement is estimated between consecutive frames
Missing
• occlusion / disocclusion handling: pixels visible in one image only
  - in the standard formulation, “don’t know” is not an answer
• considering the 3D nature of the world
• large displacement handling - only recently addressed (EpicFlow 2015)
Practical issues hindering progress in optic flow
• is the ground truth ever known?
  - learning and performance evaluation are problematic (synthetic sequences?)
• requires generic regularization (smoothing)
• failure (assumption validity) is not easy to detect
In certain applications, tracking is motion estimation on a part of the image
with specific constraints: augmented reality, sports analysis
Formulation (1): Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes
• The concept of an “object” in the F&P definition disappeared
• If an algorithm correctly established such correspondences,
would that be a perfect tracker?
• tracking = motion estimation
Consider the Bolt sequence
Definition (2): Tracking
Given an initial estimate of its position,
locate X in a sequence of images
where X may mean:
• a (rectangular) region
• an “interest point” and its neighbourhood
• an “object”
This definition is adopted e.g. in a recent book by
Maggio and Cavallaro, Video Tracking, 2011
Smeulders, T-PAMI13:
Tracking is the analysis of video sequences for the
purpose of establishing the location of the target
over a sequence of frames (time) starting from
the bounding box given in the first frame
Formulation (3): Tracking as Segmentation
J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010
Tracking as model-based segmentation
Tracking as segmentation
http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
http://www2.imm.dtu.dk/~aam/tracking
• heart
A “standard” CV tracking method output
Approximate motion estimation, approximate segmentation.
Neither good optic flow nor precise segmentation is required.
Formulation (4): Tracking
Given an initial estimate of the pose and state of X,
in all images in a sequence (in a causal manner):
1. estimate the pose and state of X
2. (optionally) update the model of X
• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
  - a priori information (appearance, shape, dynamics, …)
  - labeled data (“track this”) + unlabeled data = the sequences
• Causal: for estimation at T, use information from time t ≤ T
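Formulation (4) can be read as a loop over frames. The sketch below assumes two user-supplied callables, `estimate` and `update`; both names are hypothetical and stand in for whatever pose/state estimator and model update a concrete tracker uses. `nearest_match` is a toy estimator included only to make the example runnable.

```python
def track(frames, init_state, model, estimate, update=None):
    """Causal tracking: the estimate for frame T only sees frames with t <= T."""
    state = init_state
    trajectory = []
    for frame in frames:
        state = estimate(model, frame, state)        # 1. estimate pose and state of X
        if update is not None:
            model = update(model, frame, state)      # 2. (optionally) update the model of X
        trajectory.append(state)
    return trajectory, model

def nearest_match(model, frame, prev_pos):
    """Toy 1-D estimator: best intensity match within +/-2 samples of the last position."""
    lo, hi = max(0, prev_pos - 2), min(len(frame), prev_pos + 3)
    return min(range(lo, hi), key=lambda i: abs(frame[i] - model))

# Toy usage: a bright sample (value 9) moving one position to the right per frame.
frames = [[0, 9, 0, 0, 0], [0, 0, 9, 0, 0], [0, 0, 0, 9, 0]]
positions, _ = track(frames, init_state=1, model=9, estimate=nearest_match)
```

Because the loop never looks ahead, the same function works on a live stream as well as on a stored sequence.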
Tracking for Black Mirror Blocking
Tracking in 6D
Tracking-Learning-Detection (TLD)
A “miracle”: Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof: On-line boosting and vision, CVPR 2006
Tracking the “Invisible”
H. Grabner, J. Matas, L. Gool, P. Cattin: Tracking the invisible: learning where the object might be, CVPR 2010
Other Tracking Problems
… multiple object tracking …
Multi-object Tracking
Other Tracking Problems
Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg
splitting and merging events …
So I want to track
Just link to some computer vision lib
from cv2 import tracker
or
#include <opencv2/core.hpp>
or
import org.opencv.core.Core;
import org.opencv.core.Mat;
What is available in libraries: OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib
What is available in libraries: BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV
Computer vision librar
Bad news: OpenCV
http://www.votchallenge.net
Reference implementation
But authors often publish their own implementation on GitHub…
git clone https://github.com/martin-danelljan/Continuous-ConvOp.git
Good news
Good news? Not so
Good news: Easy to contribute to open source
Just implement some modern tracking algorithm
So we need to understand how tracking works
Classic
KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [54]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU dedicated to this
single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.
Original Lucas-Kanade algorithm I-IV
Original Lucas-Kanade summary
Slide credit: Tomas Svoboda
KLT Tracker
For good conditioning, the patch must be
textured/structured enough
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional
information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]
Slide credit: Patrick Perez
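The conditioning requirement can be checked numerically through the 2x2 matrix of summed gradient products over the patch: a uniform patch gives two near-zero eigenvalues, a straight edge gives one, and a corner or texture gives two large ones. A minimal NumPy sketch, assuming grayscale float patches and finite-difference gradients (the function name is hypothetical):

```python
import numpy as np

def structure_tensor_eigenvalues(patch):
    """Eigenvalues (ascending) of the 2x2 gradient matrix summed over the patch."""
    gy, gx = np.gradient(np.asarray(patch, dtype=float))   # derivatives along rows, columns
    g = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return np.linalg.eigvalsh(g)

uniform = np.ones((8, 8))                          # no gradient at all
edge = np.zeros((8, 8)); edge[:, 4:] = 1.0         # vertical step: gradient in x only
corner = np.zeros((8, 8)); corner[4:, 4:] = 1.0    # gradients in both directions
```

Thresholding the smaller eigenvalue is one common way to decide whether a patch is worth tracking; the threshold itself is application dependent.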
Aperture Problem Example
Image from Gary Bradski slides
If we look through a small hole…
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
- Arbitrary displacement
• Multi-resolution approach: Gauss-Newton like approximation down the image
pyramid
Slide credit: James Hays
Monitoring quality
- Translation is usually sufficient for small fragments, but:
• Perspective transforms and occlusions cause drift and loss
- Two complementary options
• Kill tracklets when the minimum SSD is too large
• Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
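The first option, dropping a tracklet once its best SSD becomes too large, could look like the following sketch. The threshold value and the per-pixel normalisation are assumptions chosen for the example, not values from the slides.

```python
import numpy as np

def ssd(template, window):
    """Sum of squared intensity differences between two equally sized patches."""
    diff = np.asarray(template, dtype=float) - np.asarray(window, dtype=float)
    return float(np.sum(diff * diff))

def keep_tracklet(template, window, max_ssd_per_pixel=0.01):
    """Return False when the residual is too large to trust the match."""
    n = np.asarray(template).size
    return ssd(template, window) / n <= max_ssd_per_pixel
```

The second option on the slide would call the same test against the initial patch after warping it with the estimated affine transform.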
Characteristics of KLT
• cost function: sum of squared intensity differences
between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex
combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with
    multiple parameters
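To make the cost function and its minimisation concrete, the sketch below estimates a pure translation between a template and a signal by iterating the linearised least-squares update (a Gauss-Newton style step, used here in place of plain gradient descent). It is a 1-D toy with linear interpolation, written under the assumption of smooth signals; it is not the full KLT tracker, and the function name is hypothetical.

```python
import numpy as np

def lk_translation_1d(template, signal, start, iters=20):
    """Estimate the shift p minimising sum_x (signal(x + p) - template(x))^2."""
    grid = np.arange(len(template), dtype=float)
    xs = np.arange(len(signal), dtype=float)
    p = float(start)
    for _ in range(iters):
        warped = np.interp(grid + p, xs, signal)   # window sampled at the current shift
        grad = np.gradient(warped)                 # spatial gradient of the warped window
        error = template - warped
        denom = float(np.sum(grad * grad))
        if denom < 1e-12:                          # flat window: no information, stop
            break
        p += float(np.sum(grad * error)) / denom   # least-squares update of the shift
    return p
```

On a flat signal the denominator vanishes and the estimate stays at its starting value, which is the 1-D counterpart of the "uniform patch: no information" case.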
What about deep learning or CNNs for tracking?
AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016
Recap: CaffeNet architecture
Recent trackers development
https://github.com/foolwood/benchmark_results
CNN, KCF
Discriminative Tracking (T. by Detection)
t = 0
samples
labels: +1 +1 +1 −1 −1 −1
Classifier
t > 0
… Classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
“Cross-correlation is equivalent to an
element-wise product in Fourier domain”
$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$
• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$
    (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
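The identity is easy to verify numerically. The sketch below compares a brute-force cyclic cross-correlation with the Fourier-domain product for 1-D real signals. It assumes NumPy's FFT conventions and the indexing y[m] = sum_n x[n] w[n + m]; the function names are hypothetical.

```python
import numpy as np

def cyclic_cross_correlation(x, w):
    """Brute force: y[m] = sum_n x[n] * w[(n + m) mod N], O(N^2)."""
    n = len(x)
    return np.array([sum(x[i] * w[(i + m) % n] for i in range(n)) for m in range(n)])

def cyclic_cross_correlation_fft(x, w):
    """Same result via the DFT: y_hat = conj(x_hat) * w_hat (element-wise), O(N log N)."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))
```

The wrap-around in the index `(i + m) % n` is the cyclic behaviour mentioned in the note: samples shifted past one edge re-enter from the other.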
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows replacing
standard operations with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression,
with kernel matrix $K$, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and the dual-space representation
$\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y}$
• For many kernels: circulant data ⇒ circulant $K$ matrix
$K = C(\mathbf{k}^{\mathbf{x}\mathbf{x}})$, where $\mathbf{k}^{\mathbf{x}\mathbf{x}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{x}\mathbf{x}} + \lambda}$
Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are
$\mathcal{O}(n^2)$ or higher.
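To see that the Fourier-domain formula agrees with the dense solution, one can build the circulant kernel matrix explicitly for a small 1-D signal and compare the two. The Gaussian kernel, its bandwidth and the row/column convention of `K` below are assumptions of this sketch; all function names are hypothetical.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    """kappa(a, b) = exp(-||a - b||^2 / sigma^2)."""
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

def dense_solution(x, y, sigma, lam):
    """alpha = (K + lam*I)^-1 y with K[i, j] = kappa(P^i x, P^j x); cubic cost."""
    n = len(x)
    shifts = [np.roll(x, i) for i in range(n)]
    K = np.array([[gaussian_kernel(shifts[i], shifts[j], sigma) for j in range(n)]
                  for i in range(n)])
    return np.linalg.solve(K + lam * np.eye(n), y)

def fourier_solution(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat + lam), using only the first row k of K."""
    n = len(x)
    # First row of K, computed naively here for clarity.
    k = np.array([gaussian_kernel(x, np.roll(x, i), sigma) for i in range(n)])
    return np.real(np.fft.ifft(np.fft.fft(y) / (np.fft.fft(k) + lam)))
```

The dense route stores and solves an n x n system, while the Fourier route touches only length-n vectors, which is where the speed-up quoted on the slide comes from.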
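Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{x}\mathbf{x}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$
• For the Gaussian kernel it yields
$\mathbf{k}^{\mathbf{x}\mathbf{x}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{x}\mathbf{z}})$
2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{x}\mathbf{z}} \odot \hat{\boldsymbol{\alpha}})$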
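The Gaussian kernel correlation can be checked against its definition, i.e. evaluating the kernel between x' and every cyclic shift of x. The sketch below assumes 1-D real signals, NumPy's FFT conventions and the unnormalised form of the formula; the function names are hypothetical.

```python
import numpy as np

def kernel_correlation_direct(x, xp, sigma):
    """k[i] = exp(-||xp - P^i x||^2 / sigma^2) for every cyclic shift i, O(n^2)."""
    return np.array([np.exp(-np.sum((xp - np.roll(x, i)) ** 2) / sigma ** 2)
                     for i in range(len(x))])

def kernel_correlation_fft(x, xp, sigma):
    """Same vector from one pair of FFTs, O(n log n)."""
    cross = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(xp)))
    d = np.sum(x ** 2) + np.sum(xp ** 2) - 2.0 * cross
    return np.exp(-np.maximum(d, 0.0) / sigma ** 2)   # clip tiny negative rounding errors
```

The only shift-dependent term in the squared distance is the cross-correlation, which is why a single inverse FFT yields the kernel value for all shifts at once.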
Tracking is Motion Estimation Optic Flow
Motion ldquopatternrdquo Camera tracking
Dense motion field
httpwwwcscmuedu~saadaProjectsCrowdSegmentation
httpwwwyoutubecomwatchv=ckVQrwYIjAs
Sparse motion field estimate
15150Slide credit Jiri Matas
Tracking is Motion Estimation Optic Flow
16150
Motion field
bull The motion field is the projection of the 3D scene
motion into the image
Slide credit James Hays
Tracking is Motion Estimation Optic Flow
17150Slide credit Jason Corso
Optic Flow
Standard formulation
bull At every pixel 2D displacement is estimated between consecutive frames
Missing
bull occlusion ndash disocclusion handling pixels visible in one image only
- in the standard formulation ldquodonrsquot knowrdquo is not an answer
bull considering the 3D nature of the world
bull large displacement handling - only recently addressed (EpicFlow 2015)
Practical issues hindering progress in optic flow
bull is the ground truth ever known
- learning and performance evaluation problematic (synthetic sequences )
bull requires generic regularization (smoothing)
bull failure (assumption validity) not easy to detect
In certain applications tracking is motion estimation on a part of the image
with specific constraints augmented reality sports analysis 18150
Formulation (1) Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes
bull The concept of an ldquoobjectrdquo in FampP definition disappeared
bull If an algorithm correctly established such correspondences
would that be a perfect tracker
bull tracking = motion estimation
Consider the Bolt sequence
19150
Definition (2) Tracking
Given an initial estimate of its position
locate X in a sequence of images
Where X may mean
bull A (rectangular) region
bull An ldquointerest pointrdquo and its neighbourhood
bull An ldquoobjectrdquo
This definition is adopted eg in a recent book by
Maggio and Cavallaro Video Tracking 2011
Smeulders T-PAMI13
Tracking is the analysis of video sequences for the
purpose of establishing the location of the target
over a sequence of frames (time) starting from
the bounding box given in the first frame
20150
Formulation (3) Tracking as Segmentation
J Fan et al Closed-Loop Adaptation for Robust Tracking ECCV 2010
21150
Tracking as model-based segmentation
22150
Tracking as segmentation
httpvisionucsdedu~kbransonresearchcvpr2005html
httpwww2immdtudk~aamtracking
bull heart
23150
A ldquostandardrdquo CV tracking method output
24150
Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required
Formulation (4) Tracking
Given an initial estimate of the pose and state of X
In all images in a sequence (in a causal manner)
1 estimate the pose and state of X
2 (optionally) update the model of X
bull Pose any geometric parameter (position scale hellip)
bull State appearance shapesegmentation visibility articulations
bull Model update essentially a semi-supervised learning problem
ndash a priori information (appearance shape dynamics hellip)
ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences
bull Causal for estimation at T use information from time t middot T
25150
Tracking for Black Mirror Blocking
26150
Tracking in 6D
27150
Tracking-Learning-Detection (TLD)
28150
A ldquomiraclerdquo Tracking a Transparent Object
video credit
Helmut
Grabner
H Grabner H Bischof On-line boosting and vision CVPR 2006
29150
Tracking the ldquoInvisiblerdquo
H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010
30150
video
Other Tracking Problems
helliphellip multiple object tracking hellip
32150
Multi-object Tracking
Other Tracking Problems
Cell division
httpwwwyoutubecomwatchv=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
httpwwwyoutubecomwatchv=YFKA647w4Jg
splitting and merging events hellip
34150
So I want to track
35150
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
What is here in libraries OpenCV
bull KLT tracker (1981)
bull CAMshift and Meanshift (1998)
bull TLD (2011)
bull MedianFLow (2010)
bull Boosting (2006)
bull MIL (2009)
and KCF (2012) in opencv_contrib38150
What is here in libraries BoofCV
bull SparseFlow (KLT tracker) (1991)
bull MeanShift (1998)
bull TLD (2011)
bull KCF (2012)
39150
httpsgithubcomlessthanoptimalBoofCV
Computer vision librar
Bad news OpenCV
40150httpwwwvotchallengenet
Reference implementation
41150
But authors often publish their own implementation on githubhellip
git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit
Good news
42150
Good news Not so
43150
Good news Not so
44150
Good newsEasy to contribute to open source
Just implement some modern tracking algorithm
45150
So we need to understand how tracking works
Classic
KLT tracker
46150
The KLT Tracker
47150Slide credit Kris Kitani
KLT Tracker
slide creditTomas Svoboda
48150
Importance in Computer Vision
bull Firstly published in 1981 as an image registration method [3]
bull Improved many times most importantly by Carlo Tomasi [54]
bull Free implementaton(s) available1
bull After more than two decades a project2 at CMU dedicated to this
bull single algorithm and results published in a premium journal [1]
bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html
Image alignment
slide creditTomas Svoboda 49150
Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image
Original Lucas-Kanade algorithm I
slide creditTomas Svoboda 50150
Original Lucas-Kanade algorithm II
slide creditTomas Svoboda 51150
Original Lucas-Kanade algorithm III
slide creditTomas Svoboda 52150
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
Image from Gary Bradski's slides
If we look through a small hole…
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
- Arbitrary displacement
• Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
- Translation is usually sufficient for small fragments, but:
• Perspective transforms and occlusions cause drift and loss
- Two complementary options:
• Kill tracklets when the minimum SSD is too large
• Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
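A minimal sketch of the first option, assuming patches are small lists of grey values and that the threshold `max_ssd` is tuned by hand (both are assumptions, as are the function names). The affine-warp comparison with the initial patch is not implemented here; the reference patch is compared as-is.

```python
def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(patch_a, patch_b)
               for a, b in zip(row_a, row_b))


def keep_tracklet(reference_patch, current_patch, max_ssd):
    """Return False (kill the tracklet) when the patch no longer matches the reference."""
    return ssd(reference_patch, current_patch) <= max_ssd


reference = [[10, 10], [10, 10]]
slightly_changed = [[11, 10], [10, 9]]   # small appearance change
occluded = [[200, 10], [10, 10]]         # one pixel covered by a bright occluder
```

With `max_ssd=100`, the slightly changed patch is kept and the occluded one is dropped.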
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
- fast
- easily extended to image-to-image transformations with multiple parameters
What about deep learning or CNNs for tracking?
AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012
CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016
Recap: CaffeNet architecture
Recent trackers development
https://github.com/foolwood/benchmark_results
CNN, KCF
Discriminative Tracking (Tracking by Detection)
t = 0: samples with labels +1 +1 +1 −1 −1 −1 are used to train the classifier
t > 0: … classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
"Cross-correlation is equivalent to an element-wise product in Fourier domain"
$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$
• where
- $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
- $\times$ is the element-wise product
- $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
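The identity can be verified on a small example. This sketch uses a naive O(n²) DFT from the standard library so that it stays self-contained; a real tracker would call an FFT routine. The function names and test vectors are hypothetical, and the cyclic cross-correlation is taken as y[s] = Σ_t x[t]·w[(t+s) mod n] for real-valued x.

```python
import cmath


def dft(v):
    """Naive discrete Fourier transform (O(n^2); for illustration only)."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft(v_hat):
    """Naive inverse discrete Fourier transform."""
    n = len(v_hat)
    return [sum(v_hat[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]


def cyclic_cross_correlation(x, w):
    """Direct evaluation: y[s] = sum_t x[t] * w[(t + s) mod n]."""
    n = len(x)
    return [sum(x[t] * w[(t + s) % n] for t in range(n)) for s in range(n)]


def fourier_cross_correlation(x, w):
    """Same result through the Fourier domain: y_hat = conj(x_hat) * w_hat."""
    y_hat = [xf.conjugate() * wf for xf, wf in zip(dft(x), dft(w))]
    return [value.real for value in idft(y_hat)]


x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]
w = [0.0, 1.0, 0.5, 0.0, 0.0, 2.0]
direct = cyclic_cross_correlation(x, w)
via_fourier = fourier_cross_correlation(x, w)
```

Both lists agree up to floating-point rounding, including the wrap-around terms.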
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual space representation:
$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$
• For many kernels, circulant data $\Rightarrow$ circulant $K$ matrix:
$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$
$\Rightarrow$ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
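A small numerical check of the Fourier-domain solution against the dual-space system it replaces. This is a sketch under assumptions: it uses a 1-D signal, a linear kernel instead of a Gaussian one, and a naive DFT in place of an FFT; the names `train`, `lam` and the toy data are hypothetical.

```python
import cmath


def dft(v):
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft(v_hat):
    n = len(v_hat)
    return [sum(v_hat[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]


def train(x, y, lam):
    """Return (k_xx, alpha), where alpha_hat = y_hat / (k_xx_hat + lam)."""
    n = len(x)
    # Linear-kernel auto-correlation: k_xx[i] = <x, x cyclically shifted by i>.
    k_xx = [sum(x[t] * x[(t + i) % n] for t in range(n)) for i in range(n)]
    alpha_hat = [yf / (kf + lam) for yf, kf in zip(dft(y), dft(k_xx))]
    return k_xx, [a.real for a in idft(alpha_hat)]


x = [0.6 ** t for t in range(8)]                  # toy training signal
y = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5]      # regression target peaked at index 0
lam = 0.01
k_xx, alpha = train(x, y, lam)

# Residual of the dual-space system (K + lam*I) alpha = y, with K[i][j] = k_xx[(j - i) mod n].
n = len(x)
residual = max(
    abs(sum(k_xx[(j - i) % n] * alpha[j] for j in range(n)) + lam * alpha[i] - y[i])
    for i in range(n)
)
```

The residual is at rounding level, so the element-wise division reproduces the matrix-inverse solution without ever forming the n×n matrix K.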
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$
• For the Gaussian kernel it yields
$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\dfrac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$
(multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')$ term)
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}}\right)$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features
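The evaluation step can be sketched end to end on a 1-D toy signal: train on a patch x, then evaluate on a cyclically shifted copy z and read the displacement off the response peak. Assumptions, stated loudly: a linear kernel replaces the Gaussian one, a naive DFT replaces the FFT, and all names and data are hypothetical. With the argument order used here, the peak of the response lands at index (n − shift) mod n, so the shift is recovered as (−peak) mod n.

```python
import cmath
import math


def dft(v):
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft(v_hat):
    n = len(v_hat)
    return [sum(v_hat[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]


def kernel_correlation(x1, x2):
    """Linear-kernel correlation: real part of F^-1(conj(x1_hat) * x2_hat)."""
    product = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    return [value.real for value in idft(product)]


def train(x, y, lam):
    """alpha_hat = y_hat / (k_xx_hat + lam)."""
    return [yf / (kf + lam) for yf, kf in zip(dft(y), dft(kernel_correlation(x, x)))]


def detect(alpha_hat, x, z):
    """Response over all cyclic shifts: F^-1(k_hat * alpha_hat)."""
    k_hat = dft(kernel_correlation(z, x))
    return [value.real for value in idft([a * k for a, k in zip(alpha_hat, k_hat)])]


n, shift = 16, 3
x = [0.6 ** t for t in range(n)]                              # model patch
y = [math.exp(-0.5 * min(t, n - t) ** 2) for t in range(n)]   # target peaked at index 0
z = [x[(t - shift) % n] for t in range(n)]                    # same patch moved by `shift`

alpha_hat = train(x, y, lam=1e-3)
response = detect(alpha_hat, x, z)
peak = max(range(n), key=lambda t: response[t])
estimated_shift = (-peak) % n
```

After locating the peak, a tracker following the slide would blend the new `alpha_hat` and patch into the stored ones by linear interpolation; that step is omitted here.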
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)
Training and detection (Matlab):
function alphaf = train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);
end
function y = detect(alphaf, x, z, sigma)
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));
end
function k = kernel_correlation(x1, x2, sigma)
  % sum over the channel dimension (3) in the kernel computation
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end
From KCF to Discriminative CF trackers
Basic:
• Henriques et al.: CSK
- raw grayscale pixel values as features
• Henriques et al.: KCF
- HoG multi-channel features
Further work:
• Danelljan et al.: DSST
- PCA-HoG + grayscale pixel features
- filters for translation and for scale (in the scale-space pyramid)
• Li et al.: SAMF
- HoG, color-naming and grayscale pixel features
- quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al.: SRDCF
- spatial regularization in the learning process
  → limits the boundary effect
  → penalizes filter coefficients depending on their spatial location
- allows a much larger search region to be used
- more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al.: Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
- features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
- for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al.: MDNet, T-CNN
- CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? "The zero-order tracker"
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
[Figure: growth of the VOT community over time (axis ticks at 1500 and 3000), with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked]
• ICCV2013: 51 coauthors, 14 pages
• ECCV2014: 57 coauthors, 27 pages
• ICCV2015: 128 coauthors, 24 pages
• ECCV2016: 141 coauthors, 44 pages
• plus two VOT-TIR papers (69 and 70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
Challenge | Perf. measures | Dataset size | Target box | Property | Trackers tested
VOT2013 | ranks, A, R | 16, s. manual | manual | per frame | 27
VOT2014 | ranks, A, R, EFO | 25, s. manual | manual | per frame | 38
VOT2015 | EAO, A, R, EFO | 60, fully auto | manual | per frame | 62 VOT, 24 VOT-TIR
VOT2016 | EAO, A, R, EFO | 60, fully auto | auto | per frame | 70 VOT, 24 VOT-TIR
Class of trackers tested
• Single-object, single-camera
• Short-term
- Trackers performing without re-detection
• Causality
- Tracker is not allowed to use any future frames
• No prior knowledge about the target
- Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
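Because the object state is a bounding box, box overlap is the natural building block for an accuracy measure such as the "A" column in the VOT table. The sketch below computes intersection-over-union for axis-aligned boxes; the (x, y, width, height) layout and the function name are assumptions, and the VOT toolkit's own conventions (for example rotated boxes) are not reproduced here.

```python
def overlap(box_a, box_b):
    """Intersection-over-union of two axis-aligned (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


ground_truth = (0.0, 0.0, 10.0, 10.0)
prediction = (5.0, 0.0, 10.0, 10.0)   # shifted right by half the box width
score = overlap(ground_truth, prediction)
```

Averaging this score over the frames where the tracker has not failed gives a per-sequence accuracy figure.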
Construction (1/3): Sequence candidates
ALOV (315 seq.) [Smeulders et al., 2013] + PTR (~50 seq.) [Vojir et al., 2013] + OTB (~50 seq.) [Wu et al., 2013] + >30 new sequences from the VOT2015 committee = 443 sequences
Filtered out:
• Grayscale sequences
• <400 pixels targets
• Poorly-defined targets
• Artificially created sequences
[Examples shown: a poorly defined target; an artificially created sequence]
356 sequences
VOT Automatic Dataset Construction Protocol: cluster + sample
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Feature encoding (11 dim.) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences
Construction (3/3): Sampling
• Requirement:
- Diverse visual attributes
- Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied FoT, ASMS and KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement:
1. Segment the target (semi-automatic)
2. Automatically fit a bounding box by optimizing a cost function
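The slide does not state the cost function, so the following is only a hypothetical stand-in that shows the shape of step 2: it penalises target pixels left outside the box (weighted by `alpha`) plus background pixels inside the box, and searches all axis-aligned boxes by brute force. The weight, the cost form and the function name are assumptions, not the VOT definition.

```python
def fit_box(mask, alpha=4.0):
    """Return (top, left, bottom, right) minimising a hypothetical trade-off cost.

    cost = alpha * (target pixels outside the box) + (background pixels inside the box)
    """
    rows, cols = len(mask), len(mask[0])
    total_target = sum(sum(row) for row in mask)
    best_cost, best_box = None, None
    for top in range(rows):
        for bottom in range(top, rows):
            for left in range(cols):
                for right in range(left, cols):
                    inside = sum(mask[r][c]
                                 for r in range(top, bottom + 1)
                                 for c in range(left, right + 1))
                    area = (bottom - top + 1) * (right - left + 1)
                    cost = alpha * (total_target - inside) + (area - inside)
                    if best_cost is None or cost < best_cost:
                        best_cost, best_box = cost, (top, left, bottom, right)
    return best_box


# 6x6 segmentation mask: a solid 3x3 target plus one stray pixel of segmentation noise.
mask = [[0] * 6 for _ in range(6)]
for r in range(1, 4):
    for c in range(1, 4):
        mask[r][c] = 1
mask[5][5] = 1
box = fit_box(mask)
```

With this weighting the fitted box hugs the 3x3 blob and ignores the stray pixel, whereas a plain tight box around all mask pixels would swallow fifteen background pixels.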
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (accuracy axis) with C-COT, TCNN, SSAT and MLDF labelled]
VOT 2016: Tracking speed
• Top performers are the slowest
- Plausible cause: CNN
• Real-time bound: Staple+
- Decent accuracy
- Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ labelled]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
- target learning (modelling, "template update")
- integration of detection and temporal smoothness assumptions
- representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague
established 1707
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: a significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Tracking is Motion Estimation / Optic Flow
Motion field
• The motion field is the projection of the 3D scene motion into the image
Slide credit: James Hays
Slide credit: Jason Corso
Optic Flow
Standard formulation:
• At every pixel, a 2D displacement is estimated between consecutive frames
Missing:
• occlusion/disocclusion handling: pixels visible in one image only
- in the standard formulation, "don't know" is not an answer
• considering the 3D nature of the world
• large displacement handling: only recently addressed (EpicFlow, 2015)
Practical issues hindering progress in optic flow:
• is the ground truth ever known?
- learning and performance evaluation are problematic (synthetic sequences?)
• requires generic regularization (smoothing)
• failure (assumption validity) is not easy to detect
In certain applications, tracking is motion estimation on a part of the image with specific constraints: augmented reality, sports analysis
Formulation (1): Tracking
Establishing point-to-point correspondences in consecutive frames of an image sequence
Notes:
• The concept of an "object" in the F&P definition disappeared
• If an algorithm correctly established such correspondences, would that be a perfect tracker?
• tracking = motion estimation
Consider the Bolt sequence
Definition (2): Tracking
Given an initial estimate of its position, locate X in a sequence of images
Where X may mean:
• A (rectangular) region
• An "interest point" and its neighbourhood
• An "object"
This definition is adopted e.g. in a recent book by Maggio and Cavallaro, Video Tracking, 2011
Smeulders, T-PAMI'13:
Tracking is the analysis of video sequences for the purpose of establishing the location of the target over a sequence of frames (time), starting from the bounding box given in the first frame
Formulation (3): Tracking as Segmentation
J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010
Tracking as model-based segmentation
Tracking as segmentation
http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
http://www2.imm.dtu.dk/~aam/tracking/
• heart
A "standard" CV tracking method output
Approximate motion estimation, approximate segmentation. Neither good optic flow nor precise segmentation is required.
Formulation (4): Tracking
Given an initial estimate of the pose and state of X, in all images in a sequence (in a causal manner):
1. estimate the pose and state of X
2. (optionally) update the model of X
• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
- a priori information (appearance, shape, dynamics, …)
- labeled data ("track this") + unlabeled data = the sequences
• Causal: for estimation at time T, use only information from times t ≤ T
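The two numbered steps map onto a simple loop. This sketch is a hypothetical skeleton: `estimate` and `update` are caller-supplied callables, and the toy estimator (brightest pixel near the previous position in a 1-D frame) only serves to make the loop runnable. Causality is enforced by construction, since each iteration sees only the current frame and what was carried over from earlier ones.

```python
def track(frames, initial_pose, estimate, update=None, model=None):
    """Causal tracking loop: estimate the pose per frame, optionally update the model."""
    poses = []
    pose = initial_pose
    for frame in frames:                          # frames arrive in temporal order
        pose = estimate(frame, pose, model)       # 1. estimate the pose and state of X
        if update is not None:
            model = update(model, frame, pose)    # 2. (optionally) update the model of X
        poses.append(pose)
    return poses


def brightest_nearby(frame, previous_pose, model, radius=2):
    """Toy estimator: index of the brightest pixel within `radius` of the last pose."""
    lo = max(0, previous_pose - radius)
    hi = min(len(frame), previous_pose + radius + 1)
    return max(range(lo, hi), key=lambda i: frame[i])


frames = [
    [0, 9, 0, 0, 0, 0],
    [0, 0, 9, 0, 0, 0],
    [0, 0, 0, 0, 9, 0],
]
poses = track(frames, initial_pose=0, estimate=brightest_nearby)
```

Swapping in a KLT or correlation-filter estimator changes only the two callables, not the loop.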
Tracking for Black Mirror Blocking
Tracking in 6D
Tracking-Learning-Detection (TLD)
A "miracle": Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof: On-line Boosting and Vision, CVPR 2006
Tracking the "Invisible"
H. Grabner, J. Matas, L. Van Gool, P. Cattin: Tracking the Invisible: Learning Where the Object Might Be, CVPR 2010
[video]
Other Tracking Problems
… multiple object tracking …
Multi-object Tracking
Other Tracking Problems
Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg
splitting and merging events …
So I want to track
Just link to some computer vision lib
from cv2 import tracker
or
#include <opencv2/core.hpp>
or
import org.opencv.core.Core;
import org.opencv.core.Mat;
What is available in libraries: OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib
What is available in libraries: BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
Tracking is Motion Estimation Optic Flow
17150Slide credit Jason Corso
Optic Flow
Standard formulation
bull At every pixel 2D displacement is estimated between consecutive frames
Missing
bull occlusion ndash disocclusion handling pixels visible in one image only
- in the standard formulation ldquodonrsquot knowrdquo is not an answer
bull considering the 3D nature of the world
bull large displacement handling - only recently addressed (EpicFlow 2015)
Practical issues hindering progress in optic flow
bull is the ground truth ever known
- learning and performance evaluation problematic (synthetic sequences )
bull requires generic regularization (smoothing)
bull failure (assumption validity) not easy to detect
In certain applications tracking is motion estimation on a part of the image
with specific constraints augmented reality sports analysis 18150
Formulation (1) Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes
bull The concept of an ldquoobjectrdquo in FampP definition disappeared
bull If an algorithm correctly established such correspondences
would that be a perfect tracker
bull tracking = motion estimation
Consider the Bolt sequence
19150
Definition (2): Tracking
Given an initial estimate of its position,
locate X in a sequence of images
Where X may mean:
• A (rectangular) region
• An “interest point” and its neighbourhood
• An “object”
This definition is adopted e.g. in a recent book by
Maggio and Cavallaro, Video Tracking, 2011
Smeulders, T-PAMI 2013:
Tracking is the analysis of video sequences for the
purpose of establishing the location of the target
over a sequence of frames (time) starting from
the bounding box given in the first frame
Formulation (3): Tracking as Segmentation
J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010
Tracking as model-based segmentation
Tracking as segmentation
http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
http://www2.imm.dtu.dk/~aam/tracking/
• heart
A “standard” CV tracking method output
Approximate motion estimation, approximate segmentation. Neither good optic flow nor precise segmentation is required.
Formulation (4): Tracking
Given an initial estimate of the pose and state of X:
In all images in a sequence (in a causal manner)
1. estimate the pose and state of X
2. (optionally) update the model of X
• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
– a priori information (appearance, shape, dynamics, …)
– labeled data (“track this”) + unlabeled data = the sequences
• Causal: for estimation at T, use information from time t ≤ T
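The two-step loop of Formulation (4) can be sketched on a toy 1-D "video". This is a minimal sketch, not any particular tracker: the names `estimate`, `update`, `track` and `make_frame`, the SSD matching, the search radius and the update rate are all assumptions made for illustration. The pose is a single position, the model is a template, and the optional model update is a convex combination with the newest patch.

```python
def ssd(a, b):
    """Sum of squared differences between two equally long sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def estimate(frame, model, prev_pos, radius=3):
    """Step 1: estimate the pose (here a 1-D position) near the previous one."""
    w = len(model)
    lo = max(0, prev_pos - radius)
    hi = min(len(frame) - w, prev_pos + radius)
    return min(range(lo, hi + 1), key=lambda p: ssd(frame[p:p + w], model))


def update(model, frame, pos, rate=0.1):
    """Step 2 (optional): blend the model with the newly located patch."""
    patch = frame[pos:pos + len(model)]
    return [(1 - rate) * m + rate * v for m, v in zip(model, patch)]


def track(frames, init_pos, width, do_update=True):
    """Causal loop: the estimate at time T only uses frames with t <= T."""
    frames = iter(frames)
    first = next(frames)
    model = first[init_pos:init_pos + width]   # "track this" labeled example
    pos = init_pos
    trajectory = [pos]
    for frame in frames:
        pos = estimate(frame, model, pos)
        if do_update:
            model = update(model, frame, pos)
        trajectory.append(pos)
    return trajectory


def make_frame(pos, length=20):
    """Toy frame: a small bright blob on a dark background."""
    frame = [0.0] * length
    frame[pos:pos + 3] = [1.0, 3.0, 1.0]
    return frame


frames = [make_frame(p) for p in [2, 3, 4, 5, 6]]
print(track(frames, init_pos=2, width=3))
```

Setting `do_update=False` corresponds to never updating the model, which avoids drift but cannot follow appearance changes.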
Tracking for Black Mirror Blocking
Tracking in 6D
Tracking-Learning-Detection (TLD)
A “miracle”: Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof: On-line boosting and vision, CVPR 2006
Tracking the “Invisible”
H. Grabner, J. Matas, L. Van Gool, P. Cattin: Tracking the invisible: learning where the object might be, CVPR 2010
Other Tracking Problems
… multiple object tracking …
Multi-object Tracking
Other Tracking Problems
Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg
splitting and merging events …
So I want to track
Just link to some computer vision lib
import cv2
or
#include <opencv2/core.hpp>
or
import org.opencv.core.Core;
import org.opencv.core.Mat;
What is there in libraries: OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
• and KCF (2012) in opencv_contrib
What is there in libraries: BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV
Computer vision librar
Bad news: OpenCV
http://www.votchallenge.net
Reference implementation
But authors often publish their own implementation on GitHub…
git clone https://github.com/martin-danelljan/Continuous-ConvOp.git
Good news
Good news? Not so
Good news: Easy to contribute to open source
Just implement some modern tracking algorithm
So we need to understand how tracking works
Classic
KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU dedicated to this
single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt/
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
Slide credit: Tomas Svoboda
Goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.
Original Lucas-Kanade algorithm I
Slide credit: Tomas Svoboda
Original Lucas-Kanade algorithm II
Slide credit: Tomas Svoboda
Original Lucas-Kanade algorithm III
Slide credit: Tomas Svoboda
Original Lucas-Kanade algorithm IV
Slide credit: Tomas Svoboda
Original Lucas-Kanade summary
Slide credit: Tomas Svoboda
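The alignment goal stated above can be illustrated for the simplest warp, a pure translation p. The following is a minimal sketch under stated assumptions, not the slides' derivation: it uses the forward-additive Gauss-Newton update, a synthetic analytic image (the hypothetical function `image`) so that sub-pixel sampling needs no interpolation, and finite-difference gradients. All function names and constants are illustrative.

```python
import numpy as np


def image(x, y):
    """Smooth synthetic 'image' that can be sampled at sub-pixel positions."""
    return np.sin(0.3 * x) + np.cos(0.25 * y) + 0.5 * np.sin(0.15 * (x + y))


def lucas_kanade_translation(template, sample, xs, ys, iters=30, tol=1e-8):
    """Find p minimizing sum over x of (sample(x + p) - template(x))^2."""
    p = np.zeros(2)
    for _ in range(iters):
        warped = sample(xs + p[0], ys + p[1])      # I(W(x; p))
        gy, gx = np.gradient(warped)               # axis 0 is y, axis 1 is x
        J = np.stack([gx.ravel(), gy.ravel()], axis=1)
        err = (template - warped).ravel()          # T(x) - I(W(x; p))
        H = J.T @ J                                # 2x2 Gauss-Newton Hessian
        dp = np.linalg.solve(H, J.T @ err)
        p += dp
        if np.linalg.norm(dp) < tol:
            break
    return p


# Pixel grid of a 21x21 patch and a template shifted by a known amount.
xs, ys = np.meshgrid(np.arange(21.0), np.arange(21.0))
true_shift = np.array([1.3, -0.7])
template = image(xs + true_shift[0], ys + true_shift[1])

p = lucas_kanade_translation(template, image, xs, ys)
print(p)
```

The 2x2 matrix `H` is the quantity whose conditioning decides whether a patch can be tracked at all.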
KLT Tracker
For good conditioning, the patch must be
textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional
information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]
Slide credit: Patrick Perez
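The three cases above can be checked numerically with the 2x2 matrix of summed gradient products (the structure tensor). This is a sketch under assumptions: the smaller eigenvalue is used as the "trackability" score, gradients are finite differences, and the synthetic patches and the name `min_eigenvalue` are illustrative.

```python
import numpy as np


def min_eigenvalue(patch):
    """Smaller eigenvalue of the 2x2 matrix of summed gradient products."""
    gy, gx = np.gradient(patch.astype(float))
    G = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return np.linalg.eigvalsh(G)[0]   # eigenvalues come in ascending order


n = 16
ys, xs = np.mgrid[0:n, 0:n]
uniform = np.full((n, n), 0.5)                             # no texture
edge = (xs >= n // 2).astype(float)                        # vertical contour
corner = ((xs >= n // 2) & (ys >= n // 2)).astype(float)   # corner

for name, patch in [("uniform", uniform), ("edge", edge), ("corner", corner)]:
    print(name, min_eigenvalue(patch))
```

Only the corner patch has two clearly non-zero eigenvalues; for the edge, motion along the contour is unobservable, which is the aperture problem.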
Aperture Problem Example
Image from Gary Bradski slides
If we look through a small hole…
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
– Arbitrary displacement
• Multi-resolution approach: Gauss-Newton like approximation down the image
pyramid
Slide credit: James Hays
Monitoring quality
– Translation is usually sufficient for small fragments, but
• Perspective transforms and occlusions cause drift and loss
– Two complementary options
• Kill tracklets when the minimum SSD is too large
• Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
Characteristics of KLT
• cost function: sum of squared intensity differences
between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex
combination
• attractive properties:
– fast
– easily extended to image-to-image transformations with
multiple parameters
What about deep learning or CNN for tracking?
Recap: CaffeNet architecture
AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012. CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014. Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016
Recent trackers development
https://github.com/foolwood/benchmark_results
CNN, KCF
Discriminative Tracking (T. by Detection)
t = 0: samples with labels +1 +1 +1 −1 −1 −1 → Classifier
t > 0: … Classify subwindows to find target
Connection to Correlation
The Convolution Theorem
“Cross-correlation is equivalent to an
element-wise product in Fourier domain”
$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$
• where
– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$
(likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
– $\times$ is element-wise product
– $*$ is complex-conjugate (i.e. negate imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
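The statement can be checked numerically. A minimal sketch, assuming the cyclic cross-correlation is defined as y[i] = Σ_j x[j] · w[(i + j) mod n] for real signals; the variable names and the random test signals are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal(n)
w = rng.standard_normal(n)

# Direct cyclic cross-correlation: the index wraps around at the signal edge.
direct = np.array([sum(x[j] * w[(i + j) % n] for j in range(n))
                   for i in range(n)])

# Same quantity via the DFT: conjugate one spectrum, multiply element-wise.
via_fft = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))

print(np.allclose(direct, via_fft))
```

The direct sum costs O(n²) operations, the FFT route O(n log n), which is what correlation-filter trackers exploit.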
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows replacing
standard operations with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression
with kernel matrix $K$, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and the dual space representation
$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$
• For many kernels: circulant data $\Rightarrow$ circulant $K$ matrix
$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$
Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are
$\mathcal{O}(n^2)$ or higher.
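The equivalence of the dense and the Fourier-domain solutions can be verified on a small 1-D example. This is a sketch under assumptions: a linear kernel κ(a, b) = a·b is used for simplicity (the slides allow more general kernels), training samples are all cyclic shifts of one base signal, and the names and the value of `lam` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, lam = 16, 0.1
x = rng.standard_normal(n)          # base sample
y = rng.standard_normal(n)          # regression targets, one per cyclic shift

# Circulant data: every row is a cyclic shift of the base sample.
X = np.stack([np.roll(x, i) for i in range(n)])
K = X @ X.T                         # linear-kernel matrix, circulant
kxx = K[0]                          # first row = kernel auto-correlation

# Dense kernel ridge regression: solve (K + lambda I) alpha = y.
alpha_dense = np.linalg.solve(K + lam * np.eye(n), y)

# Fourier-domain solution: element-wise division instead of a linear solve.
alpha_fft = np.real(np.fft.ifft(np.fft.fft(y) / (np.fft.fft(kxx) + lam)))

print(np.allclose(alpha_dense, alpha_fft))
```

Only the n-vector `kxx` is needed in the fast version; the n×n matrix `K` is built here solely for comparison.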
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$
$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$
• For the Gaussian kernel it yields
$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$
1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}}\right)$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the
location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel
features
– multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\mathcal{F}^{-1}$ term
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• Use HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)
Training and detection (Matlab)
function alphaf = train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);
end
function y = detect(alphaf, x, z, sigma)
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));
end
function k = kernel_correlation(x1, x2, sigma)
  % sum over the channel dimension (3) in the kernel computation
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end
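For readers without Matlab, the training and detection code on this slide can be ported to NumPy roughly as follows. This is a hedged port, not the authors' reference implementation: the Gaussian label map `gaussian_labels`, the random test image, and the values of `sigma`, `lam` and the shift are assumptions made for the demonstration.

```python
import numpy as np


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of x1 and x2 for all cyclic shifts."""
    x1, x2 = np.atleast_3d(x1), np.atleast_3d(x2)       # H x W x channels
    f1 = np.fft.fft2(x1, axes=(0, 1))
    f2 = np.fft.fft2(x2, axes=(0, 1))
    c = np.real(np.fft.ifft2(np.sum(np.conj(f1) * f2, axis=2)))
    d = np.sum(x1 * x1) + np.sum(x2 * x2) - 2.0 * c
    return np.exp(-np.abs(d) / (sigma ** 2 * d.size))


def train(x, y, sigma, lam):
    k = kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)


def detect(alphaf, x, z, sigma):
    k = kernel_correlation(z, x, sigma)
    return np.real(np.fft.ifft2(alphaf * np.fft.fft2(k)))


def gaussian_labels(shape, sigma_y=2.0):
    """Regression target peaked at the (wrapped) origin."""
    iy = np.fft.fftfreq(shape[0]) * shape[0]             # 0, 1, ..., -2, -1
    ix = np.fft.fftfreq(shape[1]) * shape[1]
    return np.exp(-(iy[:, None] ** 2 + ix[None, :] ** 2) / (2 * sigma_y ** 2))


rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))                        # model patch
alphaf = train(x, gaussian_labels(x.shape), sigma=0.5, lam=1e-4)

shift = (3, 5)
z = np.roll(x, shift, axis=(0, 1))                       # cyclically moved target
response = detect(alphaf, x, z, sigma=0.5)
peak = np.unravel_index(np.argmax(response), response.shape)

# With conj() applied to the first argument, as in the slide code, the peak
# sits at minus the cyclic shift, so negate it to read off the displacement.
estimated = tuple(int(-p % s) for p, s in zip(peak, response.shape))
print(estimated)
```

A real tracker would additionally apply a cosine window, extract features such as HoG, and interpolate the model over time, none of which is shown here.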
From KCF to Discriminative CF trackers
Basic
• Henriques et al. – CSK
– raw grayscale pixel values as features
• Henriques et al. – KCF
– HoG multi-channel features
Further work
• Danelljan et al. – DSST
– PCA-HoG + grayscale pixel features
– filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
– HoG, color-naming and grayscale pixel features
– quantize scale space and normalize each scale to one size by bilinear
interpolation → only one filter on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. – SRDCF
– spatial regularization in the learning process
→ limits boundary effect
→ penalizes filter coefficients depending on their spatial location
– allows the use of a much larger search region
– more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
– features: VGG-Net pretrained on the ImageNet dataset, extracted from
the third, fourth and fifth convolution layers
– for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. – MDNet, T-CNN
– CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? “The zero-order tracker”
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
[Figure: timeline plot with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked; axis ticks 1500 and 3000]
• ICCV2013: 51 coauthors, 14 pages
• ECCV2014: 57 coauthors, 27 pages
• ICCV2015: 128 coauthors, 24 pages
• ECCV2016: 141 coauthors, 44 pages
• + VOT-TIR paper (69 coauthors)
• + VOT-TIR paper (70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
Challenge | Perf. measures | Dataset size | Target box | Property | Trackers tested
VOT2013 | ranks, A, R | 16, s manual | manual | per frame | 27
VOT2014 | ranks, A, R, EFO | 25, s manual | manual | per frame | 38
VOT2015 | EAO, A, R, EFO | 60, fully auto | manual | per frame | 62 VOT, 24 VOT-TIR
VOT2016 | EAO, A, R, EFO | 60, fully auto | auto | per frame | 70 VOT, 24 VOT-TIR
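The A and R columns can be made concrete with a small sketch. It rests on assumptions: A is taken to be the average bounding-box overlap (intersection over union) on frames where tracking succeeded, and R is taken to be the number of frames where the overlap drops to zero. The re-initialization protocol of the actual VOT toolkit is not modelled, and the function names are hypothetical.

```python
def overlap(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Average overlap on successfully tracked frames, plus failure count."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


gt = [(0, 0, 2, 2)] * 3
pred = [(0, 0, 2, 2), (1, 1, 2, 2), (10, 10, 2, 2)]
print(accuracy_and_failures(pred, gt))
```

In this toy run the third prediction misses the target entirely, so it counts as a failure and is excluded from the accuracy average.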
Class of trackers tested
• Single-object, single-camera
• Short-term
– Trackers performing without re-detection
• Causality
– Tracker is not allowed to use any future frames
• No prior knowledge about the target
– Only a single training example – BBox in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
VOT Automatic Dataset Construction Protocol: cluster + sample
Candidate pool: 443 sequences
• ALOV (315 seq) [Smeulders et al., 2013]
• + PTR (~50 seq) [Vojir et al., 2013]
• + OTB (~50 seq) [Wu et al., 2013]
• + >30 new sequences from the VOT2015 committee
Filtered out:
• Grayscale sequences
• <400 pixels targets
• Poorly-defined targets
• Artificially created sequences
Examples: poorly defined target; artificially created
Result: 356 sequences
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated
automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → Cluster similar sequences
Construction (3/3): Sampling
• Requirement:
• Diverse visual attributes
• Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled
challenging sequences while keeping the global
attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
1. Segment the target (semi-automatic)
2. Automatically fit a bounding box by optimizing a cost function
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty – better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (axis: accuracy)]
VOT 2016: Tracking speed
• Top-performers slowest
• Plausible cause: CNN
• Real-time bound: Staple+
• Decent accuracy
• Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: Submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
– target learning (modelling, “template update”)
– integration of detection and temporal smoothness assumptions
– representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague
established 1707
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Optic Flow
Standard formulation
bull At every pixel 2D displacement is estimated between consecutive frames
Missing
bull occlusion ndash disocclusion handling pixels visible in one image only
- in the standard formulation ldquodonrsquot knowrdquo is not an answer
bull considering the 3D nature of the world
bull large displacement handling - only recently addressed (EpicFlow 2015)
Practical issues hindering progress in optic flow
bull is the ground truth ever known
- learning and performance evaluation problematic (synthetic sequences )
bull requires generic regularization (smoothing)
bull failure (assumption validity) not easy to detect
In certain applications tracking is motion estimation on a part of the image
with specific constraints augmented reality sports analysis 18150
Formulation (1) Tracking
Establishing point-to-point correspondences
in consecutive frames of an image sequence
Notes
bull The concept of an ldquoobjectrdquo in FampP definition disappeared
bull If an algorithm correctly established such correspondences
would that be a perfect tracker
bull tracking = motion estimation
Consider the Bolt sequence
19150
Definition (2) Tracking
Given an initial estimate of its position
locate X in a sequence of images
Where X may mean
bull A (rectangular) region
bull An ldquointerest pointrdquo and its neighbourhood
bull An ldquoobjectrdquo
This definition is adopted eg in a recent book by
Maggio and Cavallaro Video Tracking 2011
Smeulders T-PAMI13
Tracking is the analysis of video sequences for the
purpose of establishing the location of the target
over a sequence of frames (time) starting from
the bounding box given in the first frame
20150
Formulation (3) Tracking as Segmentation
J Fan et al Closed-Loop Adaptation for Robust Tracking ECCV 2010
21150
Tracking as model-based segmentation
22150
Tracking as segmentation
httpvisionucsdedu~kbransonresearchcvpr2005html
httpwww2immdtudk~aamtracking
bull heart
23150
A ldquostandardrdquo CV tracking method output
24150
Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required
Formulation (4) Tracking
Given an initial estimate of the pose and state of X
In all images in a sequence (in a causal manner)
1 estimate the pose and state of X
2 (optionally) update the model of X
bull Pose any geometric parameter (position scale hellip)
bull State appearance shapesegmentation visibility articulations
bull Model update essentially a semi-supervised learning problem
ndash a priori information (appearance shape dynamics hellip)
ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences
bull Causal for estimation at T use information from time t middot T
25150
Tracking for Black Mirror Blocking
26150
Tracking in 6D
27150
Tracking-Learning-Detection (TLD)
28150
A ldquomiraclerdquo Tracking a Transparent Object
video credit
Helmut
Grabner
H Grabner H Bischof On-line boosting and vision CVPR 2006
29150
Tracking the ldquoInvisiblerdquo
H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010
30150
video
Other Tracking Problems
helliphellip multiple object tracking hellip
32150
Multi-object Tracking
Other Tracking Problems
Cell division
httpwwwyoutubecomwatchv=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
httpwwwyoutubecomwatchv=YFKA647w4Jg
splitting and merging events hellip
34150
So I want to track
35150
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
What is here in libraries OpenCV
bull KLT tracker (1981)
bull CAMshift and Meanshift (1998)
bull TLD (2011)
bull MedianFLow (2010)
bull Boosting (2006)
bull MIL (2009)
and KCF (2012) in opencv_contrib38150
What is here in libraries BoofCV
bull SparseFlow (KLT tracker) (1991)
bull MeanShift (1998)
bull TLD (2011)
bull KCF (2012)
39150
httpsgithubcomlessthanoptimalBoofCV
Computer vision librar
Bad news OpenCV
40150httpwwwvotchallengenet
Reference implementation
41150
But authors often publish their own implementation on githubhellip
git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit
Good news
42150
Good news Not so
43150
Good news Not so
44150
Good newsEasy to contribute to open source
Just implement some modern tracking algorithm
45150
So we need to understand how tracking works
Classic
KLT tracker
46150
The KLT Tracker
47150Slide credit Kris Kitani
KLT Tracker
slide creditTomas Svoboda
48150
Importance in Computer Vision
bull Firstly published in 1981 as an image registration method [3]
bull Improved many times most importantly by Carlo Tomasi [54]
bull Free implementaton(s) available1
bull After more than two decades a project2 at CMU dedicated to this
bull single algorithm and results published in a premium journal [1]
bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html
Image alignment
slide creditTomas Svoboda 49150
Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image
Original Lucas-Kanade algorithm I
slide creditTomas Svoboda 50150
Original Lucas-Kanade algorithm II
slide creditTomas Svoboda 51150
Original Lucas-Kanade algorithm III
slide creditTomas Svoboda 52150
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
bull cost function sum of squared intensity differences
between template and window
bull optimization technique gradient descent
bull model learning no update last frame convex
combination
bull attractive properties
ndashfast
ndasheasily extended to image-to-image transformations with
multiple parameters
59150
What about deep
learning or
CNN for tracking
60150
61150
AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo
Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016
Recap CaffeNet architecture
63
Recent trackers development
74150httpsgithubcomfoolwoodbenchmark_results
CNN KCF
Discriminative Tracking (T by Detection)
t=0
samples
labels+1 +1 +1 minus1 minus1 minus1
Classifier
tgt0
hellipClassify subwindows to find target
75
Connection to Correlation
The Convolution Theorem
ldquoCross-correlation is equivalent to an
element-wise product in Fourier domainrdquo
bull where
ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858
(likewise for ො119857 and ෝ119856)
ndashtimes is element-wise product
ndash lowast is complex-conjugate (ie negate imaginary part)
119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺
bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)
76
Kernelized Correlation Filters
bull Circulant matrices are a very general tool which allows to replace
standard operations with fast Fourier operations
bull The same idea can by applied eg to the Kernel Ridge Regression
with K kernel matrix Kij = (xi xj) and dual space representation
bull For many kernels circulant data circulant K matrix
bull Diagonalizing with the DFT for learning the classifier yields
120630 = 119870 + 120582119868 minus1119858
119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)
ෝ120630 =ො119858
መ119844119857119857+ 120582
Fast solution in 119978 119899 log 119899 Typical kernel algorithms are
119978 1198992 or higher
rArr
77
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
bull very few
hyperparameters
bull code fits on one slide
of the presentation
bull Use HoG features
(32 channels)
bull ~300 FPS
bull Open-Source
(MatlabPythonJavaC)
function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)
end
function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))
end
function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))
end
Training and detection (Matlab)
Sum over channel dimensionin kernel computation
79
From KCF to Discriminative CF trackers
Basic
bull Henriques et al ndash CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
bull Danelljan et al ndashSRDCF
ndash spatial regularization in the learning process
rarr limits boundary effect
rarr penalize filter coefficients depending on their spatial location
ndash allows to use much larger search region
ndash more discriminative to background (more training data)
CNN-based Correlation Trackers
bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT
2016)
bull Ma et al
ndash features VGG-Net pretrained on ImageNet dataset extracted from
third fourth and fifth convolution layer
ndash for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
bull Nam et al ndash MDNet T-CNN
ndash CNN classification (3 convolution layers and 2 fully connected layers)
81
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT challenge evolution
bull Gradual increase of dataset size
bull Gradual refinement of dataset construction
bull Gradual refinement of performance measures
bull Gradual increase of tested trackers
Perf Measures Dataset size Target box Property Trackers tested
VOT2013 ranks A R 16 s manual manual per frame 27
VOT2014 ranks A R EFO 25 s manual manual per frame 38
VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR
VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR
8739
Class of trackers tested
bull Single-object single-camera
bull Short-term
ndashTrackers performing without re-detection
bull Causality
ndashTracker is not allowed to use any future frames
bull No prior knowledge about the target
ndashOnly a single training example ndash BBox in the first frame
bull Object state encoded by a bounding box
88150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (13) Sequence candidates
ALOV (315 seq)[Smeulders et al2013]
Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences
Example Poorly defined target Example Artificially created
356 sequences
PTR (~50 seq)[Vojir et al2013]
+OTB (~50 seq)
[Wu et al2013]
+
gt30 new sequencesfrom VOT2015
committee
+
443sequences
VOT Automatic Dataset Construction Protocol
cluster + sample
8939
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (23) Clustering
bull Approximately annotate targets
bull 11 global attributes estimated
automatically for 356 sequences (eg blur camera motion object motion)
bull Cluster into K = 28 groups (automatic selection of K)
Feature encoding
11 dim
Affinity Propagation [Frey Dueck 2007]
Cluster similar sequences
9039
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (33) Sampling
bull Requirement
bull Diverse visual attributes
bull Challenging subset
bull Global visual attributes computed
bull Tracking difficulty attribute Applied FoT ASMS KCF trackers
bull Developed a sampling strategy that sampled
challenging sequences while keeping the global
attributes diverse
9139
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT201516 dataset 60 sequences
9239
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Object annotation
bull Automatic bounding box placement
1 Segment the target (semi-automatic)
2 Automatically fit a bounding box by optimizing a cost function
9339
Kristan et al VOT2016 results
Sequence ranking
bull Among the most challenging sequences
bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)
9442
Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)
Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot with accuracy axis]
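The accuracy axis and the overlap curves are easier to read with the underlying measure in mind. Assuming that "overlap" means intersection over union between the predicted and the ground-truth region, and simplifying to axis-aligned boxes given as (x, y, w, h), a minimal sketch could look as follows. The function names are illustrative and are not taken from the VOT toolkit.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the intersection rectangle (zero if the boxes are disjoint).
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def mean_overlap(predicted, ground_truth):
    """Average per-frame overlap over one sequence."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(overlaps) / len(overlaps)


# Two frames: a perfect prediction and one shifted by (1, 1).
predicted = [(0, 0, 2, 2), (1, 1, 2, 2)]
ground_truth = [(0, 0, 2, 2), (0, 0, 2, 2)]
print(mean_overlap(predicted, ground_truth))
```

Robustness is not covered by this sketch; it only illustrates the per-frame overlap that an accuracy score averages.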
VOT 2016: Tracking speed
• Top performers are the slowest
  – Plausible cause: CNN
• Real-time bound: Staple+
  – Decent accuracy
  – Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  – target learning (modelling, “template update”)
  – integration of detection and temporal smoothness assumptions
  – representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics, Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)
Area of expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Formulation (1): Tracking
Establishing point-to-point correspondences in consecutive frames of an image sequence
Notes:
• The concept of an “object” in the F&P definition disappeared
• If an algorithm correctly established such correspondences, would that be a perfect tracker?
• tracking = motion estimation
Consider the Bolt sequence
Definition (2): Tracking
Given an initial estimate of its position, locate X in a sequence of images
Where X may mean:
• A (rectangular) region
• An “interest point” and its neighbourhood
• An “object”
This definition is adopted e.g. in a recent book by Maggio and Cavallaro, Video Tracking, 2011
Smeulders, T-PAMI 2013:
Tracking is the analysis of video sequences for the purpose of establishing the location of the target over a sequence of frames (time), starting from the bounding box given in the first frame
Formulation (3): Tracking as Segmentation
J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010
Tracking as model-based segmentation
Tracking as segmentation
http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
http://www2.imm.dtu.dk/~aam/tracking/
• heart
A “standard” CV tracking method output
Approximate motion estimation, approximate segmentation. Neither good optic flow nor precise segmentation is required.
Formulation (4): Tracking
Given an initial estimate of the pose and state of X:
In all images in a sequence (in a causal manner)
1. estimate the pose and state of X
2. (optionally) update the model of X
• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
  – a priori information (appearance, shape, dynamics, …)
  – labeled data (“track this”) + unlabeled data = the sequences
• Causal: for estimation at T, use information from time t ≤ T
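The two numbered steps of Formulation (4) can be read as a loop over frames. The sketch below is only an illustration of that control flow: `track`, `estimate_brightest` and `update_pose` are hypothetical names, the "model" is a plain dictionary, and the toy target is the brightest pixel of a 1-D frame.

```python
def track(frames, initial_pose, estimate, update=None):
    """Causal tracking loop: frame t is processed using only frames 0..t."""
    model = {"pose": initial_pose}          # hypothetical model container
    trajectory = []
    for frame in frames:
        pose = estimate(model, frame)       # step 1: estimate pose/state of X
        if update is not None:
            model = update(model, frame, pose)   # step 2: optional model update
        trajectory.append(pose)
    return trajectory


def estimate_brightest(model, frame, radius=2):
    """Toy estimator: brightest pixel within `radius` of the last known pose."""
    lo = max(0, model["pose"] - radius)
    hi = min(len(frame), model["pose"] + radius + 1)
    return max(range(lo, hi), key=lambda i: frame[i])


def update_pose(model, frame, pose):
    """Toy update: remember the latest pose as the new search centre."""
    return {"pose": pose}


frames = [[0, 9, 0, 0, 0, 0], [0, 0, 9, 0, 0, 0], [0, 0, 0, 0, 9, 0]]
print(track(frames, 1, estimate_brightest, update_pose))
```

Without the update step the search window stays centred on the initial pose, so the target is lost in the last frame of this toy sequence; this is the simplest form of the "no update" versus "update" trade-off.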
Tracking for Black Mirror Blocking
Tracking in 6D
Tracking-Learning-Detection (TLD)
A “miracle”: Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof, On-line boosting and vision, CVPR 2006
Tracking the “Invisible”
H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010
Other Tracking Problems
… multiple object tracking …
Multi-object Tracking
Other Tracking Problems
Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg
splitting and merging events …
So I want to track
Just link to some computer vision lib:
import cv2  # Python; which tracker classes are exposed depends on the OpenCV build
or
#include <opencv2/core.hpp>
or
import org.opencv.core.Core;
import org.opencv.core.Mat;
What is here in libraries: OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib
What is here in libraries: BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV
Computer vision library
Bad news: OpenCV
http://www.votchallenge.net
Reference implementation
But authors often publish their own implementation on GitHub…
git clone https://github.com/martin-danelljan/Continuous-ConvOp.git
Good news
Good news? Not so
Good news: easy to contribute to open source
Just implement some modern tracking algorithm
So we need to understand how tracking works
Classic: KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [54]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
Slide credit: Tomas Svoboda
Goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.
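For reference, the alignment goal is commonly written as a least-squares problem. The notation below is an assumption, since the slide itself only names T(x) and I(x): W(x; p) is a warp with parameters p, and the second line is the Gauss-Newton update obtained by linearising I around the current p.

```latex
\min_{\mathbf{p}} \sum_{\mathbf{x}} \bigl[ I(W(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \bigr]^{2}

\Delta\mathbf{p} = H^{-1} \sum_{\mathbf{x}}
  \Bigl[ \nabla I \,\frac{\partial W}{\partial \mathbf{p}} \Bigr]^{\top}
  \bigl[ T(\mathbf{x}) - I(W(\mathbf{x};\mathbf{p})) \bigr],
\qquad
H = \sum_{\mathbf{x}}
  \Bigl[ \nabla I \,\frac{\partial W}{\partial \mathbf{p}} \Bigr]^{\top}
  \Bigl[ \nabla I \,\frac{\partial W}{\partial \mathbf{p}} \Bigr]
```

For a pure translation, W(x; p) = x + p, the Jacobian of W is the identity and H reduces to the 2x2 sum of outer products of image gradients.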
Original Lucas-Kanade algorithm I-IV; Original Lucas-Kanade summary
[Slides shown as images; slide credit: Tomas Svoboda]
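Since the Lucas-Kanade slides survive only as titles, here is a minimal sketch of one Gauss-Newton step for the simplest case, a pure translation. It is an illustration under stated assumptions, not the algorithm exactly as presented on the slides: the test image is a synthetic Gaussian blob, gradients come from `np.gradient`, and the names `gaussian_blob` and `lk_translation_step` are made up for this example.

```python
import numpy as np


def gaussian_blob(size, cx, cy, sigma=5.0):
    """Smooth synthetic image: a Gaussian blob centred at (cx, cy)."""
    y, x = np.mgrid[0:size, 0:size].astype(float)
    return np.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2 * sigma ** 2))


def lk_translation_step(template, image):
    """One Gauss-Newton step for a translation d with image(x + d) ~ template(x)."""
    gy, gx = np.gradient(image)          # derivatives along rows (y) and columns (x)
    err = template - image               # residual to be explained by the motion
    H = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    b = np.array([np.sum(gx * err), np.sum(gy * err)])
    return np.linalg.solve(H, b)         # estimated (dx, dy)


template = gaussian_blob(64, 32.0, 32.0)
image = gaussian_blob(64, 32.3, 31.8)    # same blob moved by (+0.3, -0.2) pixels
dx, dy = lk_translation_step(template, image)
print(dx, dy)
```

A single step is only accurate for small displacements; a full tracker would warp the image with the current estimate and repeat, and would use an image pyramid for larger motions.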
KLT Tracker
For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]
Slide credit: Patrick Perez
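The three cases in the list above can be told apart by the eigenvalues of the 2x2 matrix of summed gradient products (the same matrix that a translation-only Lucas-Kanade step has to invert). The sketch below uses the smaller eigenvalue as a conditioning score; the function name and the tiny synthetic patches are illustrative assumptions.

```python
import numpy as np


def min_eigenvalue(patch):
    """Smaller eigenvalue of the summed gradient outer-product matrix of a patch."""
    gy, gx = np.gradient(patch.astype(float))
    G = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return np.linalg.eigvalsh(G)[0]      # eigvalsh returns eigenvalues in ascending order


uniform = np.ones((8, 8))                # no gradient at all
edge = np.zeros((8, 8))
edge[:, 4:] = 1.0                        # vertical step edge: gradient in x only
corner = np.zeros((8, 8))
corner[4:, 4:] = 1.0                     # bright quadrant: gradients in x and y

for name, patch in [("uniform", uniform), ("edge", edge), ("corner", corner)]:
    print(name, min_eigenvalue(patch))
```

The uniform patch and the edge both give a smallest eigenvalue of (numerically) zero, so the matrix cannot be inverted reliably; only the corner constrains motion in both directions.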
Aperture Problem Example
Image from Gary Bradski slides
If we look through a small hole…
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
– Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
– Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
– Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
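The first option, killing a tracklet when the SSD grows too large, amounts to a threshold test. The sketch below compares the current patch with the initial one directly and leaves out the affine warp mentioned in the second option; `should_kill` and the threshold value are hypothetical.

```python
import numpy as np


def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches."""
    diff = np.asarray(patch_a, dtype=float) - np.asarray(patch_b, dtype=float)
    return float(np.sum(diff ** 2))


def should_kill(initial_patch, current_patch, max_ssd):
    """True when the current patch no longer resembles the initial one."""
    return ssd(initial_patch, current_patch) > max_ssd


initial = np.array([[10, 20], [30, 40]])
drifted = np.array([[10, 20], [30, 90]])   # one pixel changed by 50 -> SSD = 2500
print(should_kill(initial, initial, max_ssd=100.0),
      should_kill(initial, drifted, max_ssd=100.0))
```

In practice the threshold would depend on patch size and image noise, so it is usually tuned per application.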
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  – fast
  – easily extended to image-to-image transformations with multiple parameters
What about deep learning or CNN for tracking
AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016
Recap: CaffeNet architecture
Recent trackers development
https://github.com/foolwood/benchmark_results
CNN, KCF
Discriminative Tracking (Tracking by Detection)
t = 0: samples with labels (+1 +1 +1 −1 −1 −1) → Classifier
t > 0: … classify subwindows to find target
Connection to Correlation
The Convolution Theorem
“Cross-correlation is equivalent to an element-wise product in Fourier domain”
  $\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \;\Longleftrightarrow\; \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$
• where
  – $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  – $\times$ is the element-wise product
  – $*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
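The statement can be checked numerically in a few lines. The sketch assumes the convention y[k] = sum_i x[i] w[(i + k) mod n] for cyclic cross-correlation, which is the one that matches conjugating the transform of x; the helper name is made up for the example.

```python
import numpy as np


def circular_cross_correlation(x, w):
    """Direct O(n^2) cyclic cross-correlation: y[k] = sum_i x[i] * w[(i + k) mod n]."""
    n = len(x)
    return np.array([sum(x[i] * w[(i + k) % n] for i in range(n)) for k in range(n)])


rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w = rng.standard_normal(8)

direct = circular_cross_correlation(x, w)
# Same result from an element-wise product in the Fourier domain.
via_fft = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))
print(np.allclose(direct, via_fft))
```

The direct sum costs O(n^2) operations while the FFT route costs O(n log n), which is what makes correlation filters fast.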
Kernelized Correlation Filters
• Circulant matrices are a very general tool which allows replacing standard operations with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and the dual-space representation
  $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$
• For many kernels: circulant data $\Rightarrow$ circulant $K$ matrix
  $K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
  $\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$
• Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
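A small numerical check of the diagonalisation step: for a circulant kernel matrix, solving the regularised linear system directly and dividing in the Fourier domain should give the same dual coefficients. The first row `k` used here is a made-up symmetric Gaussian profile rather than a kernel computed from image data, and the sizes and the value of `lam` are arbitrary.

```python
import numpy as np

n = 16
lam = 1e-2                                              # regularisation weight (lambda)
offsets = np.minimum(np.arange(n), n - np.arange(n))    # cyclic distance to index 0
k = np.exp(-offsets ** 2 / (2 * 2.0 ** 2))              # symmetric first row of K
K = np.array([np.roll(k, i) for i in range(n)])         # circulant matrix C(k)
y = np.exp(-offsets ** 2 / (2 * 1.0 ** 2))              # regression target, peak at 0

# Direct solve of (K + lambda I) alpha = y: O(n^3).
alpha_direct = np.linalg.solve(K + lam * np.eye(n), y)
# Fourier-domain solve, element-wise division: O(n log n).
alpha_fourier = np.real(np.fft.ifft(np.fft.fft(y) / (np.fft.fft(k) + lam)))
print(np.allclose(alpha_direct, alpha_fourier))
```

Because `k` is symmetric, its transform is real and the choice between "first row" and "first column" conventions for the circulant matrix does not matter in this example.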
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
  $k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$
• For the Gaussian kernel it yields
  $\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^{2}}\left(\|\mathbf{x}\|^{2} + \|\mathbf{x}'\|^{2} - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$
  (multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term)
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• Uses HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)
Training and detection (Matlab):
    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      % sum over the channel dimension (3) in the kernel computation
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
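For readers without Matlab, the same three functions can be sketched in NumPy. This is a single-channel port written for this note, so the sum over the channel dimension is dropped; the label helper `gaussian_labels`, the patch size and the parameter values are assumptions chosen for the demonstration.

```python
import numpy as np


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of two single-channel patches (all cyclic shifts)."""
    c = np.real(np.fft.ifft2(np.conj(np.fft.fft2(x1)) * np.fft.fft2(x2)))
    d = np.sum(x1 ** 2) + np.sum(x2 ** 2) - 2.0 * c
    return np.exp(-np.abs(d) / (sigma ** 2 * d.size))


def train(x, y, sigma, lam):
    """Dual coefficients in the Fourier domain."""
    k = kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)


def detect(alphaf, x, z, sigma):
    """Response map for all cyclic shifts of the test patch z."""
    k = kernel_correlation(z, x, sigma)
    return np.real(np.fft.ifft2(alphaf * np.fft.fft2(k)))


def gaussian_labels(shape, sigma):
    """Regression target with its peak at index (0, 0), wrapping cyclically."""
    rows = np.minimum(np.arange(shape[0]), shape[0] - np.arange(shape[0]))
    cols = np.minimum(np.arange(shape[1]), shape[1] - np.arange(shape[1]))
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.exp(-(rr ** 2 + cc ** 2) / (2 * sigma ** 2))


rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))              # training patch
y = gaussian_labels(x.shape, sigma=2.0)
alphaf = train(x, y, sigma=0.5, lam=1e-4)

z = np.roll(x, shift=(3, 5), axis=(0, 1))      # target moved by 3 rows, 5 columns
response = detect(alphaf, x, z, sigma=0.5)
peak = np.unravel_index(np.argmax(response), response.shape)
print(peak)
```

With the conjugate placed on the first argument, as in the listing, a cyclic shift of (3, 5) shows up as a response peak at (-3, -5) modulo the patch size. A tracker built on this sketch therefore has to map the peak index back to a displacement with that sign convention in mind.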
From KCF to Discriminative CF trackers
Basic
• Henriques et al. – CSK
  – raw grayscale pixel values as features
• Henriques et al. – KCF
  – HoG multi-channel features
Further work
• Danelljan et al. – DSST
  – PCA-HoG + grayscale pixel features
  – filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
  – HoG, color-naming and grayscale pixel features
  – quantize scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. – SRDCF
  – spatial regularization in the learning process
    → limits boundary effect
    → penalizes filter coefficients depending on their spatial location
  – allows using a much larger search region
  – more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  – features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  – for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. – MDNet, T-CNN
  – CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? “The zero-order tracker”
VOT community evolution
• VOT2013 (ICCV 2013): 51 coauthors, 14 pages
• VOT2014 (ECCV 2014): 57 coauthors, 27 pages
• VOT2015 (ICCV 2015): 128 coauthors, 24 pages
• VOT2016 (ECCV 2016): 141 coauthors, 44 pages
• plus VOT-TIR papers (69 and 70 coauthors)
[Figure: timeline with the VOT2013-VOT2016 submission deadlines marked; axis ticks at 1500 and 3000]
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

           Perf. measures     Dataset size      Target box   Property    Trackers tested
VOT2013    ranks, A, R        16, s. manual     manual       per frame   27
VOT2014    ranks, A, R, EFO   25, s. manual     manual       per frame   38
VOT2015    EAO, A, R, EFO     60, fully auto    manual       per frame   62 VOT, 24 VOT-TIR
VOT2016    EAO, A, R, EFO     60, fully auto    auto         per frame   70 VOT, 24 VOT-TIR
Class of trackers tested
• Single-object, single-camera
• Short-term
  – Trackers performing without re-detection
• Causality
  – Tracker is not allowed to use any future frames
• No prior knowledge about the target
  – Only a single training example: BBox in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
ALOV (315 seq.) [Smeulders et al. 2013]
Filtered out:
• Grayscale sequences
• <400 pixels targets
• Poorly-defined targets
• Artificially created sequences
[Examples: poorly defined target; artificially created sequence]
356 sequences
Definition (2) Tracking
Given an initial estimate of its position
locate X in a sequence of images
Where X may mean
bull A (rectangular) region
bull An ldquointerest pointrdquo and its neighbourhood
bull An ldquoobjectrdquo
This definition is adopted eg in a recent book by
Maggio and Cavallaro Video Tracking 2011
Smeulders T-PAMI13
Tracking is the analysis of video sequences for the
purpose of establishing the location of the target
over a sequence of frames (time) starting from
the bounding box given in the first frame
20150
Formulation (3) Tracking as Segmentation
J Fan et al Closed-Loop Adaptation for Robust Tracking ECCV 2010
21150
Tracking as model-based segmentation
22150
Tracking as segmentation
httpvisionucsdedu~kbransonresearchcvpr2005html
httpwww2immdtudk~aamtracking
bull heart
23150
A ldquostandardrdquo CV tracking method output
24150
Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required
Formulation (4) Tracking
Given an initial estimate of the pose and state of X
In all images in a sequence (in a causal manner)
1 estimate the pose and state of X
2 (optionally) update the model of X
bull Pose any geometric parameter (position scale hellip)
bull State appearance shapesegmentation visibility articulations
bull Model update essentially a semi-supervised learning problem
ndash a priori information (appearance shape dynamics hellip)
ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences
bull Causal for estimation at T use information from time t middot T
25150
Tracking for Black Mirror Blocking
26150
Tracking in 6D
27150
Tracking-Learning-Detection (TLD)
28150
A ldquomiraclerdquo Tracking a Transparent Object
video credit
Helmut
Grabner
H Grabner H Bischof On-line boosting and vision CVPR 2006
29150
Tracking the ldquoInvisiblerdquo
H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010
30150
video
Other Tracking Problems
helliphellip multiple object tracking hellip
32150
Multi-object Tracking
Other Tracking Problems
Cell division
httpwwwyoutubecomwatchv=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
httpwwwyoutubecomwatchv=YFKA647w4Jg
splitting and merging events hellip
34150
So I want to track
35150
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
What is here in libraries OpenCV
bull KLT tracker (1981)
bull CAMshift and Meanshift (1998)
bull TLD (2011)
bull MedianFLow (2010)
bull Boosting (2006)
bull MIL (2009)
and KCF (2012) in opencv_contrib38150
What is here in libraries BoofCV
bull SparseFlow (KLT tracker) (1991)
bull MeanShift (1998)
bull TLD (2011)
bull KCF (2012)
39150
httpsgithubcomlessthanoptimalBoofCV
Computer vision librar
Bad news OpenCV
40150httpwwwvotchallengenet
Reference implementation
41150
But authors often publish their own implementation on githubhellip
git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit
Good news
42150
Good news Not so
43150
Good news Not so
44150
Good newsEasy to contribute to open source
Just implement some modern tracking algorithm
45150
So we need to understand how tracking works
Classic
KLT tracker
46150
The KLT Tracker
47150Slide credit Kris Kitani
KLT Tracker
slide creditTomas Svoboda
48150
Importance in Computer Vision
bull Firstly published in 1981 as an image registration method [3]
bull Improved many times most importantly by Carlo Tomasi [54]
bull Free implementaton(s) available1
bull After more than two decades a project2 at CMU dedicated to this
bull single algorithm and results published in a premium journal [1]
bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html
Image alignment
slide creditTomas Svoboda 49150
Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image
Original Lucas-Kanade algorithm I
slide creditTomas Svoboda 50150
Original Lucas-Kanade algorithm II
slide creditTomas Svoboda 51150
Original Lucas-Kanade algorithm III
slide creditTomas Svoboda 52150
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
bull cost function sum of squared intensity differences
between template and window
bull optimization technique gradient descent
bull model learning no update last frame convex
combination
bull attractive properties
ndashfast
ndasheasily extended to image-to-image transformations with
multiple parameters
59150
What about deep
learning or
CNN for tracking
60150
61150
AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo
Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016
Recap CaffeNet architecture
63
Recent trackers development
74150httpsgithubcomfoolwoodbenchmark_results
CNN KCF
Discriminative Tracking (T by Detection)
t=0
samples
labels+1 +1 +1 minus1 minus1 minus1
Classifier
tgt0
hellipClassify subwindows to find target
75
Connection to Correlation
The Convolution Theorem
ldquoCross-correlation is equivalent to an
element-wise product in Fourier domainrdquo
bull where
ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858
(likewise for ො119857 and ෝ119856)
ndashtimes is element-wise product
ndash lowast is complex-conjugate (ie negate imaginary part)
119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺
bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)
76
Kernelized Correlation Filters
bull Circulant matrices are a very general tool which allows to replace
standard operations with fast Fourier operations
bull The same idea can by applied eg to the Kernel Ridge Regression
with K kernel matrix Kij = (xi xj) and dual space representation
bull For many kernels circulant data circulant K matrix
bull Diagonalizing with the DFT for learning the classifier yields
120630 = 119870 + 120582119868 minus1119858
119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)
ෝ120630 =ො119858
መ119844119857119857+ 120582
Fast solution in 119978 119899 log 119899 Typical kernel algorithms are
119978 1198992 or higher
rArr
77
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
bull very few
hyperparameters
bull code fits on one slide
of the presentation
bull Use HoG features
(32 channels)
bull ~300 FPS
bull Open-Source
(MatlabPythonJavaC)
function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)
end
function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))
end
function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))
end
Training and detection (Matlab)
Sum over channel dimensionin kernel computation
79
From KCF to Discriminative CF trackers
Basic
bull Henriques et al ndash CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
bull Danelljan et al ndashSRDCF
ndash spatial regularization in the learning process
rarr limits boundary effect
rarr penalize filter coefficients depending on their spatial location
ndash allows to use much larger search region
ndash more discriminative to background (more training data)
CNN-based Correlation Trackers
bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT
2016)
bull Ma et al
ndash features VGG-Net pretrained on ImageNet dataset extracted from
third fourth and fifth convolution layer
ndash for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
bull Nam et al ndash MDNet T-CNN
ndash CNN classification (3 convolution layers and 2 fully connected layers)
81
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT challenge evolution
bull Gradual increase of dataset size
bull Gradual refinement of dataset construction
bull Gradual refinement of performance measures
bull Gradual increase of tested trackers
Perf Measures Dataset size Target box Property Trackers tested
VOT2013 ranks A R 16 s manual manual per frame 27
VOT2014 ranks A R EFO 25 s manual manual per frame 38
VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR
VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR
8739
Class of trackers tested
• Single-object, single-camera
• Short-term
– Trackers performing without re-detection
• Causality
– Tracker is not allowed to use any future frames
• No prior knowledge about the target
– Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
VOT Automatic Dataset Construction Protocol: cluster + sample
Candidate pool (443 sequences):
• ALOV (315 seq.) [Smeulders et al., 2013]
• + PTR (~50 seq.) [Vojir et al., 2013]
• + OTB (~50 seq.) [Wu et al., 2013]
• + >30 new sequences from the VOT2015 committee
Filtered out:
• Grayscale sequences
• Targets of <400 pixels
• Poorly defined targets (example on the slide)
• Artificially created sequences (example on the slide)
After filtering: 356 sequences
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Feature encoding: 11 dimensions per sequence
• Cluster similar sequences with Affinity Propagation [Frey, Dueck 2007]
• Cluster into K = 28 groups (automatic selection of K)
Construction (3/3): Sampling
• Requirements
– Diverse visual attributes
– Challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that samples challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
1. Segment the target (semi-automatic)
2. Automatically fit a bounding box by optimizing a cost function
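The slide does not state which cost function is optimized in step 2. As a rough illustration only, the sketch below uses the simplest possible stand-in: the tightest axis-aligned box that encloses every foreground pixel of the segmentation mask. The function name `tight_bounding_box` and the list-of-rows mask format are assumptions made for this example.

```python
def tight_bounding_box(mask):
    """Smallest axis-aligned box (x, y, w, h) enclosing all nonzero pixels.

    `mask` is a list of rows; returns None when the mask has no foreground.
    This is a simplification, not the cost-based fit used for the dataset.
    """
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, v in enumerate(row) if v]
    if not rows:
        return None
    x0, x1 = min(cols), max(cols)
    y0, y1 = rows[0], rows[-1]
    return (x0, y0, x1 - x0 + 1, y1 - y0 + 1)


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
box = tight_bounding_box(mask)
```

A cost-based fit would typically trade off foreground pixels left outside the box against background pixels included in it, which matters for thin or articulated targets.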
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
• Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves and AR-raw plot (accuracy axis), with C-COT, TCNN, SSAT and MLDF labeled]
VOT 2016: Tracking speed
• Top performers are the slowest
– Plausible cause: CNN
• Real-time bound: Staple+
– Decent accuracy
– Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: tracking speed plot, with C-COT, TCNN, SSAT, MLDF and Staple+ labeled]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
– target learning (modelling, “template update”)
– integration of detection and temporal smoothness assumptions
– representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)
Area of expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Formulation (3): Tracking as Segmentation
J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010
Tracking as model-based segmentation
Tracking as segmentation
http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
http://www2.imm.dtu.dk/~aam/tracking/
• heart
A “standard” CV tracking method output
Approximate motion estimation, approximate segmentation. Neither good optic flow nor precise segmentation is required.
Formulation (4): Tracking
Given an initial estimate of the pose and state of X:
In all images in a sequence (in a causal manner)
1. estimate the pose and state of X
2. (optionally) update the model of X
• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
– a priori information (appearance, shape, dynamics, …)
– labeled data (“track this”) + unlabeled data = the sequences
• Causal: for estimation at time T, use only information from times t ≤ T
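The formulation above can be read as a simple loop over frames. The following is only a sketch under stated assumptions: the pose is reduced to a bounding box, the model is a plain dictionary, and `track_sequence`, `estimate` and `update` are hypothetical names introduced for illustration.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

Box = Tuple[float, float, float, float]  # pose as (x, y, width, height)


def track_sequence(
    frames: Sequence[Any],
    init_box: Box,
    estimate: Callable[[Any, Box, dict], Box],
    update: Optional[Callable[[Any, Box, dict], dict]] = None,
) -> List[Box]:
    """Causal loop: the estimate for frame t only sees frames up to t."""
    model = {"init_box": init_box}  # hypothetical model container
    box = init_box
    trajectory = [init_box]
    for frame in frames[1:]:
        box = estimate(frame, box, model)        # 1. estimate pose and state
        if update is not None:
            model = update(frame, box, model)    # 2. optionally update the model
        trajectory.append(box)
    return trajectory


# Toy usage: each "frame" is just the horizontal motion since the previous frame.
def shift_by_frame(frame, box, model):
    x, y, w, h = box
    return (x + frame, y, w, h)


trajectory = track_sequence([0, 2, 2, 2], (10, 5, 4, 4), shift_by_frame)
```

Because `estimate` receives only the current frame, the previous box and the model, the loop cannot look at future frames, which is the causality requirement.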
Tracking for Black Mirror Blocking
Tracking in 6D
Tracking-Learning-Detection (TLD)
A “miracle”: Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof, On-line boosting and vision, CVPR 2006
Tracking the “Invisible”
H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010
Other Tracking Problems
… multiple object tracking …
Multi-object Tracking
Other Tracking Problems
Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg
… splitting and merging events …
So I want to track
Just link to some computer vision lib
import cv2  # Python
or
#include <opencv2/core.hpp>  // C++
or
import org.opencv.core.Core;  // Java
import org.opencv.core.Mat;
What is here in libraries: OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
• and KCF (2012) in opencv_contrib
What is here in libraries: BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV
Bad news: OpenCV
http://www.votchallenge.net
Reference implementation
But authors often publish their own implementation on GitHub…
git clone https://github.com/martin-danelljan/Continuous-ConvOp.git
Good news
Good news? Not so
Good news: Easy to contribute to open source
Just implement some modern tracking algorithm
So we need to understand how tracking works
Classic: KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [54]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
Slide credit: Tomas Svoboda
The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) can also be a small subwindow within an image.
Original Lucas-Kanade algorithm I-IV
Original Lucas-Kanade summary
KLT Tracker
For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
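The three cases above can be checked numerically. A minimal sketch, assuming that conditioning is judged by the smaller eigenvalue of the 2x2 matrix of summed gradient products over the patch (the criterion associated with the Tomasi and Shi reference); the function name and the tiny synthetic patches are made up for illustration.

```python
def structure_tensor_min_eig(patch):
    """Smaller eigenvalue of sum([[Ix*Ix, Ix*Iy], [Ix*Iy, Iy*Iy]]) over the
    interior of a patch given as a list of rows of floats."""
    h, w = len(patch), len(patch[0])
    sxx = sxy = syy = 0.0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            ix = (patch[r][c + 1] - patch[r][c - 1]) / 2.0  # central differences
            iy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    half_trace = (sxx + syy) / 2.0
    det = sxx * syy - sxy * sxy
    disc = max(0.0, half_trace ** 2 - det) ** 0.5
    return half_trace - disc


uniform = [[1.0] * 7 for _ in range(7)]
edge = [[0.0 if c < 3 else 1.0 for c in range(7)] for _ in range(7)]
corner = [[1.0 if (r >= 3 and c >= 3) else 0.0 for c in range(7)] for r in range(7)]

scores = {
    "uniform": structure_tensor_min_eig(uniform),
    "edge": structure_tensor_min_eig(edge),
    "corner": structure_tensor_min_eig(corner),
}
```

The uniform patch and the straight edge both score zero (no information, or information along one direction only), while the corner scores clearly above zero, so only the corner constrains motion in both directions.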
Slide credit: Patrick Perez
Aperture Problem Example
Image from Gary Bradski's slides
If we look through a small hole…
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
– Arbitrary displacement
• Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
– Translation is usually sufficient for small fragments, but:
• Perspective transforms and occlusions cause drift and loss
– Two complementary options:
• Kill tracklets when the minimum SSD is too large
• Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
– fast
– easily extended to image-to-image transformations with multiple parameters
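To make the cost function and its optimization concrete, here is a minimal sketch for the simplest case: a 1D signal and a pure translation. It assumes linear interpolation for sub-sample access and uses a Gauss-Newton step (the gradient-based update usually associated with Lucas-Kanade); `sample`, `lk_translation` and the synthetic `bump` signal are illustrative names, not part of any library.

```python
import math


def sample(signal, x):
    """Linear interpolation of a 1D signal at real position x (clamped)."""
    x = min(max(x, 0.0), len(signal) - 1.0)
    i = min(int(x), len(signal) - 2)
    a = x - i
    return (1.0 - a) * signal[i] + a * signal[i + 1]


def lk_translation(template, image, start, p=0.0, iters=20):
    """Find shift p minimizing sum_x (image(start + x + p) - template(x))^2."""
    for _ in range(iters):
        num = den = 0.0
        for x, t in enumerate(template):
            pos = start + x + p
            g = sample(image, pos + 0.5) - sample(image, pos - 0.5)  # dI/dx
            err = t - sample(image, pos)
            num += g * err
            den += g * g
        if den == 0.0:  # uniform window: no information to estimate motion
            break
        step = num / den  # Gauss-Newton update for the SSD cost
        p += step
        if abs(step) < 1e-6:
            break
    return p


def bump(u):
    """Smooth synthetic signal: a Gaussian bump centred at u = 30."""
    return math.exp(-((u - 30.0) ** 2) / (2.0 * 5.0 ** 2))


true_shift = 2.3
image = [bump(u - true_shift) for u in range(64)]   # bump moved right by 2.3
template = [bump(20 + x) for x in range(21)]        # window 20..40 of the original
estimated = lk_translation(template, image, start=20)
```

On this smooth signal the estimate lands close to the true shift of 2.3 samples. For larger displacements the same update is usually run coarse-to-fine on an image pyramid.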
What about deep learning or CNN for tracking?
AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016
Recap: CaffeNet architecture
Recent trackers development
https://github.com/foolwood/benchmark_results
CNN KCF
Discriminative Tracking (Tracking by Detection)
• t = 0: samples with labels (+1 +1 +1 −1 −1 −1) train a classifier
• t > 0: classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
“Cross-correlation is equivalent to an element-wise product in Fourier domain”
$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$
• where
– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
– $\times$ is the element-wise product
– $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
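The identity can be verified numerically on a small 1D example. The sketch below assumes real-valued signals and uses a naive O(n^2) DFT written with the standard library, purely so the example has no dependencies; a real tracker would call an FFT routine instead. All function names are illustrative.

```python
import cmath


def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [
        sum(v[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
        for m in range(n)
    ]
    return [c / n for c in out] if inverse else out


def circular_cross_correlation(x, w):
    """y[i] = sum_k x[k] * w[(k + i) mod n], computed directly."""
    n = len(x)
    return [sum(x[k] * w[(k + i) % n] for k in range(n)) for i in range(n)]


def cross_correlation_fourier(x, w):
    """Same quantity via y_hat = conj(x_hat) * w_hat (element-wise)."""
    xf, wf = dft(x), dft(w)
    yf = [a.conjugate() * b for a, b in zip(xf, wf)]
    return [c.real for c in dft(yf, inverse=True)]


x = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
w = [0.0, 1.0, 0.5, 0.0, 0.0, 0.0, 2.0, 0.0]
direct = circular_cross_correlation(x, w)
via_fourier = cross_correlation_fourier(x, w)
max_difference = max(abs(a - b) for a, b in zip(direct, via_fourier))
```

The two results agree up to floating-point error, and the index arithmetic `(k + i) % n` makes the cyclic wrap-around mentioned in the note explicit.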
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows replacing standard operations with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual space representation
$$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$$
• For many kernels, circulant data ⇒ circulant $K$ matrix:
$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
$$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$$
⇒ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx'}}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
$$k_i^{\mathbf{xx'}} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$$
• For the Gaussian kernel it yields
$$\mathbf{k}^{\mathbf{xx'}} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$$
(multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term)
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)
Training and detection (Matlab):
function alphaf = train(x, y, sigma, lambda)
    k = kernel_correlation(x, x, sigma);
    alphaf = fft2(y) ./ (fft2(k) + lambda);
end
function y = detect(alphaf, x, z, sigma)
    k = kernel_correlation(z, x, sigma);
    y = real(ifft2(alphaf .* fft2(k)));
end
function k = kernel_correlation(x1, x2, sigma)
    % sum over the channel dimension (3) in the kernel computation
    c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
    d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
    k = exp(-1 / sigma^2 * abs(d) / numel(d));
end
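For readers without Matlab, the same three functions can be sketched in Python. This is a simplified port under stated assumptions, not the authors' implementation: it handles a single-channel 1D signal, replaces `fft2` with a naive O(n^2) DFT from the standard library so it runs without dependencies, and uses illustrative values for `sigma`, `lam` and the label width.

```python
import cmath
import math
import random


def dft(v, inverse=False):
    """Naive DFT; a real tracker would use an FFT here."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [
        sum(v[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
        for m in range(n)
    ]
    return [c / n for c in out] if inverse else out


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x2 and every cyclic shift of x1."""
    n = len(x1)
    prod = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    c = [v.real for v in dft(prod, inverse=True)]
    norms = sum(a * a for a in x1) + sum(b * b for b in x2)
    return [math.exp(-abs(norms - 2.0 * ci) / (sigma ** 2 * n)) for ci in c]


def train(x, y, sigma, lam):
    """Kernel ridge regression solved element-wise in the Fourier domain."""
    kf = dft(kernel_correlation(x, x, sigma))
    return [yf / (k + lam) for yf, k in zip(dft(y), kf)]


def detect(alphaf, x, z, sigma):
    """Classifier response for all cyclic shifts of the new window z."""
    kf = dft(kernel_correlation(z, x, sigma))
    return [v.real for v in dft([a * k for a, k in zip(alphaf, kf)], inverse=True)]


n, shift = 32, 5
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(n)]                         # training window
y = [math.exp(-0.5 * (min(k, n - k) / 2.0) ** 2) for k in range(n)]  # label, peak at 0
z = [x[(k + shift) % n] for k in range(n)]                          # shifted copy of x

alphaf = train(x, y, sigma=0.5, lam=1e-4)
response = detect(alphaf, x, z, sigma=0.5)
peak = max(range(n), key=lambda i: response[i])
```

With `z[k] = x[(k + shift) mod n]` and the argument order used on the slide, the response peaks at index `shift`, which is how the tracker reads off the target displacement. The sign of the shift depends on that argument order.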
From KCF to Discriminative CF trackers
Basic
• Henriques et al. – CSK
– raw grayscale pixel values as features
• Henriques et al. – KCF
– HoG multi-channel features
Further work
• Danelljan et al. – DSST
– PCA-HoG + grayscale pixel features
– filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
– HoG, color-naming and grayscale pixel features
– quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. – SRDCF
– spatial regularization in the learning process
→ limits the boundary effect
→ penalizes filter coefficients depending on their spatial location
– allows the use of a much larger search region
– more discriminative against the background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
– features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
– for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. – MDNet, T-CNN
– CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT challenge evolution
bull Gradual increase of dataset size
bull Gradual refinement of dataset construction
bull Gradual refinement of performance measures
bull Gradual increase of tested trackers
Perf Measures Dataset size Target box Property Trackers tested
VOT2013 ranks A R 16 s manual manual per frame 27
VOT2014 ranks A R EFO 25 s manual manual per frame 38
VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR
VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR
8739
Class of trackers tested
bull Single-object single-camera
bull Short-term
ndashTrackers performing without re-detection
bull Causality
ndashTracker is not allowed to use any future frames
bull No prior knowledge about the target
ndashOnly a single training example ndash BBox in the first frame
bull Object state encoded by a bounding box
88150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (13) Sequence candidates
ALOV (315 seq)[Smeulders et al2013]
Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences
Example Poorly defined target Example Artificially created
356 sequences
PTR (~50 seq)[Vojir et al2013]
+OTB (~50 seq)
[Wu et al2013]
+
gt30 new sequencesfrom VOT2015
committee
+
443sequences
VOT Automatic Dataset Construction Protocol
cluster + sample
8939
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (23) Clustering
bull Approximately annotate targets
bull 11 global attributes estimated
automatically for 356 sequences (eg blur camera motion object motion)
bull Cluster into K = 28 groups (automatic selection of K)
Feature encoding
11 dim
Affinity Propagation [Frey Dueck 2007]
Cluster similar sequences
9039
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (33) Sampling
bull Requirement
bull Diverse visual attributes
bull Challenging subset
bull Global visual attributes computed
bull Tracking difficulty attribute Applied FoT ASMS KCF trackers
bull Developed a sampling strategy that sampled
challenging sequences while keeping the global
attributes diverse
9139
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT201516 dataset 60 sequences
9239
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Object annotation
bull Automatic bounding box placement
1 Segment the target (semi-automatic)
2 Automatically fit a bounding box by optimizing a cost function
9339
Kristan et al VOT2016 results
Sequence ranking
bull Among the most challenging sequences
bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)
9442
Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)
Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT Results Realtime
bull Flow-based Mean Shift-based Correlation filters
bull Engineering use of basic features
2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)
2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)
ASMSBDF
FoT
2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)
PLTFoT
CCMS
9639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Results
bull C-COT slightly ahead of TCNN
bull Most accurate SSAT
bull Most robust C-COT and MLDF
Overlap curves
9742
(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple
C-COT
TCNNSSAT
MLDF
AR-raw plot
Acc
ura
cy
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Tracking speed
bull Top-performers slowest bull Plausible cause CNN
bull Real-time bound Staple+
bull Decent accuracy
bull Decent robustness
Note the speed in some Matlab trackers has been significantly underestimated by the toolkit since it was measuring also the Matlabrestart time The EFOs of Matlab trackers are in fact higher than stated in this figure
9842
C-COT TCNNSSAT MLDF
Staple+
Staple+
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT public resources
bull Resources publicly available VOT page
bull Raw results of all tested trackers
bull Relevant methodology papers
bull 2016 Submitted trackers codebinaries
bull All fully annotated datasets (2013-2016)
bull Documentation tutorials forum
9939
Summary
bull ldquoVisual Trackingrdquo may refer to quite different problems
bull The area is just starting to be affected by CNNs
bull Robustness at all levels is the road to reliable performance
bull Key components of trackers
ndash target learning (modelling ldquotemplate updaterdquo)
ndash integration of detection and temporal smoothness assumptions
ndash representation of the image and target
bull Be careful when evaluating tracking results
100150
Computer vision courses online
bull httpscwfelcvutczwikicoursesucuws17start UCU Winter
school course
bull httpcsbrowneducoursescs143
bull httpswwwudacitycomcourseintroduction-to-computer-vision-
-ud810
bull httpcs231nstanfordedu
bull httpvisionstanfordeduteachingcs131_fall1415indexhtml
101150
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University Prague
established 1707
Area of Expertise Computer Vision Image Processing Pattern Recognition
People12 Academics 3 full 2 associate 7 assistant professors15 Researchers15 Phd students 15 MSc and BSc students
Impact amp Excellencesignificant of funding (gt75) from EU and private companies collaboration with high-tech companies (Microsoft Google Toyota VW Daimler Samsung Boeing Hitachi Honeywell) numerous science prizes hundreds of ISI citations to our publications per year competitive results in contests
Visual Recognition Group (headJiri Matas)
We are looking for 1-3
good students for PhD
THANK YOU
Questions please
105150
Tracking as model-based segmentation
22150
Tracking as segmentation
httpvisionucsdedu~kbransonresearchcvpr2005html
httpwww2immdtudk~aamtracking
bull heart
23150
A ldquostandardrdquo CV tracking method output
24150
Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required
Formulation (4) Tracking
Given an initial estimate of the pose and state of X
In all images in a sequence (in a causal manner)
1 estimate the pose and state of X
2 (optionally) update the model of X
bull Pose any geometric parameter (position scale hellip)
bull State appearance shapesegmentation visibility articulations
bull Model update essentially a semi-supervised learning problem
ndash a priori information (appearance shape dynamics hellip)
ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences
bull Causal for estimation at T use information from time t middot T
25150
Tracking for Black Mirror Blocking
26150
Tracking in 6D
27150
Tracking-Learning-Detection (TLD)
28150
A ldquomiraclerdquo Tracking a Transparent Object
video credit
Helmut
Grabner
H Grabner H Bischof On-line boosting and vision CVPR 2006
29150
Tracking the ldquoInvisiblerdquo
H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010
30150
video
Other Tracking Problems
helliphellip multiple object tracking hellip
32150
Multi-object Tracking
Other Tracking Problems
Cell division
httpwwwyoutubecomwatchv=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
httpwwwyoutubecomwatchv=YFKA647w4Jg
splitting and merging events hellip
34150
So I want to track
35150
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
What is here in libraries OpenCV
bull KLT tracker (1981)
bull CAMshift and Meanshift (1998)
bull TLD (2011)
bull MedianFLow (2010)
bull Boosting (2006)
bull MIL (2009)
and KCF (2012) in opencv_contrib38150
What is here in libraries BoofCV
bull SparseFlow (KLT tracker) (1991)
bull MeanShift (1998)
bull TLD (2011)
bull KCF (2012)
39150
httpsgithubcomlessthanoptimalBoofCV
Computer vision librar
Bad news OpenCV
40150httpwwwvotchallengenet
Reference implementation
41150
But authors often publish their own implementation on githubhellip
git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit
Good news
42150
Good news Not so
43150
Good news Not so
44150
Good newsEasy to contribute to open source
Just implement some modern tracking algorithm
45150
So we need to understand how tracking works
Classic
KLT tracker
46150
The KLT Tracker
47150Slide credit Kris Kitani
KLT Tracker
slide creditTomas Svoboda
48150
Importance in Computer Vision
bull Firstly published in 1981 as an image registration method [3]
bull Improved many times most importantly by Carlo Tomasi [54]
bull Free implementaton(s) available1
bull After more than two decades a project2 at CMU dedicated to this
bull single algorithm and results published in a premium journal [1]
bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html
Image alignment
slide creditTomas Svoboda 49150
Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image
Original Lucas-Kanade algorithm I
slide creditTomas Svoboda 50150
Original Lucas-Kanade algorithm II
slide creditTomas Svoboda 51150
Original Lucas-Kanade algorithm III
slide creditTomas Svoboda 52150
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
bull cost function sum of squared intensity differences
between template and window
bull optimization technique gradient descent
bull model learning no update last frame convex
combination
bull attractive properties
ndashfast
ndasheasily extended to image-to-image transformations with
multiple parameters
59150
What about deep
learning or
CNN for tracking
60150
61150
AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo
Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016
Recap CaffeNet architecture
63
Recent trackers development
74150httpsgithubcomfoolwoodbenchmark_results
CNN KCF
Discriminative Tracking (T by Detection)
t=0
samples
labels+1 +1 +1 minus1 minus1 minus1
Classifier
tgt0
hellipClassify subwindows to find target
75
Connection to Correlation
The Convolution Theorem
ldquoCross-correlation is equivalent to an
element-wise product in Fourier domainrdquo
bull where
ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858
(likewise for ො119857 and ෝ119856)
ndashtimes is element-wise product
ndash lowast is complex-conjugate (ie negate imaginary part)
119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺
bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)
76
Kernelized Correlation Filters
bull Circulant matrices are a very general tool which allows to replace
standard operations with fast Fourier operations
bull The same idea can by applied eg to the Kernel Ridge Regression
with K kernel matrix Kij = (xi xj) and dual space representation
bull For many kernels circulant data circulant K matrix
bull Diagonalizing with the DFT for learning the classifier yields
120630 = 119870 + 120582119868 minus1119858
119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)
ෝ120630 =ො119858
መ119844119857119857+ 120582
Fast solution in 119978 119899 log 119899 Typical kernel algorithms are
119978 1198992 or higher
rArr
77
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
bull very few
hyperparameters
bull code fits on one slide
of the presentation
bull Use HoG features
(32 channels)
bull ~300 FPS
bull Open-Source
(MatlabPythonJavaC)
function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)
end
function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))
end
function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))
end
Training and detection (Matlab)
Sum over channel dimensionin kernel computation
79
From KCF to Discriminative CF trackers
Basic
bull Henriques et al ndash CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
bull Danelljan et al ndashSRDCF
ndash spatial regularization in the learning process
rarr limits boundary effect
rarr penalize filter coefficients depending on their spatial location
ndash allows to use much larger search region
ndash more discriminative to background (more training data)
CNN-based Correlation Trackers
bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT
2016)
bull Ma et al
ndash features VGG-Net pretrained on ImageNet dataset extracted from
third fourth and fifth convolution layer
ndash for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
bull Nam et al ndash MDNet T-CNN
ndash CNN classification (3 convolution layers and 2 fully connected layers)
81
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

Challenge | Perf. measures   | Dataset size   | Target box | Property  | Trackers tested
VOT2013   | ranks, A, R      | 16, s. manual  | manual     | per frame | 27
VOT2014   | ranks, A, R, EFO | 25, s. manual  | manual     | per frame | 38
VOT2015   | EAO, A, R, EFO   | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016   | EAO, A, R, EFO   | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
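
In the table above, A and R denote accuracy and robustness (EAO is the expected average overlap, EFO the equivalent filter operations). As a rough illustration only, the sketch below assumes a simplified reading: accuracy is the mean bounding-box overlap (intersection over union) on frames where the tracker still overlaps the ground truth, and robustness is counted as the number of frames with zero overlap. The names `iou` and `accuracy_and_failures` are illustrative, and the re-initialization after each failure performed by the real VOT toolkit is deliberately left out.

```python
from typing import Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x, y, width, height), axis-aligned


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the intersection rectangle (zero if disjoint).
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(pred: Sequence[Box], gt: Sequence[Box]) -> Tuple[float, int]:
    """Simplified A/R: mean overlap on tracked frames, count of zero-overlap frames."""
    overlaps = [iou(p, g) for p, g in zip(pred, gt)]
    tracked = [o for o in overlaps if o > 0.0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures
```

Keeping the two numbers separate mirrors the table: a tracker can be accurate while it tracks and still fail often.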
Class of trackers tested
• Single-object, single-camera
• Short-term
  – Trackers performing without re-detection
• Causality
  – Tracker is not allowed to use any future frames
• No prior knowledge about the target
  – Only a single training example – BBox in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
Candidate pool (443 sequences):
• ALOV (315 seq.) [Smeulders et al., 2013]
• PTR (~50 seq.) [Vojir et al., 2013]
• OTB (~50 seq.) [Wu et al., 2013]
• >30 new sequences from the VOT2015 committee
Filtered out:
• Grayscale sequences
• <400 pixels targets
• Poorly-defined targets
• Artificially created sequences
[Figure: example of a poorly defined target; example of an artificially created sequence]
Remaining: 356 sequences → VOT Automatic Dataset Construction Protocol (cluster + sample)
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Pipeline: feature encoding (11-dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences
Construction (3/3): Sampling
• Requirements:
  – Diverse visual attributes
  – Challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
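
The slides do not state which cost function is optimized in step 2. Purely as an illustration, the sketch below assumes a hypothetical cost: the number of target pixels left outside the box plus `alpha` times the number of background pixels inside it, minimized by brute force over axis-aligned boxes. The name `fit_box`, the parameter `alpha`, and the search strategy are assumptions made for this example, not the VOT procedure.

```python
from typing import List, Tuple


def fit_box(mask: List[List[int]], alpha: float = 0.5) -> Tuple[int, int, int, int]:
    """Return (top, left, bottom, right), half-open, minimizing the assumed cost."""
    h, w = len(mask), len(mask[0])
    # Integral image: ii[r][c] = number of target pixels in mask[:r][:c].
    ii = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            ii[r + 1][c + 1] = mask[r][c] + ii[r][c + 1] + ii[r + 1][c] - ii[r][c]
    total_fg = ii[h][w]

    best_cost, best_box = None, (0, 0, h, w)
    for top in range(h):
        for bottom in range(top + 1, h + 1):
            for left in range(w):
                for right in range(left + 1, w + 1):
                    fg_in = ii[bottom][right] - ii[top][right] - ii[bottom][left] + ii[top][left]
                    area = (bottom - top) * (right - left)
                    # Assumed cost: missed target pixels + alpha * enclosed background.
                    cost = (total_fg - fg_in) + alpha * (area - fg_in)
                    if best_cost is None or cost < best_cost:
                        best_cost, best_box = cost, (top, left, bottom, right)
    return best_box
```

With such a trade-off a stray segmentation pixel far from the target does not inflate the box, which is one plausible reason to optimize a cost instead of taking the tight bounding rectangle of the mask.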
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty – better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (axis: accuracy) with C-COT, TCNN, SSAT and MLDF marked]
VOT 2016: Tracking speed
• Top performers are the slowest
  – Plausible cause: CNN
• Real-time bound: Staple+
  – Decent accuracy
  – Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: tracking speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  – target learning (modelling, “template update”)
  – integration of detection and temporal smoothness assumptions
  – representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague
established 1707
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: a significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Tracking as segmentation
http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
http://www2.imm.dtu.dk/~aam/tracking/
• heart
A “standard” CV tracking method output
Approximate motion estimation, approximate segmentation. Neither good optic flow nor precise segmentation is required.
Formulation (4): Tracking
Given an initial estimate of the pose and state of X:
In all images in a sequence (in a causal manner)
  1. estimate the pose and state of X
  2. (optionally) update the model of X
• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
  – a priori information (appearance, shape, dynamics, …)
  – labeled data (“track this”) + unlabeled data = the sequences
• Causal: for estimation at time T, use only information from time t ≤ T
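
The formulation above maps onto a simple loop. The following is a minimal sketch under stated assumptions: `track`, `estimate`, `update` and the toy estimator `nearest_peak` are hypothetical names, the "pose" is a 1-D position, and the "model" is just a search radius. It only illustrates the causal structure (each estimate sees the current frame plus what was derived from earlier frames), not any particular tracker.

```python
from typing import Any, Callable, List, Optional, Sequence


def track(
    frames: Sequence[Any],
    init_state: Any,
    init_model: Any,
    estimate: Callable[[Any, Any, Any], Any],
    update: Optional[Callable[[Any, Any, Any], Any]] = None,
) -> List[Any]:
    """Causal tracking loop: estimate the state per frame, optionally update the model."""
    state, model = init_state, init_model
    trajectory = [state]  # frame 0 carries the given initial estimate
    for frame in frames[1:]:
        # Step 1: estimate pose/state from the current frame and past knowledge only.
        state = estimate(frame, state, model)
        # Step 2 (optional): adapt the model using the new estimate.
        if update is not None:
            model = update(model, frame, state)
        trajectory.append(state)
    return trajectory


def nearest_peak(frame: Sequence[float], prev_pos: int, radius: int) -> int:
    """Toy estimator: brightest sample within `radius` of the previous position."""
    lo = max(0, prev_pos - radius)
    hi = min(len(frame), prev_pos + radius + 1)
    return max(range(lo, hi), key=lambda i: frame[i])
```

For example, `track(frames, 1, 2, nearest_peak)` follows a single bright sample through a list of 1-D frames without ever looking at a future frame.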
Tracking for Black Mirror Blocking
Tracking in 6D
Tracking-Learning-Detection (TLD)
A “miracle”: Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof, On-line boosting and vision, CVPR 2006
Tracking the “Invisible”
H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010
Other Tracking Problems
… multiple object tracking …
Multi-object Tracking
Other Tracking Problems
Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg
splitting and merging events …
So I want to track?
Just link to some computer vision lib:
    import cv2
or
    #include <opencv2/core.hpp>
or
    import org.opencv.core.Core;
    import org.opencv.core.Mat;
What is here in libraries? OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib
What is here in libraries? BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV
Computer vision library
Bad news: OpenCV
http://www.votchallenge.net
Reference implementation
But authors often publish their own implementation on GitHub…
    git clone https://github.com/martin-danelljan/Continuous-ConvOp.git
Good news
Good news? Not so
Good news: Easy to contribute to open source
Just implement some modern tracking algorithm
So we need to understand how tracking works
Classic: KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU was dedicated to this single algorithm, and the results were published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
Slide credit: Tomas Svoboda
The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]^T. I(x) could also be a small subwindow within an image.
Original Lucas-Kanade algorithm I
Original Lucas-Kanade algorithm II
Original Lucas-Kanade algorithm III
Original Lucas-Kanade algorithm IV
Original Lucas-Kanade summary
Slide credit: Tomas Svoboda
KLT Tracker
For good conditioning, the patch must be textured/structured enough
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]
Slide credit: Patrick Perez
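
The three patch types above can be told apart numerically. The sketch below assumes the usual 2×2 matrix of summed gradient products (the structure tensor) and uses its smallest eigenvalue as the conditioning measure, in the spirit of the cited Tomasi and Shi paper. The function name `min_eigenvalue` and the synthetic patches are illustrative assumptions.

```python
import numpy as np


def min_eigenvalue(patch: np.ndarray) -> float:
    """Smallest eigenvalue of the 2x2 gradient matrix summed over the patch."""
    gy, gx = np.gradient(patch.astype(float))  # derivatives along rows, columns
    G = np.array([
        [np.sum(gx * gx), np.sum(gx * gy)],
        [np.sum(gx * gy), np.sum(gy * gy)],
    ])
    return float(np.linalg.eigvalsh(G)[0])  # eigvalsh returns ascending order


# Synthetic 16x16 patches: uniform, vertical edge (contour), corner.
uniform = np.ones((16, 16))
edge = np.zeros((16, 16))
edge[:, 8:] = 1.0
corner = np.zeros((16, 16))
corner[8:, 8:] = 1.0
```

The uniform patch and the edge give a smallest eigenvalue of (numerically) zero, so the displacement is undetermined in at least one direction, while the corner gives a clearly positive value.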
Aperture Problem Example
Image from Gary Bradski slides
If we look through a small hole…
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
– Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
– Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
– Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  – fast
  – easily extended to image-to-image transformations with multiple parameters
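
To make the cost function and its optimization concrete, here is a minimal sketch, assuming a 1-D signal and a pure translation as the only warp parameter. It minimizes the sum of squared differences between the template and the shifted input with Gauss-Newton-style steps (the update usually associated with Lucas-Kanade). The function name `lucas_kanade_1d`, the linear interpolation, and the stopping rule are choices made for this example.

```python
import numpy as np


def lucas_kanade_1d(template: np.ndarray, image: np.ndarray, num_iters: int = 50) -> float:
    """Estimate p minimizing sum_x (image(x + p) - template(x))^2."""
    x = np.arange(len(template), dtype=float)
    grad = np.gradient(image)  # derivative of the input signal
    p = 0.0
    for _ in range(num_iters):
        warped = np.interp(x + p, x, image)  # image sampled at x + p
        g = np.interp(x + p, x, grad)        # gradient sampled at x + p
        error = template - warped
        denom = np.sum(g * g)
        if denom < 1e-12:  # uniform signal: no information
            break
        dp = np.sum(g * error) / denom       # Gauss-Newton step
        p += dp
        if abs(dp) < 1e-6:
            break
    return p
```

The `denom` check is the 1-D counterpart of the conditioning requirement: without gradient energy in the window, the step is undefined. Larger displacements would need the multi-resolution scheme, which this sketch omits.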
What about deep learning or CNN for tracking?
AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016
Recap: CaffeNet architecture
Recent trackers development
https://github.com/foolwood/benchmark_results
[Figure labels: CNN, KCF]
Discriminative Tracking (Tracking by Detection)
• t = 0: samples with labels (+1, +1, +1, −1, −1, −1) → train a classifier
• t > 0: … classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
“Cross-correlation is equivalent to an element-wise product in the Fourier domain”

$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$

• where
  – $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  – $\times$ is the element-wise product
  – $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
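
The statement can be checked numerically. This is a small sketch, assuming real 1-D signals and the convention y[k] = Σ_n x[n] w[(n + k) mod N] for cyclic cross-correlation; the function names are illustrative.

```python
import numpy as np


def circular_cross_correlation(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Direct O(n^2) cyclic cross-correlation: y[k] = sum_n x[n] * w[(n + k) mod N]."""
    n = len(x)
    return np.array([np.dot(x, np.roll(w, -k)) for k in range(n)])


def circular_cross_correlation_fft(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Same result in O(n log n): conjugate one spectrum, multiply element-wise."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))
```

Both functions return the same vector up to floating-point error, and the wrap-around in `np.roll` is exactly the cyclic behaviour noted in the last bullet.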
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows replacing standard operations with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual-space representation $\boldsymbol{\alpha}$:

$$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$$

• For many kernels: circulant data ⇒ circulant $K$ matrix

$$K = C(\mathbf{k}^{\mathbf{xx}})$$

  where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields

$$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$$

⇒ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
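
As a sanity check of the Fourier-domain solution, the sketch below compares it with the direct dual solution on a small 1-D example. It assumes a Gaussian kernel κ(a, b) = exp(−‖a − b‖²/σ²) and training samples that are all cyclic shifts of one signal x; the function names and parameter values are hypothetical.

```python
import numpy as np


def gaussian_kernel(a: np.ndarray, b: np.ndarray, sigma: float) -> float:
    return float(np.exp(-np.sum((a - b) ** 2) / sigma ** 2))


def solve_direct(x: np.ndarray, y: np.ndarray, sigma: float, lam: float) -> np.ndarray:
    """alpha = (K + lambda I)^{-1} y with the full n x n kernel matrix of all shifts."""
    n = len(x)
    shifts = [np.roll(x, i) for i in range(n)]
    K = np.array([[gaussian_kernel(si, sj, sigma) for sj in shifts] for si in shifts])
    return np.linalg.solve(K + lam * np.eye(n), y)


def solve_fourier(x: np.ndarray, y: np.ndarray, sigma: float, lam: float) -> np.ndarray:
    """Same alpha from the first row of K only, using element-wise division of DFTs."""
    xf = np.fft.fft(x)
    corr = np.real(np.fft.ifft(np.conj(xf) * xf))  # <x, shifted x> for every shift
    kxx = np.exp(-(2 * np.dot(x, x) - 2 * corr) / sigma ** 2)  # first row of K
    alphaf = np.fft.fft(y) / (np.fft.fft(kxx) + lam)
    return np.real(np.fft.ifft(alphaf))
```

The direct route builds and solves an n×n system, while the Fourier route never forms K at all, which is where the complexity gap comes from.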
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:

$$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$$

• For the Gaussian kernel it yields

$$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$$

  (multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\mathcal{F}^{-1}$ term)
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}}\right)$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• Uses HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)

Training and detection (Matlab):

    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);   % ridge regression, solved per frequency
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));       % response over all cyclic shifts
    end

    function k = kernel_correlation(x1, x2, sigma)
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));   % sum over the channel dimension
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));       % Gaussian kernel
    end

Note: the sum over the channel dimension happens in the kernel computation.
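
For readers without Matlab, the following is a hedged NumPy sketch of the same three functions, assuming H×W×C real-valued feature arrays. The helper `gaussian_labels` and all parameter values are assumptions added for illustration; feature extraction, cosine windowing and the model update are omitted.

```python
import numpy as np


def kernel_correlation(x1: np.ndarray, x2: np.ndarray, sigma: float) -> np.ndarray:
    """Gaussian kernel correlation of two H x W x C patches for all cyclic shifts."""
    x1f = np.fft.fft2(x1, axes=(0, 1))
    x2f = np.fft.fft2(x2, axes=(0, 1))
    c = np.fft.ifft2(np.sum(np.conj(x1f) * x2f, axis=2))  # sum over channels
    d = np.vdot(x1, x1) + np.vdot(x2, x2) - 2 * c
    return np.exp(-np.abs(d) / (sigma ** 2 * d.size))


def train(x: np.ndarray, y: np.ndarray, sigma: float, lam: float) -> np.ndarray:
    k = kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)


def detect(alphaf: np.ndarray, x: np.ndarray, z: np.ndarray, sigma: float) -> np.ndarray:
    k = kernel_correlation(z, x, sigma)
    return np.real(np.fft.ifft2(alphaf * np.fft.fft2(k)))


def gaussian_labels(height: int, width: int, sigma: float) -> np.ndarray:
    """Regression target peaked at (0, 0), using cyclic distances (hypothetical helper)."""
    rows = np.minimum(np.arange(height), height - np.arange(height))
    cols = np.minimum(np.arange(width), width - np.arange(width))
    return np.exp(-(rows[:, None] ** 2 + cols[None, :] ** 2) / (2 * sigma ** 2))
```

Detecting on the training patch itself returns a response peaked where the labels peak, and a cyclically shifted patch moves the peak by the same amount. With the argument order used here the peak moves opposite to the shift applied to `z`; that sign is a convention of this sketch.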
79
From KCF to Discriminative CF trackers
Basic
bull Henriques et al ndash CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
bull Danelljan et al ndashSRDCF
ndash spatial regularization in the learning process
rarr limits boundary effect
rarr penalize filter coefficients depending on their spatial location
ndash allows to use much larger search region
ndash more discriminative to background (more training data)
CNN-based Correlation Trackers
bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT
2016)
bull Ma et al
ndash features VGG-Net pretrained on ImageNet dataset extracted from
third fourth and fifth convolution layer
ndash for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
bull Nam et al ndash MDNet T-CNN
ndash CNN classification (3 convolution layers and 2 fully connected layers)
81
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT challenge evolution
bull Gradual increase of dataset size
bull Gradual refinement of dataset construction
bull Gradual refinement of performance measures
bull Gradual increase of tested trackers
Perf Measures Dataset size Target box Property Trackers tested
VOT2013 ranks A R 16 s manual manual per frame 27
VOT2014 ranks A R EFO 25 s manual manual per frame 38
VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR
VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR
8739
Class of trackers tested
bull Single-object single-camera
bull Short-term
ndashTrackers performing without re-detection
bull Causality
ndashTracker is not allowed to use any future frames
bull No prior knowledge about the target
ndashOnly a single training example ndash BBox in the first frame
bull Object state encoded by a bounding box
88150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (13) Sequence candidates
ALOV (315 seq)[Smeulders et al2013]
Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences
Example Poorly defined target Example Artificially created
356 sequences
PTR (~50 seq)[Vojir et al2013]
+OTB (~50 seq)
[Wu et al2013]
+
gt30 new sequencesfrom VOT2015
committee
+
443sequences
VOT Automatic Dataset Construction Protocol
cluster + sample
8939
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (23) Clustering
bull Approximately annotate targets
bull 11 global attributes estimated
automatically for 356 sequences (eg blur camera motion object motion)
bull Cluster into K = 28 groups (automatic selection of K)
Feature encoding
11 dim
Affinity Propagation [Frey Dueck 2007]
Cluster similar sequences
9039
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (33) Sampling
bull Requirement
bull Diverse visual attributes
bull Challenging subset
bull Global visual attributes computed
bull Tracking difficulty attribute Applied FoT ASMS KCF trackers
bull Developed a sampling strategy that sampled
challenging sequences while keeping the global
attributes diverse
9139
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT201516 dataset 60 sequences
9239
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Object annotation
bull Automatic bounding box placement
1 Segment the target (semi-automatic)
2 Automatically fit a bounding box by optimizing a cost function
9339
Kristan et al VOT2016 results
Sequence ranking
bull Among the most challenging sequences
bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)
9442
Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)
Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT Results Realtime
bull Flow-based Mean Shift-based Correlation filters
bull Engineering use of basic features
2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)
2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)
ASMSBDF
FoT
2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)
PLTFoT
CCMS
9639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Results
bull C-COT slightly ahead of TCNN
bull Most accurate SSAT
bull Most robust C-COT and MLDF
Overlap curves
9742
(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple
C-COT
TCNNSSAT
MLDF
AR-raw plot
Acc
ura
cy
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Tracking speed
bull Top-performers slowest bull Plausible cause CNN
bull Real-time bound Staple+
bull Decent accuracy
bull Decent robustness
Note the speed in some Matlab trackers has been significantly underestimated by the toolkit since it was measuring also the Matlabrestart time The EFOs of Matlab trackers are in fact higher than stated in this figure
9842
C-COT TCNNSSAT MLDF
Staple+
Staple+
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT public resources
bull Resources publicly available VOT page
bull Raw results of all tested trackers
bull Relevant methodology papers
bull 2016 Submitted trackers codebinaries
bull All fully annotated datasets (2013-2016)
bull Documentation tutorials forum
9939
Summary
bull ldquoVisual Trackingrdquo may refer to quite different problems
bull The area is just starting to be affected by CNNs
bull Robustness at all levels is the road to reliable performance
bull Key components of trackers
ndash target learning (modelling ldquotemplate updaterdquo)
ndash integration of detection and temporal smoothness assumptions
ndash representation of the image and target
bull Be careful when evaluating tracking results
100150
Computer vision courses online
bull httpscwfelcvutczwikicoursesucuws17start UCU Winter
school course
bull httpcsbrowneducoursescs143
bull httpswwwudacitycomcourseintroduction-to-computer-vision-
-ud810
bull httpcs231nstanfordedu
bull httpvisionstanfordeduteachingcs131_fall1415indexhtml
101150
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University Prague
established 1707
Area of Expertise Computer Vision Image Processing Pattern Recognition
People12 Academics 3 full 2 associate 7 assistant professors15 Researchers15 Phd students 15 MSc and BSc students
Impact amp Excellencesignificant of funding (gt75) from EU and private companies collaboration with high-tech companies (Microsoft Google Toyota VW Daimler Samsung Boeing Hitachi Honeywell) numerous science prizes hundreds of ISI citations to our publications per year competitive results in contests
Visual Recognition Group (headJiri Matas)
We are looking for 1-3
good students for PhD
THANK YOU
Questions please
105150
A ldquostandardrdquo CV tracking method output
24150
Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required
Formulation (4) Tracking
Given an initial estimate of the pose and state of X
In all images in a sequence (in a causal manner)
1 estimate the pose and state of X
2 (optionally) update the model of X
bull Pose any geometric parameter (position scale hellip)
bull State appearance shapesegmentation visibility articulations
bull Model update essentially a semi-supervised learning problem
ndash a priori information (appearance shape dynamics hellip)
ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences
bull Causal for estimation at T use information from time t middot T
25150
Tracking for Black Mirror Blocking
26150
Tracking in 6D
27150
Tracking-Learning-Detection (TLD)
28150
A ldquomiraclerdquo Tracking a Transparent Object
video credit
Helmut
Grabner
H Grabner H Bischof On-line boosting and vision CVPR 2006
29150
Tracking the ldquoInvisiblerdquo
H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010
30150
video
Other Tracking Problems
helliphellip multiple object tracking hellip
32150
Multi-object Tracking
Other Tracking Problems
Cell division
httpwwwyoutubecomwatchv=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
httpwwwyoutubecomwatchv=YFKA647w4Jg
splitting and merging events hellip
34150
So I want to track
35150
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
What is here in libraries OpenCV
bull KLT tracker (1981)
bull CAMshift and Meanshift (1998)
bull TLD (2011)
bull MedianFLow (2010)
bull Boosting (2006)
bull MIL (2009)
and KCF (2012) in opencv_contrib38150
What is here in libraries BoofCV
bull SparseFlow (KLT tracker) (1991)
bull MeanShift (1998)
bull TLD (2011)
bull KCF (2012)
39150
httpsgithubcomlessthanoptimalBoofCV
Computer vision librar
Bad news OpenCV
40150httpwwwvotchallengenet
Reference implementation
41150
But authors often publish their own implementation on githubhellip
git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit
Good news
42150
Good news Not so
43150
Good news Not so
44150
Good newsEasy to contribute to open source
Just implement some modern tracking algorithm
45150
So we need to understand how tracking works
Classic
KLT tracker
46150
The KLT Tracker
47150Slide credit Kris Kitani
KLT Tracker
slide creditTomas Svoboda
48150
Importance in Computer Vision
bull Firstly published in 1981 as an image registration method [3]
bull Improved many times most importantly by Carlo Tomasi [54]
bull Free implementaton(s) available1
bull After more than two decades a project2 at CMU dedicated to this
bull single algorithm and results published in a premium journal [1]
bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html
Image alignment
slide creditTomas Svoboda 49150
Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image
Original Lucas-Kanade algorithm I
slide creditTomas Svoboda 50150
Original Lucas-Kanade algorithm II
slide creditTomas Svoboda 51150
Original Lucas-Kanade algorithm III
slide creditTomas Svoboda 52150
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
bull cost function sum of squared intensity differences
between template and window
bull optimization technique gradient descent
bull model learning no update last frame convex
combination
bull attractive properties
ndashfast
ndasheasily extended to image-to-image transformations with
multiple parameters
59150
What about deep
learning or
CNN for tracking
60150
61150
AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo
Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016
Recap CaffeNet architecture
63
Recent trackers development
74150httpsgithubcomfoolwoodbenchmark_results
CNN KCF
Discriminative Tracking (T by Detection)
t=0
samples
labels+1 +1 +1 minus1 minus1 minus1
Classifier
tgt0
hellipClassify subwindows to find target
75
Connection to Correlation
The Convolution Theorem
ldquoCross-correlation is equivalent to an
element-wise product in Fourier domainrdquo
bull where
ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858
(likewise for ො119857 and ෝ119856)
ndashtimes is element-wise product
ndash lowast is complex-conjugate (ie negate imaginary part)
119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺
bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)
76
Kernelized Correlation Filters
bull Circulant matrices are a very general tool which allows to replace
standard operations with fast Fourier operations
bull The same idea can by applied eg to the Kernel Ridge Regression
with K kernel matrix Kij = (xi xj) and dual space representation
bull For many kernels circulant data circulant K matrix
bull Diagonalizing with the DFT for learning the classifier yields
120630 = 119870 + 120582119868 minus1119858
119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)
ෝ120630 =ො119858
መ119844119857119857+ 120582
Fast solution in 119978 119899 log 119899 Typical kernel algorithms are
119978 1198992 or higher
rArr
77
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
bull very few
hyperparameters
bull code fits on one slide
of the presentation
bull Use HoG features
(32 channels)
bull ~300 FPS
bull Open-Source
(MatlabPythonJavaC)
function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)
end
function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))
end
function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))
end
Training and detection (Matlab)
Sum over channel dimensionin kernel computation
79
From KCF to Discriminative CF trackers
Basic
bull Henriques et al ndash CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
bull Danelljan et al ndashSRDCF
ndash spatial regularization in the learning process
rarr limits boundary effect
rarr penalize filter coefficients depending on their spatial location
ndash allows to use much larger search region
ndash more discriminative to background (more training data)
CNN-based Correlation Trackers
bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT
2016)
bull Ma et al
ndash features VGG-Net pretrained on ImageNet dataset extracted from
third fourth and fifth convolution layer
ndash for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
bull Nam et al ndash MDNet T-CNN
ndash CNN classification (3 convolution layers and 2 fully connected layers)
81
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
[Figure: timeline of VOT challenge papers, with the VOT2013 to VOT2016 submission deadlines marked]
• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• + VOT-TIR paper (69 coauthors)
• + VOT-TIR paper (70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
Challenge | Perf. measures | Dataset size    | Target box | Property  | Trackers tested
VOT2013   | ranks, A, R    | 16, s. manual   | manual     | per frame | 27
VOT2014   | ranks, A, R, EFO | 25, s. manual | manual     | per frame | 38
VOT2015   | EAO, A, R, EFO | 60, fully auto  | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016   | EAO, A, R, EFO | 60, fully auto  | auto       | per frame | 70 VOT, 24 VOT-TIR
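The A and R columns stand for accuracy (average overlap between the predicted and ground-truth boxes) and robustness (number of failures). A minimal sketch of both, assuming boxes are given as `(x, y, w, h)` tuples; the function names are hypothetical, and the re-initialization that the VOT toolkit performs after a failure is deliberately left out:

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_robustness(predicted, ground_truth):
    """Return (mean overlap over tracked frames, number of zero-overlap frames).

    Simplification: every zero-overlap frame counts as one failure and the
    tracker is not reset afterwards.
    """
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


gt = [(0, 0, 10, 10)] * 3
pred = [(0, 0, 10, 10), (5, 0, 10, 10), (100, 100, 5, 5)]
acc, fails = accuracy_robustness(pred, gt)
```

Here the third prediction misses the target entirely, so it counts as one failure and is excluded from the accuracy average.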
Class of trackers tested
• Single-object, single-camera
• Short-term
  - Trackers performing without re-detection
• Causality
  - Tracker is not allowed to use any future frames
• No prior knowledge about the target
  - Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
• Sources: ALOV (315 seq.) [Smeulders et al., 2013] + PTR (~50 seq.) [Vojir et al., 2013] + OTB (~50 seq.) [Wu et al., 2013] + >30 new sequences from the VOT2015 committee
• 443 sequences
• Filtered out:
  - Grayscale sequences
  - Targets of <400 pixels
  - Poorly defined targets
  - Artificially created sequences
  [Examples on the slide: a poorly defined target, an artificially created sequence]
• 356 sequences
• VOT Automatic Dataset Construction Protocol: cluster + sample
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Feature encoding: 11-dimensional
• Cluster similar sequences with Affinity Propagation [Frey, Dueck 2007]
• Cluster into K = 28 groups (automatic selection of K)
Construction (3/3): Sampling
• Requirements:
  - Diverse visual attributes
  - Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that samples challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
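The slide does not state which cost function is optimized, so the sketch below uses the simplest possible stand-in: the tightest axis-aligned box around a binary segmentation mask. The function name and the list-of-rows mask format are assumptions for illustration only, not the VOT procedure.

```python
def tight_bounding_box(mask):
    """Return (x, y, w, h) of the smallest axis-aligned box covering all
    nonzero pixels of a mask given as a list of rows, or None if it is empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        return None
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    x, y = cols[0], rows[0]
    return (x, y, cols[-1] - x + 1, rows[-1] - y + 1)


mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
]
box = tight_bounding_box(mask)
```

A cost-based fit would trade off background pixels inside the box against target pixels left outside it, which a tight box never does.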
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix (A_f = 0.33, M_f = 57), Rabbit (A_f = 0.31, M_f = 43), Butterfly (A_f = 0.22, M_f = 45)
• Among the easiest sequences: Singer1 (A_f = 0.02, M_f = 4), Octopus (A_f = 0.01, M_f = 5), Sheep (A_f = 0.02, M_f = 15)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• Bounding boxes automatically generated from the segmentation
VOT Results: Real-time
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
• Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (accuracy axis) with C-COT, TCNN, SSAT and MLDF labeled]
VOT 2016: Tracking speed
• Top performers are the slowest
  - Plausible cause: CNN
• Real-time bound: Staple+
  - Decent accuracy
  - Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ labeled]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers (code/binaries)
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague
established 1707
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant share of funding (>75) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please?
Formulation (4): Tracking
Given an initial estimate of the pose and state of X:
In all images in a sequence (in a causal manner)
  1. estimate the pose and state of X
  2. (optionally) update the model of X
• Pose: any geometric parameter (position, scale, ...)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
  - a priori information (appearance, shape, dynamics, ...)
  - labeled data ("track this") + unlabeled data = the sequences
• Causal: for estimation at T, use only information from times t ≤ T
Tracking for Black Mirror Blocking
Tracking in 6D
Tracking-Learning-Detection (TLD)
A "miracle": Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof, On-line boosting and vision, CVPR 2006
Tracking the "Invisible"
H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010
[video]
Other Tracking Problems
... multiple object tracking ...
Multi-object Tracking
Other Tracking Problems
Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg
splitting and merging events ...
So I want to track
Just link to some computer vision lib
from cv2 import tracker
or
#include <opencv2/core.hpp>
or
import org.opencv.core.Core;
import org.opencv.core.Mat;
What is here in libraries: OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib
What is here in libraries: BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV
Computer vision librar
Bad news: OpenCV
http://www.votchallenge.net
Reference implementation
But authors often publish their own implementation on GitHub...
git clone https://github.com/martin-danelljan/Continuous-ConvOp.git
Good news
Good news? Not so
Good news: Easy to contribute to open source
Just implement some modern tracking algorithm
So we need to understand how tracking works
Classic: KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [54]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
Slide credit: Tomas Svoboda
The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.
Original Lucas-Kanade algorithm I
Original Lucas-Kanade algorithm II
Original Lucas-Kanade algorithm III
Original Lucas-Kanade algorithm IV
Original Lucas-Kanade summary
KLT Tracker
For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
Slide credit: Patrick Perez
Aperture Problem Example
Image from Gary Bradski's slides
If we look through a small hole...
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
- Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
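The two checks can be sketched as follows, assuming the patches are already aligned (warped) and flattened into equal-length lists of intensities. The function names and both thresholds are hypothetical; the affine warp itself is not implemented here.

```python
def ssd(a, b):
    """Sum of squared intensity differences between two aligned patches."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def tracklet_alive(initial_patch, previous_patch, current_patch,
                   max_step_ssd, max_drift_ssd):
    """Keep a tracklet only if it passes both quality checks.

    step check:  residual against the previous frame (loss, occlusion)
    drift check: residual against the initial patch (slow drift)
    """
    step_ok = ssd(previous_patch, current_patch) <= max_step_ssd
    drift_ok = ssd(initial_patch, current_patch) <= max_drift_ssd
    return step_ok and drift_ok


initial = [10, 10, 10, 10]
previous = [12, 12, 12, 12]
current = [13, 13, 13, 13]
alive_strict = tracklet_alive(initial, previous, current, 10, 20)
alive_loose = tracklet_alive(initial, previous, current, 10, 40)
```

In the example the frame-to-frame residual is small (4) while the residual against the initial patch is 36, so only the comparison with the initial patch reveals the drift.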
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters
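To make the cost function and its iterative minimization concrete, here is a minimal one-dimensional sketch that estimates a pure translation with Gauss-Newton style updates and linear interpolation. It is an illustration under simplifying assumptions (1D signal, translation only, no pyramid), not the KLT implementation from any library; all names are hypothetical.

```python
import math


def sample(signal, pos):
    """Linearly interpolate a 1D signal at a real-valued position (clamped)."""
    pos = min(max(pos, 0.0), len(signal) - 1.0)
    i = min(int(pos), len(signal) - 2)
    frac = pos - i
    return (1.0 - frac) * signal[i] + frac * signal[i + 1]


def lucas_kanade_1d(template, image, start, p=0.0, iters=20, eps=1e-6):
    """Estimate the shift p minimizing
    sum_x (image(start + x + p) - template(x))^2."""
    for _ in range(iters):
        num = den = 0.0
        for x, t_val in enumerate(template):
            pos = start + x + p
            warped = sample(image, pos)
            # central-difference image gradient at the warped position
            grad = 0.5 * (sample(image, pos + 1.0) - sample(image, pos - 1.0))
            num += grad * (t_val - warped)
            den += grad * grad
        if den < 1e-12:  # uniform patch: no information, stop
            break
        dp = num / den
        p += dp
        if abs(dp) < eps:
            break
    return p


def bump(u, center):
    return math.exp(-0.5 * ((u - center) / 6.0) ** 2)


frame0 = [bump(u, 50.0) for u in range(100)]
frame1 = [bump(u, 52.3) for u in range(100)]  # same bump, moved by 2.3 samples
start = 40
template = frame0[start:start + 21]
shift = lucas_kanade_1d(template, frame1, start)
```

The `den < 1e-12` guard is the 1D counterpart of the conditioning requirement: a patch without gradient carries no information about the displacement.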
What about deep learning or CNNs for tracking?
Recap: CaffeNet architecture
AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016
Recent trackers development
https://github.com/foolwood/benchmark_results
CNN, KCF
Discriminative Tracking (Tracking by Detection)
• t = 0: samples with labels +1 +1 +1 −1 −1 −1 → Classifier
• t > 0: ... classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
"Cross-correlation is equivalent to an element-wise product in the Fourier domain"
  $\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$
• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied, e.g., to Kernel Ridge Regression:
  $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$
  with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual-space representation $\boldsymbol{\alpha}$
• For many kernels, circulant data ⇒ circulant $K$ matrix:
  $K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
  $\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$
• Fast solution in $\mathcal{O}(n \log n)$; typical kernel algorithms are $\mathcal{O}(n^2)$ or higher
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$, with elements
  $k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$
• For the Gaussian kernel it yields
  $\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$
  Multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the term $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$.
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update the classifier $\boldsymbol{\alpha}$ and the model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features
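The training and detection formulas can be tried out on a toy 1D signal. The sketch below is a single-channel Python port of the `train`, `detect` and `kernel_correlation` routines shown on the KCF Tracker slide, using a naive DFT from the standard library instead of an FFT; the test signal, the regression target and the parameter values are assumptions chosen for illustration.

```python
import cmath
import math


def dft(v, sign=-1):
    """Naive O(n^2) DFT; sign=+1 gives the unnormalized inverse."""
    n = len(v)
    return [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(V):
    return [c / len(V) for c in dft(V, sign=+1)]


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x2 and all cyclic shifts of x1."""
    n = len(x1)
    X1, X2 = dft(x1), dft(x2)
    c = [v.real for v in idft([a.conjugate() * b for a, b in zip(X1, X2)])]
    energy = sum(a * a for a in x1) + sum(b * b for b in x2)
    return [math.exp(-abs(energy - 2.0 * cs) / (sigma ** 2 * n)) for cs in c]


def train(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat^xx + lambda), element-wise."""
    K = dft(kernel_correlation(x, x, sigma))
    return [yk / (kk + lam) for yk, kk in zip(dft(y), K)]


def detect(alphaf, x, z, sigma):
    """Response for every cyclic shift: F^-1(k_hat^xz * alpha_hat)."""
    K = dft(kernel_correlation(z, x, sigma))
    return [v.real for v in idft([a * k for a, k in zip(alphaf, K)])]


n, delay = 32, 5
x = [math.sin(0.3 * t) + 0.5 * math.cos(1.1 * t) + (1.0 if 10 <= t < 14 else 0.0)
     for t in range(n)]
# regression target: Gaussian peak at shift 0, wrapping around cyclically
y = [math.exp(-0.5 * (min(t, n - t) / 2.0) ** 2) for t in range(n)]
alphaf = train(x, y, sigma=0.5, lam=1e-3)

z = [x[(t - delay) % n] for t in range(n)]  # x cyclically delayed by 5 samples
peak_same = max(range(n), key=lambda s: detect(alphaf, x, x, 0.5)[s])
peak_moved = max(range(n), key=lambda s: detect(alphaf, x, z, 0.5)[s])
```

With the argument order used here (conjugate on the first argument, patch `z` passed first in `detect`), a patch delayed by `delay` samples moves the response peak to index `(-delay) mod n`; swapping which spectrum is conjugated flips that sign convention.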
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• Open-Source
(MatlabPythonJavaC)
function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)
end
function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))
end
function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))
end
Training and detection (Matlab)
Sum over channel dimensionin kernel computation
79
From KCF to Discriminative CF trackers
Basic
bull Henriques et al ndash CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
bull Danelljan et al ndashSRDCF
ndash spatial regularization in the learning process
rarr limits boundary effect
rarr penalize filter coefficients depending on their spatial location
ndash allows to use much larger search region
ndash more discriminative to background (more training data)
CNN-based Correlation Trackers
bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT
2016)
bull Ma et al
ndash features VGG-Net pretrained on ImageNet dataset extracted from
third fourth and fifth convolution layer
ndash for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
bull Nam et al ndash MDNet T-CNN
ndash CNN classification (3 convolution layers and 2 fully connected layers)
81
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT challenge evolution
bull Gradual increase of dataset size
bull Gradual refinement of dataset construction
bull Gradual refinement of performance measures
bull Gradual increase of tested trackers
Perf Measures Dataset size Target box Property Trackers tested
VOT2013 ranks A R 16 s manual manual per frame 27
VOT2014 ranks A R EFO 25 s manual manual per frame 38
VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR
VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR
8739
Class of trackers tested
bull Single-object single-camera
bull Short-term
ndashTrackers performing without re-detection
bull Causality
ndashTracker is not allowed to use any future frames
bull No prior knowledge about the target
ndashOnly a single training example ndash BBox in the first frame
bull Object state encoded by a bounding box
88150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (13) Sequence candidates
ALOV (315 seq)[Smeulders et al2013]
Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences
Example Poorly defined target Example Artificially created
356 sequences
PTR (~50 seq)[Vojir et al2013]
+OTB (~50 seq)
[Wu et al2013]
+
gt30 new sequencesfrom VOT2015
committee
+
443sequences
VOT Automatic Dataset Construction Protocol
cluster + sample
8939
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (23) Clustering
bull Approximately annotate targets
bull 11 global attributes estimated
automatically for 356 sequences (eg blur camera motion object motion)
bull Cluster into K = 28 groups (automatic selection of K)
Feature encoding
11 dim
Affinity Propagation [Frey Dueck 2007]
Cluster similar sequences
9039
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (33) Sampling
bull Requirement
bull Diverse visual attributes
bull Challenging subset
bull Global visual attributes computed
bull Tracking difficulty attribute Applied FoT ASMS KCF trackers
bull Developed a sampling strategy that sampled
challenging sequences while keeping the global
attributes diverse
9139
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT201516 dataset 60 sequences
9239
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Object annotation
bull Automatic bounding box placement
1 Segment the target (semi-automatic)
2 Automatically fit a bounding box by optimizing a cost function
9339
Kristan et al VOT2016 results
Sequence ranking
bull Among the most challenging sequences
bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)
9442
Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)
Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT Results Realtime
bull Flow-based Mean Shift-based Correlation filters
bull Engineering use of basic features
2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)
2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)
ASMSBDF
FoT
2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)
PLTFoT
CCMS
9639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Results
bull C-COT slightly ahead of TCNN
bull Most accurate SSAT
bull Most robust C-COT and MLDF
Overlap curves
9742
(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple
C-COT
TCNNSSAT
MLDF
AR-raw plot
Acc
ura
cy
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Tracking speed
bull Top-performers slowest bull Plausible cause CNN
bull Real-time bound Staple+
bull Decent accuracy
bull Decent robustness
Note the speed in some Matlab trackers has been significantly underestimated by the toolkit since it was measuring also the Matlabrestart time The EFOs of Matlab trackers are in fact higher than stated in this figure
9842
C-COT TCNNSSAT MLDF
Staple+
Staple+
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT public resources
bull Resources publicly available VOT page
bull Raw results of all tested trackers
bull Relevant methodology papers
bull 2016 Submitted trackers codebinaries
bull All fully annotated datasets (2013-2016)
bull Documentation tutorials forum
9939
Summary
bull ldquoVisual Trackingrdquo may refer to quite different problems
bull The area is just starting to be affected by CNNs
bull Robustness at all levels is the road to reliable performance
bull Key components of trackers
ndash target learning (modelling ldquotemplate updaterdquo)
ndash integration of detection and temporal smoothness assumptions
ndash representation of the image and target
bull Be careful when evaluating tracking results
100150
Computer vision courses online
bull httpscwfelcvutczwikicoursesucuws17start UCU Winter
school course
bull httpcsbrowneducoursescs143
bull httpswwwudacitycomcourseintroduction-to-computer-vision-
-ud810
bull httpcs231nstanfordedu
bull httpvisionstanfordeduteachingcs131_fall1415indexhtml
101150
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University Prague
established 1707
Area of Expertise Computer Vision Image Processing Pattern Recognition
People12 Academics 3 full 2 associate 7 assistant professors15 Researchers15 Phd students 15 MSc and BSc students
Impact amp Excellencesignificant of funding (gt75) from EU and private companies collaboration with high-tech companies (Microsoft Google Toyota VW Daimler Samsung Boeing Hitachi Honeywell) numerous science prizes hundreds of ISI citations to our publications per year competitive results in contests
Visual Recognition Group (headJiri Matas)
We are looking for 1-3
good students for PhD
THANK YOU
Questions please
105150
Tracking for Black Mirror Blocking
26150
Tracking in 6D
27150
Tracking-Learning-Detection (TLD)
28150
A ldquomiraclerdquo Tracking a Transparent Object
video credit
Helmut
Grabner
H Grabner H Bischof On-line boosting and vision CVPR 2006
29150
Tracking the ldquoInvisiblerdquo
H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010
30150
video
Other Tracking Problems
helliphellip multiple object tracking hellip
32150
Multi-object Tracking
Other Tracking Problems
Cell division
httpwwwyoutubecomwatchv=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
httpwwwyoutubecomwatchv=YFKA647w4Jg
splitting and merging events hellip
34150
So I want to track
35150
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
What is here in libraries OpenCV
bull KLT tracker (1981)
bull CAMshift and Meanshift (1998)
bull TLD (2011)
bull MedianFLow (2010)
bull Boosting (2006)
bull MIL (2009)
and KCF (2012) in opencv_contrib38150
What is here in libraries BoofCV
bull SparseFlow (KLT tracker) (1991)
bull MeanShift (1998)
bull TLD (2011)
bull KCF (2012)
39150
httpsgithubcomlessthanoptimalBoofCV
Computer vision librar
Bad news OpenCV
40150httpwwwvotchallengenet
Reference implementation
41150
But authors often publish their own implementation on githubhellip
git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit
Good news
42150
Good news Not so
43150
Good news Not so
44150
Good newsEasy to contribute to open source
Just implement some modern tracking algorithm
45150
So we need to understand how tracking works
Classic
KLT tracker
46150
The KLT Tracker
47150Slide credit Kris Kitani
KLT Tracker
slide creditTomas Svoboda
48150
Importance in Computer Vision
bull Firstly published in 1981 as an image registration method [3]
bull Improved many times most importantly by Carlo Tomasi [54]
bull Free implementaton(s) available1
bull After more than two decades a project2 at CMU dedicated to this
bull single algorithm and results published in a premium journal [1]
bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html
Image alignment
slide creditTomas Svoboda 49150
Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image
Original Lucas-Kanade algorithm I
slide creditTomas Svoboda 50150
Original Lucas-Kanade algorithm II
slide creditTomas Svoboda 51150
Original Lucas-Kanade algorithm III
slide creditTomas Svoboda 52150
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
bull cost function sum of squared intensity differences
between template and window
bull optimization technique gradient descent
bull model learning no update last frame convex
combination
bull attractive properties
ndashfast
ndasheasily extended to image-to-image transformations with
multiple parameters
59150
What about deep
learning or
CNN for tracking
60150
61150
AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo
Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016
Recap CaffeNet architecture
63
Recent trackers development
74150httpsgithubcomfoolwoodbenchmark_results
CNN KCF
Discriminative Tracking (T by Detection)
t=0
samples
labels+1 +1 +1 minus1 minus1 minus1
Classifier
tgt0
hellipClassify subwindows to find target
75
Connection to Correlation
The Convolution Theorem
ldquoCross-correlation is equivalent to an
element-wise product in Fourier domainrdquo
bull where
ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858
(likewise for ො119857 and ෝ119856)
ndashtimes is element-wise product
ndash lowast is complex-conjugate (ie negate imaginary part)
119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺
bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)
76
Kernelized Correlation Filters
bull Circulant matrices are a very general tool which allows to replace
standard operations with fast Fourier operations
bull The same idea can by applied eg to the Kernel Ridge Regression
with K kernel matrix Kij = (xi xj) and dual space representation
bull For many kernels circulant data circulant K matrix
bull Diagonalizing with the DFT for learning the classifier yields
120630 = 119870 + 120582119868 minus1119858
119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)
ෝ120630 =ො119858
መ119844119857119857+ 120582
Fast solution in 119978 119899 log 119899 Typical kernel algorithms are
119978 1198992 or higher
rArr
77
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
bull very few
hyperparameters
bull code fits on one slide
of the presentation
bull Use HoG features
(32 channels)
bull ~300 FPS
bull Open-Source
(MatlabPythonJavaC)
function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)
end
function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))
end
function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))
end
Training and detection (Matlab)
Sum over channel dimensionin kernel computation
79
From KCF to Discriminative CF trackers
Basic
bull Henriques et al ndash CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
bull Danelljan et al ndashSRDCF
ndash spatial regularization in the learning process
rarr limits boundary effect
rarr penalize filter coefficients depending on their spatial location
ndash allows to use much larger search region
ndash more discriminative to background (more training data)
CNN-based Correlation Trackers
bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT
2016)
bull Ma et al
ndash features VGG-Net pretrained on ImageNet dataset extracted from
third fourth and fifth convolution layer
ndash for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
bull Nam et al ndash MDNet T-CNN
ndash CNN classification (3 convolution layers and 2 fully connected layers)
81
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

Challenge | Perf. measures   | Dataset size   | Target box | Property  | Trackers tested
VOT2013   | ranks, A, R      | 16, s. manual  | manual     | per frame | 27
VOT2014   | ranks, A, R, EFO | 25, s. manual  | manual     | per frame | 38
VOT2015   | EAO, A, R, EFO   | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016   | EAO, A, R, EFO   | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
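The A (accuracy) and R (robustness) measures listed above can be illustrated with a minimal sketch. It assumes boxes are (x, y, width, height) tuples, takes accuracy as the mean overlap (intersection over union) on frames where the tracker still overlaps the target, and counts every zero-overlap frame as one failure. The function names are illustrative, and the re-initialisation after a failure that the real evaluation protocol performs is left out.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over frames with non-zero overlap, plus the failure count.

    Simplification (assumption): a frame with zero overlap is one failure;
    no re-initialisation or burn-in period is simulated.
    """
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


# Three frames: perfect overlap, half-shifted box, target lost.
truth = [(0, 0, 10, 10)] * 3
guess = [(0, 0, 10, 10), (5, 0, 10, 10), (100, 100, 10, 10)]
accuracy, failures = accuracy_and_failures(guess, truth)
```

In this toy run the overlaps are 1, 1/3 and 0, so the sketch reports one failure and averages only the first two frames.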
Class of trackers tested
• Single-object, single-camera
• Short-term
– trackers performing without re-detection
• Causality
– tracker is not allowed to use any future frames
• No prior knowledge about the target
– only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
• Candidate pool: ALOV (315 seq.) [Smeulders et al. 2013] + PTR (~50 seq.) [Vojir et al. 2013] + OTB (~50 seq.) [Wu et al. 2013] + >30 new sequences from the VOT2015 committee = 443 sequences
• Filtered out
– grayscale sequences
– targets smaller than 400 pixels
– poorly defined targets
– artificially created sequences
• 356 sequences remain
• VOT Automatic Dataset Construction Protocol: cluster + sample
[Figures: example of a poorly defined target; example of an artificially created sequence]
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for the 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Pipeline: feature encoding (11-dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences
Construction (3/3): Sampling
• Requirements
– diverse visual attributes
– challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that samples challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
1. Segment the target (semi-automatic)
2. Automatically fit a bounding box by optimizing a cost function
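Step 2 can be sketched as a brute-force search over axis-aligned boxes. The cost used here (foreground pixels left outside the box plus alpha times the background pixels inside it) and the names `fit_box` and `alpha` are assumptions made for illustration; the slides do not spell out the cost function that VOT actually optimizes.

```python
def fit_box(mask, alpha=1.0):
    """Return (top, left, bottom, right), inclusive, minimising the assumed cost
    (#foreground pixels outside the box) + alpha * (#background pixels inside)."""
    h, w = len(mask), len(mask[0])
    # Integral image: integral[r][c] = number of foreground pixels in rows < r, cols < c.
    integral = [[0] * (w + 1) for _ in range(h + 1)]
    for r in range(h):
        for c in range(w):
            integral[r + 1][c + 1] = (mask[r][c] + integral[r][c + 1]
                                      + integral[r + 1][c] - integral[r][c])
    total_fg = integral[h][w]
    best_cost, best_box = None, None
    for top in range(h):
        for bottom in range(top, h):
            for left in range(w):
                for right in range(left, w):
                    fg_in = (integral[bottom + 1][right + 1] - integral[top][right + 1]
                             - integral[bottom + 1][left] + integral[top][left])
                    area = (bottom - top + 1) * (right - left + 1)
                    cost = (total_fg - fg_in) + alpha * (area - fg_in)
                    if best_cost is None or cost < best_cost:
                        best_cost, best_box = cost, (top, left, bottom, right)
    return best_box


# A 3x3 blob with a one-pixel protrusion on its right side.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
tight_box = fit_box(mask, alpha=0.4)   # cheap background: box covers the protrusion
robust_box = fit_box(mask, alpha=1.0)  # costly background: protrusion is cut off
```

The weight alpha decides whether thin protrusions of the segmentation are worth enlarging the box for, which is the kind of trade-off a cost-based fit makes explicit.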
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• Bounding boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
• (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (vertical axis: accuracy) with C-COT, TCNN, SSAT and MLDF labelled]
VOT 2016 Tracking speed
• Top performers are the slowest
– plausible cause: CNN
• Real-time bound: Staple+
– decent accuracy
– decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it also measured the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: tracking speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ labelled]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers
– target learning (modelling, “template update”)
– integration of detection and temporal smoothness assumptions
– representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)
Area of expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Tracking in 6D
Tracking-Learning-Detection (TLD)
A “miracle”: Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof, On-line boosting and vision, CVPR 2006
Tracking the “Invisible”
H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010
[Video]
Other Tracking Problems
… multiple object tracking …
Multi-object Tracking
Other Tracking Problems
Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg
splitting and merging events …
So I want to track
Just link to some computer vision lib
import cv2
or
#include <opencv2/core.hpp>
or
import org.opencv.core.Core;
import org.opencv.core.Mat;
What is here in libraries: OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib
What is here in libraries: BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV
Computer vision librar
Bad news: OpenCV
http://www.votchallenge.net
Reference implementation
But authors often publish their own implementation on GitHub…
git clone https://github.com/martin-danelljan/Continuous-ConvOp.git
Good news
Good news? Not so
Good news: Easy to contribute to open source
Just implement some modern tracking algorithm
So we need to understand how tracking works
Classic: KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [54]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
Slide credit: Tomas Svoboda
The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.
Original Lucas-Kanade algorithm I
Original Lucas-Kanade algorithm II
Original Lucas-Kanade algorithm III
Original Lucas-Kanade algorithm IV
Original Lucas-Kanade summary
Slide credit (all five slides): Tomas Svoboda
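The derivation slides themselves are images, so here is a minimal one-dimensional sketch of the alignment idea, restricted to a pure translation p. It assumes the template and the image are available as functions of a real-valued coordinate, uses a central-difference gradient, and applies Gauss-Newton updates to reduce the sum of squared differences. `lucas_kanade_1d` and its parameters are illustrative names, not code from the lecture.

```python
import math


def lucas_kanade_1d(template, image, xs, p=0.0, iterations=20, eps=1e-3):
    """Estimate the shift p minimising sum over xs of (image(x + p) - template(x))**2."""
    for _ in range(iterations):
        num = den = 0.0
        for x in xs:
            # Central-difference gradient of the image at the warped position.
            grad = (image(x + p + eps) - image(x + p - eps)) / (2 * eps)
            error = template(x) - image(x + p)
            num += grad * error
            den += grad * grad
        if den == 0.0:       # uniform patch: the gradient carries no information
            break
        step = num / den     # Gauss-Newton update for a single parameter
        p += step
        if abs(step) < 1e-6:
            break
    return p


def image(x):
    """Toy image: a smooth bump centred at x = 10."""
    return math.exp(-((x - 10.0) ** 2) / 18.0)


def template(x):
    """Toy template: the same bump displaced by 1.5 samples."""
    return image(x + 1.5)


estimated_shift = lucas_kanade_1d(template, image, range(0, 21))
flat_shift = lucas_kanade_1d(lambda x: 1.0, lambda x: 1.0, range(0, 21))
```

On a constant signal the denominator is zero and the estimate stays at its initial value, which is the one-dimensional version of the conditioning issue discussed for KLT patches.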
KLT Tracker
For good conditioning, the patch must be textured/structured enough
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]
Slide credit: Patrick Perez
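One common way to test this conditioning is the smaller eigenvalue of the 2x2 matrix of summed gradient products over the patch. The sketch below assumes a patch given as a list of rows of floats and central-difference gradients on the patch interior; `min_eigenvalue` is an illustrative name, and any acceptance threshold would be an application choice.

```python
import math


def min_eigenvalue(patch):
    """Smaller eigenvalue of the 2x2 gradient matrix summed over the patch interior."""
    gxx = gxy = gyy = 0.0
    for r in range(1, len(patch) - 1):
        for c in range(1, len(patch[0]) - 1):
            ix = (patch[r][c + 1] - patch[r][c - 1]) / 2.0
            iy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
    trace = gxx + gyy
    root = math.sqrt((gxx - gyy) ** 2 + 4.0 * gxy * gxy)
    return (trace - root) / 2.0


size = 6
uniform = [[1.0] * size for _ in range(size)]
edge = [[1.0 if c >= 3 else 0.0 for c in range(size)] for r in range(size)]
corner = [[1.0 if r >= 3 and c >= 3 else 0.0 for c in range(size)] for r in range(size)]

scores = {name: min_eigenvalue(p)
          for name, p in (("uniform", uniform), ("edge", edge), ("corner", corner))}
```

The uniform patch and the straight edge both score zero (no information, or information in one direction only), while the corner scores clearly above zero.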
Aperture Problem Example
Image from Gary Bradski slides
If we look through a small hole…
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
– Arbitrary displacement
• Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
– Translation is usually sufficient for small fragments, but
• perspective transforms and occlusions cause drift and loss
– Two complementary options
• kill tracklets when the minimum SSD is too large
• compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
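The first option can be sketched as a threshold on the mean squared difference between the template and the current patch. The threshold value and the names `ssd` and `keep_tracklet` are hypothetical; a real tracker would tune the threshold to its intensity range and patch size.

```python
def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(patch_a, patch_b)
               for a, b in zip(row_a, row_b))


def keep_tracklet(template, current, max_mean_ssd=0.01):
    """Keep the tracklet only while the per-pixel SSD stays below the threshold."""
    pixels = len(template) * len(template[0])
    return ssd(template, current) / pixels <= max_mean_ssd


template = [[0.1, 0.2], [0.3, 0.4]]
drifted = [[0.9, 0.2], [0.3, 0.4]]   # one pixel changed strongly, e.g. by occlusion

still_good = keep_tracklet(template, template)
lost = not keep_tracklet(template, drifted)
```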
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties
– fast
– easily extended to image-to-image transformations with multiple parameters
What about deep learning or CNN for tracking?
AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016
Recap: CaffeNet architecture
Recent trackers development
https://github.com/foolwood/benchmark_results
[Figure labels: CNN, KCF]
Discriminative Tracking (Tracking by Detection)
• t = 0: samples with labels (+1 +1 +1 −1 −1 −1) train the classifier
• t > 0: classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
“Cross-correlation is equivalent to an element-wise product in the Fourier domain”
$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$
• where
– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
– $\times$ is the element-wise product
– $*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
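The theorem can be checked numerically on a short one-dimensional signal. This sketch assumes real-valued signals, the convention y[k] = sum over j of x[j] * w[(j + k) mod n] for cyclic cross-correlation, and a naive O(n^2) DFT written with the standard library in place of an FFT; all function names are illustrative.

```python
import cmath


def dft(signal, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = [sum(signal[j] * cmath.exp(sign * 2j * cmath.pi * f * j / n) for j in range(n))
           for f in range(n)]
    return [v / n for v in out] if inverse else out


def cyclic_cross_correlation(x, w):
    """Direct evaluation: y[k] = sum_j x[j] * w[(j + k) mod n]."""
    n = len(x)
    return [sum(x[j] * w[(j + k) % n] for j in range(n)) for k in range(n)]


def cross_correlation_fourier(x, w):
    """Same result via conj(DFT(x)) * DFT(w), element-wise, then the inverse DFT."""
    product = [a.conjugate() * b for a, b in zip(dft(x), dft(w))]
    return [v.real for v in dft(product, inverse=True)]


x = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
w = [x[(i - 2) % len(x)] for i in range(len(x))]   # x moved two samples to the right

direct = cyclic_cross_correlation(x, w)
fourier = cross_correlation_fourier(x, w)
peak = max(range(len(x)), key=lambda k: fourier[k])
```

Both routes give the same values, and the peak sits at the displacement between the two signals, which is what a correlation tracker reads off the response map.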
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual space representation
$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$
• For many kernels, circulant data $\Rightarrow$ circulant $K$ matrix
$K = C(\mathbf{k}^{\mathbf{x}\mathbf{x}})$, where $\mathbf{k}^{\mathbf{x}\mathbf{x}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{x}\mathbf{x}} + \lambda}$
Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{x}\mathbf{x}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$, with elements
$k_i^{\mathbf{x}\mathbf{x}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$
• For the Gaussian kernel it yields
$\mathbf{k}^{\mathbf{x}\mathbf{x}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$
– multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$
1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{x}\mathbf{z}})$
2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{x}\mathbf{z}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)
Training and detection (Matlab)
function alphaf = train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);
end
function y = detect(alphaf, x, z, sigma)
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));
end
function k = kernel_correlation(x1, x2, sigma)
  % sum over the channel dimension (3) in the kernel computation
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end
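For readers without Matlab, the same three functions can be sketched in Python for a one-dimensional, single-channel signal. This is an illustration under stated assumptions, not the authors' implementation: a naive O(n^2) DFT from the standard library stands in for fft2, the toy signal, label shape and parameter values are made up, and with the argument order used on the slide the response peak lands at index s when z[j] = x[(j + s) mod n].

```python
import cmath
import math


def dft(signal, inverse=False):
    """Naive discrete Fourier transform (stand-in for fft2 / ifft2)."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = [sum(signal[j] * cmath.exp(sign * 2j * cmath.pi * f * j / n) for j in range(n))
           for f in range(n)]
    return [v / n for v in out] if inverse else out


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x1 and every cyclic shift of x2."""
    n = len(x1)
    cf = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    c = [v.real for v in dft(cf, inverse=True)]
    energy = sum(v * v for v in x1) + sum(v * v for v in x2)
    return [math.exp(-abs(energy - 2.0 * ci) / (sigma ** 2 * n)) for ci in c]


def train(x, y, sigma, lam):
    """Dual coefficients in the Fourier domain: alphaf = yf / (kf + lambda)."""
    kf = dft(kernel_correlation(x, x, sigma))
    return [yf / (k + lam) for yf, k in zip(dft(y), kf)]


def detect(alphaf, x, z, sigma):
    """Response for all cyclic shifts of the candidate patch z."""
    kf = dft(kernel_correlation(z, x, sigma))
    return [v.real for v in dft([a * k for a, k in zip(alphaf, kf)], inverse=True)]


n = 16
x = [math.exp(-0.5 * ((i - 5) / 2.0) ** 2) for i in range(n)]        # toy template
y = [math.exp(-0.5 * (min(i, n - i) / 1.5) ** 2) for i in range(n)]  # label, peak at 0
alphaf = train(x, y, sigma=0.5, lam=1e-4)

shift = 3
z = [x[(i + shift) % n] for i in range(n)]                           # displaced target
response = detect(alphaf, x, z, sigma=0.5)
peak = max(range(n), key=lambda i: response[i])
peak_same = max(range(n), key=lambda i: detect(alphaf, x, x, sigma=0.5)[i])
```

Detection on the training patch itself reproduces the label peak at index 0, and the displaced patch moves the peak by the displacement, so the tracker only has to locate the maximum of the response.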
From KCF to Discriminative CF trackers
Basic
• Henriques et al. – CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
bull Danelljan et al ndashSRDCF
ndash spatial regularization in the learning process
rarr limits boundary effect
rarr penalize filter coefficients depending on their spatial location
ndash allows to use much larger search region
ndash more discriminative to background (more training data)
CNN-based Correlation Trackers
bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT
2016)
bull Ma et al
ndash features VGG-Net pretrained on ImageNet dataset extracted from
third fourth and fifth convolution layer
ndash for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
bull Nam et al ndash MDNet T-CNN
ndash CNN classification (3 convolution layers and 2 fully connected layers)
81
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT challenge evolution
bull Gradual increase of dataset size
bull Gradual refinement of dataset construction
bull Gradual refinement of performance measures
bull Gradual increase of tested trackers
Perf Measures Dataset size Target box Property Trackers tested
VOT2013 ranks A R 16 s manual manual per frame 27
VOT2014 ranks A R EFO 25 s manual manual per frame 38
VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR
VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR
8739
Class of trackers tested
bull Single-object single-camera
bull Short-term
ndashTrackers performing without re-detection
bull Causality
ndashTracker is not allowed to use any future frames
bull No prior knowledge about the target
ndashOnly a single training example ndash BBox in the first frame
bull Object state encoded by a bounding box
88150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (13) Sequence candidates
ALOV (315 seq)[Smeulders et al2013]
Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences
Example Poorly defined target Example Artificially created
356 sequences
PTR (~50 seq)[Vojir et al2013]
+OTB (~50 seq)
[Wu et al2013]
+
gt30 new sequencesfrom VOT2015
committee
+
443sequences
VOT Automatic Dataset Construction Protocol
cluster + sample
8939
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (23) Clustering
bull Approximately annotate targets
bull 11 global attributes estimated
automatically for 356 sequences (eg blur camera motion object motion)
bull Cluster into K = 28 groups (automatic selection of K)
Feature encoding
11 dim
Affinity Propagation [Frey Dueck 2007]
Cluster similar sequences
9039
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (33) Sampling
bull Requirement
bull Diverse visual attributes
bull Challenging subset
bull Global visual attributes computed
bull Tracking difficulty attribute Applied FoT ASMS KCF trackers
bull Developed a sampling strategy that sampled
challenging sequences while keeping the global
attributes diverse
9139
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT201516 dataset 60 sequences
9239
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Object annotation
bull Automatic bounding box placement
1 Segment the target (semi-automatic)
2 Automatically fit a bounding box by optimizing a cost function
9339
Kristan et al VOT2016 results
Sequence ranking
bull Among the most challenging sequences
bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)
9442
Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)
Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT Results Realtime
bull Flow-based Mean Shift-based Correlation filters
bull Engineering use of basic features
2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)
2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)
ASMSBDF
FoT
2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)
PLTFoT
CCMS
9639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Results
bull C-COT slightly ahead of TCNN
bull Most accurate SSAT
bull Most robust C-COT and MLDF
Overlap curves
9742
(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple
C-COT
TCNNSSAT
MLDF
AR-raw plot
Acc
ura
cy
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Tracking speed
bull Top-performers slowest bull Plausible cause CNN
bull Real-time bound Staple+
bull Decent accuracy
bull Decent robustness
Note the speed in some Matlab trackers has been significantly underestimated by the toolkit since it was measuring also the Matlabrestart time The EFOs of Matlab trackers are in fact higher than stated in this figure
9842
C-COT TCNNSSAT MLDF
Staple+
Staple+
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT public resources
bull Resources publicly available VOT page
bull Raw results of all tested trackers
bull Relevant methodology papers
bull 2016 Submitted trackers codebinaries
bull All fully annotated datasets (2013-2016)
bull Documentation tutorials forum
9939
Summary
bull ldquoVisual Trackingrdquo may refer to quite different problems
bull The area is just starting to be affected by CNNs
bull Robustness at all levels is the road to reliable performance
bull Key components of trackers
ndash target learning (modelling ldquotemplate updaterdquo)
ndash integration of detection and temporal smoothness assumptions
ndash representation of the image and target
bull Be careful when evaluating tracking results
100150
Computer vision courses online
bull httpscwfelcvutczwikicoursesucuws17start UCU Winter
school course
bull httpcsbrowneducoursescs143
bull httpswwwudacitycomcourseintroduction-to-computer-vision-
-ud810
bull httpcs231nstanfordedu
bull httpvisionstanfordeduteachingcs131_fall1415indexhtml
101150
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University Prague
established 1707
Area of Expertise Computer Vision Image Processing Pattern Recognition
People12 Academics 3 full 2 associate 7 assistant professors15 Researchers15 Phd students 15 MSc and BSc students
Impact amp Excellencesignificant of funding (gt75) from EU and private companies collaboration with high-tech companies (Microsoft Google Toyota VW Daimler Samsung Boeing Hitachi Honeywell) numerous science prizes hundreds of ISI citations to our publications per year competitive results in contests
Visual Recognition Group (headJiri Matas)
We are looking for 1-3
good students for PhD
THANK YOU
Questions please
105150
Tracking-Learning-Detection (TLD)
28150
A ldquomiraclerdquo Tracking a Transparent Object
video credit
Helmut
Grabner
H Grabner H Bischof On-line boosting and vision CVPR 2006
29150
Tracking the ldquoInvisiblerdquo
H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010
30150
video
Other Tracking Problems
helliphellip multiple object tracking hellip
32150
Multi-object Tracking
Other Tracking Problems
Cell division
httpwwwyoutubecomwatchv=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
httpwwwyoutubecomwatchv=YFKA647w4Jg
splitting and merging events hellip
34150
So I want to track
35150
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
What is here in libraries OpenCV
bull KLT tracker (1981)
bull CAMshift and Meanshift (1998)
bull TLD (2011)
bull MedianFLow (2010)
bull Boosting (2006)
bull MIL (2009)
and KCF (2012) in opencv_contrib38150
What is here in libraries BoofCV
bull SparseFlow (KLT tracker) (1991)
bull MeanShift (1998)
bull TLD (2011)
bull KCF (2012)
39150
httpsgithubcomlessthanoptimalBoofCV
Computer vision librar
Bad news OpenCV
40150httpwwwvotchallengenet
Reference implementation
41150
But authors often publish their own implementation on githubhellip
git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit
Good news
42150
Good news Not so
43150
Good news Not so
44150
Good newsEasy to contribute to open source
Just implement some modern tracking algorithm
45150
So we need to understand how tracking works
Classic
KLT tracker
46150
The KLT Tracker
47150Slide credit Kris Kitani
KLT Tracker
slide creditTomas Svoboda
48150
Importance in Computer Vision
bull Firstly published in 1981 as an image registration method [3]
bull Improved many times most importantly by Carlo Tomasi [54]
bull Free implementaton(s) available1
bull After more than two decades a project2 at CMU dedicated to this
bull single algorithm and results published in a premium journal [1]
bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html
Image alignment
slide creditTomas Svoboda 49150
Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image
Original Lucas-Kanade algorithm I
slide creditTomas Svoboda 50150
Original Lucas-Kanade algorithm II
slide creditTomas Svoboda 51150
Original Lucas-Kanade algorithm III
slide creditTomas Svoboda 52150
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
bull cost function sum of squared intensity differences
between template and window
bull optimization technique gradient descent
bull model learning no update last frame convex
combination
bull attractive properties
ndashfast
ndasheasily extended to image-to-image transformations with
multiple parameters
59150
What about deep
learning or
CNN for tracking
60150
61150
AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo
Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016
Recap CaffeNet architecture
63
Recent trackers development
74150httpsgithubcomfoolwoodbenchmark_results
CNN KCF
Discriminative Tracking (T by Detection)
t=0
samples
labels+1 +1 +1 minus1 minus1 minus1
Classifier
tgt0
hellipClassify subwindows to find target
75
Connection to Correlation
The Convolution Theorem
ldquoCross-correlation is equivalent to an
element-wise product in Fourier domainrdquo
bull where
ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858
(likewise for ො119857 and ෝ119856)
ndashtimes is element-wise product
ndash lowast is complex-conjugate (ie negate imaginary part)
119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺
bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)
76
Kernelized Correlation Filters
bull Circulant matrices are a very general tool which allows to replace
standard operations with fast Fourier operations
bull The same idea can by applied eg to the Kernel Ridge Regression
with K kernel matrix Kij = (xi xj) and dual space representation
bull For many kernels circulant data circulant K matrix
bull Diagonalizing with the DFT for learning the classifier yields
120630 = 119870 + 120582119868 minus1119858
119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)
ෝ120630 =ො119858
መ119844119857119857+ 120582
Fast solution in 119978 119899 log 119899 Typical kernel algorithms are
119978 1198992 or higher
rArr
77
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
bull very few
hyperparameters
bull code fits on one slide
of the presentation
bull Use HoG features
(32 channels)
bull ~300 FPS
bull Open-Source
(MatlabPythonJavaC)
function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)
end
function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))
end
function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))
end
Training and detection (Matlab)
Sum over channel dimensionin kernel computation
79
From KCF to Discriminative CF trackers
Basic
bull Henriques et al ndash CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
bull Danelljan et al ndashSRDCF
ndash spatial regularization in the learning process
rarr limits boundary effect
rarr penalize filter coefficients depending on their spatial location
ndash allows to use much larger search region
ndash more discriminative to background (more training data)
CNN-based Correlation Trackers
bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT
2016)
bull Ma et al
ndash features VGG-Net pretrained on ImageNet dataset extracted from
third fourth and fifth convolution layer
ndash for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
bull Nam et al ndash MDNet T-CNN
ndash CNN classification (3 convolution layers and 2 fully connected layers)
81
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT challenge evolution
bull Gradual increase of dataset size
bull Gradual refinement of dataset construction
bull Gradual refinement of performance measures
bull Gradual increase of tested trackers
Perf Measures Dataset size Target box Property Trackers tested
VOT2013 ranks A R 16 s manual manual per frame 27
VOT2014 ranks A R EFO 25 s manual manual per frame 38
VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR
VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR
8739
Class of trackers tested
bull Single-object single-camera
bull Short-term
ndashTrackers performing without re-detection
bull Causality
ndashTracker is not allowed to use any future frames
bull No prior knowledge about the target
ndashOnly a single training example ndash BBox in the first frame
bull Object state encoded by a bounding box
88150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (13) Sequence candidates
ALOV (315 seq)[Smeulders et al2013]
Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences
Example Poorly defined target Example Artificially created
356 sequences
PTR (~50 seq)[Vojir et al2013]
+OTB (~50 seq)
[Wu et al2013]
+
gt30 new sequencesfrom VOT2015
committee
+
443sequences
VOT Automatic Dataset Construction Protocol
cluster + sample
8939
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
[Diagram: feature encoding (11-dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences]
Construction (3/3): Sampling
• Requirement:
  - Diverse visual attributes
  - Challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied FoT, ASMS and KCF trackers
• Developed a sampling strategy that samples challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT results: real-time
• Flow-based, Mean Shift-based, correlation filters
• Engineering use of basic features
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
• Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (accuracy axis), with C-COT, TCNN, SSAT and MLDF labelled]
VOT 2016: tracking speed
• Top performers are the slowest
  - Plausible cause: CNN
• Real-time bound: Staple+
  - Decent accuracy
  - Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it also measured the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ labelled]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics, Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)
• Area of expertise: Computer Vision, Image Processing, Pattern Recognition
• People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
• Impact & Excellence: a significant part of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
A "miracle": Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof. On-line boosting and vision. CVPR 2006
Tracking the "Invisible"
H. Grabner, J. Matas, L. Van Gool, P. Cattin. Tracking the invisible: learning where the object might be. CVPR 2010
[video]
Other Tracking Problems: multi-object tracking
Other Tracking Problems: cell division (splitting and merging events)
• Cell division: http://www.youtube.com/watch?v=rgLJrvoX_qo
• Three rounds of cell division in Drosophila Melanogaster: http://www.youtube.com/watch?v=YFKA647w4Jg
So I want to track
Just link to some computer vision lib:
    import cv2                      (Python)
or
    #include <opencv2/core.hpp>     (C++)
or
    import org.opencv.core.Core;    (Java)
    import org.opencv.core.Mat;
What is there in the libraries? OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
• and KCF (2012) in opencv_contrib
What is there in the libraries? BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV
Bad news: OpenCV
http://www.votchallenge.net
Reference implementation
But authors often publish their own implementation on GitHub...
    git clone https://github.com/martin-danelljan/Continuous-ConvOp.git
Good news
Good news? Not so
Good news: easy to contribute to open source
Just implement some modern tracking algorithm
So we need to understand how tracking works
Classic: KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU was dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
Slide credit: Tomas Svoboda
The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) can also be a small subwindow within an image.
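The equations of the following slides did not survive extraction, so the alignment goal is written out here in the standard Lucas-Kanade form. The warp W(x; p) with parameter vector p and the update Δp are notation assumed for this sketch and may differ from the original slides.

```latex
% Alignment objective: find warp parameters p that minimize the
% sum of squared differences between the warped image and the template.
\[
  \min_{\mathbf{p}} \sum_{\mathbf{x}}
  \bigl[ I(W(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \bigr]^2
\]
% Gauss-Newton step: linearize I around the current p and solve for the update.
\[
  \Delta\mathbf{p} = H^{-1} \sum_{\mathbf{x}}
  \Bigl[ \nabla I \,\tfrac{\partial W}{\partial \mathbf{p}} \Bigr]^{\top}
  \bigl[ T(\mathbf{x}) - I(W(\mathbf{x};\mathbf{p})) \bigr],
  \qquad
  H = \sum_{\mathbf{x}}
  \Bigl[ \nabla I \,\tfrac{\partial W}{\partial \mathbf{p}} \Bigr]^{\top}
  \Bigl[ \nabla I \,\tfrac{\partial W}{\partial \mathbf{p}} \Bigr]
\]
```

For a pure translation, W(x; p) = x + p and the Jacobian ∂W/∂p is the identity, so H reduces to the 2×2 matrix of summed gradient products.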
Original Lucas-Kanade algorithm I-IV
Original Lucas-Kanade summary
Slide credit: Tomas Svoboda
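As an illustration of the iteration these slides walk through, here is a minimal sketch of translation-only Lucas-Kanade with Gauss-Newton updates. It assumes NumPy and a grayscale float image; the function names, the bilinear sampler and the stopping rule are choices made for this sketch rather than the implementation from the slides.

```python
import numpy as np


def bilinear_sample(img, ys, xs):
    """Sample img at float coordinates (ys, xs), clamping at the border."""
    h, w = img.shape
    xs = np.clip(xs, 0, w - 1.001)
    ys = np.clip(ys, 0, h - 1.001)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    fx, fy = xs - x0, ys - y0
    return ((1 - fy) * (1 - fx) * img[y0, x0] + (1 - fy) * fx * img[y0, x0 + 1]
            + fy * (1 - fx) * img[y0 + 1, x0] + fy * fx * img[y0 + 1, x0 + 1])


def lucas_kanade_translation(template, image, top_left, n_iters=20, tol=1e-4):
    """Estimate p = (dx, dy) such that image(x + p) matches template(x).

    top_left is the (row, col) of the template window in image coordinates.
    """
    th, tw = template.shape
    gy, gx = np.mgrid[0:th, 0:tw].astype(float)
    gy += top_left[0]
    gx += top_left[1]
    grad_y, grad_x = np.gradient(image)        # image derivatives
    p = np.zeros(2)
    for _ in range(n_iters):
        ys, xs = gy + p[1], gx + p[0]
        error = (template - bilinear_sample(image, ys, xs)).ravel()
        # For a translation the warp Jacobian is the identity, so the
        # steepest-descent images are just the warped gradients.
        J = np.stack([bilinear_sample(grad_x, ys, xs).ravel(),
                      bilinear_sample(grad_y, ys, xs).ravel()], axis=1)
        dp = np.linalg.solve(J.T @ J, J.T @ error)   # Gauss-Newton step
        p += dp
        if np.linalg.norm(dp) < tol:
            break
    return p


if __name__ == "__main__":
    yy, xx = np.mgrid[0:64, 0:64].astype(float)
    blob = lambda cy, cx: np.exp(-((yy - cy) ** 2 + (xx - cx) ** 2) / (2 * 6.0 ** 2))
    template = blob(32, 32)[17:48, 17:48]
    image = blob(31.2, 33.5)                   # target moved by dx=1.5, dy=-0.8
    print(lucas_kanade_translation(template, image, (17, 17)))
```

The 2×2 system `J.T @ J` is the quantity whose conditioning determines whether a patch can be tracked reliably.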
KLT Tracker
For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
Slide credit: Patrick Perez
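The three cases above can be checked numerically with the smaller eigenvalue of the 2×2 matrix of summed gradient products, which is the criterion used by Tomasi and Shi. This is a minimal sketch assuming NumPy; the function name and the synthetic patches are illustrative.

```python
import numpy as np


def min_eigenvalue(patch):
    """Smaller eigenvalue of the 2x2 gradient matrix of a grayscale patch."""
    iy, ix = np.gradient(patch.astype(float))
    g = np.array([[np.sum(ix * ix), np.sum(ix * iy)],
                  [np.sum(ix * iy), np.sum(iy * iy)]])
    return np.linalg.eigvalsh(g)[0]   # eigvalsh returns eigenvalues in ascending order


if __name__ == "__main__":
    uniform = np.ones((16, 16))
    edge = np.zeros((16, 16))
    edge[:, :8] = 1.0                 # vertical contour: gradient in x only
    corner = np.zeros((16, 16))
    corner[:8, :8] = 1.0              # gradients in both x and y
    for name, patch in [("uniform", uniform), ("edge", edge), ("corner", corner)]:
        print(name, min_eigenvalue(patch))
```

Only the corner patch has a clearly positive smaller eigenvalue, so only there is the displacement well determined in both directions.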
Aperture Problem Example
Image from Gary Bradski slides
If we look through a small hole...
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
- Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
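The first option, dropping a tracklet once its residual grows too large, could look like the sketch below. It assumes NumPy and equally sized patches; the per-pixel normalisation and the threshold value are hypothetical choices, not taken from the slides.

```python
import numpy as np


def ssd(template, window):
    """Sum of squared intensity differences between two equally sized patches."""
    diff = template.astype(float) - window.astype(float)
    return float(np.sum(diff * diff))


def keep_tracklet(template, window, max_ssd_per_pixel=100.0):
    """Return False when the residual suggests drift or occlusion.

    max_ssd_per_pixel is a hypothetical threshold and has to be tuned per application.
    """
    return ssd(template, window) / template.size <= max_ssd_per_pixel


if __name__ == "__main__":
    patch = np.full((8, 8), 120.0)
    print(keep_tracklet(patch, patch + 2.0))    # small residual, tracklet kept
    print(keep_tracklet(patch, patch + 50.0))   # large residual, tracklet killed
```

The second option would run the same check against the initial patch after an affine warp instead of against the previous frame.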
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters
What about deep learning, or CNNs for tracking?
Recap: CaffeNet architecture
• AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
• CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
• Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
• Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016
Recent trackers development
https://github.com/foolwood/benchmark_results
[Figure labels: CNN, KCF]
Discriminative Tracking (Tracking by Detection)
• t = 0: samples with labels (+1, +1, +1, −1, −1, −1) are used to train a classifier
• t > 0: classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
"Cross-correlation is equivalent to an element-wise product in Fourier domain"
$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$
• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
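The theorem is easy to verify numerically. The sketch below assumes NumPy and 1-D real signals, and compares a direct cyclic cross-correlation with the Fourier-domain product; the function names are illustrative.

```python
import numpy as np


def cross_correlation_direct(x, w):
    """Cyclic cross-correlation y[m] = sum_i x[i] * w[(i + m) mod n], computed in O(n^2)."""
    n = len(x)
    return np.array([sum(x[i] * w[(i + m) % n] for i in range(n)) for m in range(n)])


def cross_correlation_fft(x, w):
    """Same result via the DFT: y_hat = conj(x_hat) * w_hat, computed in O(n log n)."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, w = rng.standard_normal(16), rng.standard_normal(16)
    print(np.allclose(cross_correlation_direct(x, w), cross_correlation_fft(x, w)))
```

The modulo in the direct version is the cyclic wrap-around that the note above refers to.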
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual space representation
  $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$
• For many kernels, circulant data $\Rightarrow$ circulant $K$ matrix:
  $K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
  $\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$
• Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx'}}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x'}$:
  $k_i^{\mathbf{xx'}} = \kappa(\mathbf{x'}, P^{i-1}\mathbf{x})$
• For the Gaussian kernel it yields
  $\mathbf{k}^{\mathbf{xx'}} = \exp\left(-\dfrac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x'}\|^2 - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$
  Multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the last term.
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)
Training and detection (Matlab):
    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end
    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end
    function k = kernel_correlation(x1, x2, sigma)
      % sum over the channel dimension (3) in the kernel computation
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
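For readers without Matlab, the training and detection step can be sketched in NumPy as below. It assumes patches of shape (height, width, channels) and a Gaussian regression target with its peak at index (0, 0); `lam` replaces `lambda`, which is a reserved word in Python. The synthetic data and parameter values are illustrative, and feature extraction, windowing and the model update are left out.

```python
import numpy as np


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of x1 and x2 for all cyclic shifts."""
    f1 = np.fft.fft2(x1, axes=(0, 1))
    f2 = np.fft.fft2(x2, axes=(0, 1))
    c = np.fft.ifft2(np.sum(np.conj(f1) * f2, axis=2))   # sum over channels
    d = np.vdot(x1, x1) + np.vdot(x2, x2) - 2 * c
    return np.exp(-np.abs(d) / (sigma ** 2 * d.size))


def train(x, y, sigma, lam):
    """Kernel ridge regression solved in the Fourier domain."""
    k = kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)


def detect(alphaf, x, z, sigma):
    """Response map of the learned classifier over all cyclic shifts of z."""
    k = kernel_correlation(z, x, sigma)
    return np.real(np.fft.ifft2(alphaf * np.fft.fft2(k)))


if __name__ == "__main__":
    n = 32
    rng = np.random.default_rng(0)
    x = rng.standard_normal((n, n, 1))                   # synthetic one-channel patch
    dist = np.minimum(np.arange(n), n - np.arange(n))    # cyclic distance to index 0
    y = np.exp(-(dist[:, None] ** 2 + dist[None, :] ** 2) / (2 * 2.0 ** 2))
    alphaf = train(x, y, sigma=0.5, lam=1e-4)
    z = np.roll(x, (3, 5), axis=(0, 1))                  # cyclically shifted target
    response = detect(alphaf, x, z, sigma=0.5)
    print(np.unravel_index(np.argmax(response), response.shape))
```

With this argument order, the response peak moves by the negative of the `np.roll` shift, modulo the patch size. A tracker reads the displacement off the peak location and then interpolates the model towards the new patch.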
From KCF to Discriminative CF trackers
Basic
• Henriques et al. - CSK
  - raw grayscale pixel values as features
• Henriques et al. - KCF
  - HoG multi-channel features
Further work
• Danelljan et al. - DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al. - SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows the use of a much larger search region
  - more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT challenge evolution
bull Gradual increase of dataset size
bull Gradual refinement of dataset construction
bull Gradual refinement of performance measures
bull Gradual increase of tested trackers
Perf Measures Dataset size Target box Property Trackers tested
VOT2013 ranks A R 16 s manual manual per frame 27
VOT2014 ranks A R EFO 25 s manual manual per frame 38
VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR
VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR
8739
Class of trackers tested
bull Single-object single-camera
bull Short-term
ndashTrackers performing without re-detection
bull Causality
ndashTracker is not allowed to use any future frames
bull No prior knowledge about the target
ndashOnly a single training example ndash BBox in the first frame
bull Object state encoded by a bounding box
88150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (13) Sequence candidates
ALOV (315 seq)[Smeulders et al2013]
Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences
Example Poorly defined target Example Artificially created
356 sequences
PTR (~50 seq)[Vojir et al2013]
+OTB (~50 seq)
[Wu et al2013]
+
gt30 new sequencesfrom VOT2015
committee
+
443sequences
VOT Automatic Dataset Construction Protocol
cluster + sample
8939
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (23) Clustering
bull Approximately annotate targets
bull 11 global attributes estimated
automatically for 356 sequences (eg blur camera motion object motion)
bull Cluster into K = 28 groups (automatic selection of K)
Feature encoding
11 dim
Affinity Propagation [Frey Dueck 2007]
Cluster similar sequences
9039
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (33) Sampling
bull Requirement
bull Diverse visual attributes
bull Challenging subset
bull Global visual attributes computed
bull Tracking difficulty attribute Applied FoT ASMS KCF trackers
bull Developed a sampling strategy that sampled
challenging sequences while keeping the global
attributes diverse
9139
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT201516 dataset 60 sequences
9239
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Object annotation
bull Automatic bounding box placement
1 Segment the target (semi-automatic)
2 Automatically fit a bounding box by optimizing a cost function
9339
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty – better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (accuracy axis)]
VOT 2016: Tracking speed
• Top performers are the slowest
– Plausible cause: CNN
• Real-time bound: Staple+
– Decent accuracy
– Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
– target learning (modelling, “template update”)
– integration of detection and temporal smoothness assumptions
– representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)
Area of expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Other Tracking Problems
… multiple object tracking …
Multi-object Tracking
Other Tracking Problems
Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg
splitting and merging events …
So I want to track
Just link to some computer vision lib:
import cv2
or
#include <opencv2/core.hpp>
or
import org.opencv.core.Core;
import org.opencv.core.Mat;
What is here in libraries: OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib
What is here in libraries: BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV
Computer vision library
Bad news: OpenCV
[Figure source: http://www.votchallenge.net]
Reference implementation
But authors often publish their own implementation on GitHub…
git clone https://github.com/martin-danelljan/Continuous-ConvOp.git
Good news
Good news? Not so
Good news: easy to contribute to open source
Just implement some modern tracking algorithm
So we need to understand how tracking works
Classic: KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
Slide credit: Tomas Svoboda
The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.
Original Lucas-Kanade algorithm I
Original Lucas-Kanade algorithm II
Original Lucas-Kanade algorithm III
Original Lucas-Kanade algorithm IV
Original Lucas-Kanade summary
Slide credit: Tomas Svoboda
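The derivation slides survive here only as titles, so the following is a generic textbook sketch rather than the notation of the original slides: a translation-only, forward-additive Lucas-Kanade iteration that aligns a template T(x) to I(x + p) by Gauss-Newton steps. The synthetic image, the bilinear sampler and all function names are assumptions made for the example.

```python
import math


def make_image(w, h):
    """Smooth synthetic image with texture in both directions (assumed test data)."""
    return [[math.sin(0.3 * x) * math.cos(0.25 * y) + 0.5 * math.sin(0.15 * (x + y))
             for x in range(w)] for y in range(h)]


def sample(img, x, y):
    """Bilinear interpolation of img at a real-valued position (x, y)."""
    xi, yi = int(math.floor(x)), int(math.floor(y))
    ax, ay = x - xi, y - yi
    return ((1 - ax) * (1 - ay) * img[yi][xi] + ax * (1 - ay) * img[yi][xi + 1]
            + (1 - ax) * ay * img[yi + 1][xi] + ax * ay * img[yi + 1][xi + 1])


def lucas_kanade_translation(img, template, x0, y0, iters=30):
    """Estimate p = (px, py) such that img(x + p) matches template(x).

    The template is square and is compared with the window whose top-left
    corner is (x0, y0) in img. Each iteration solves H * dp = b, where H sums
    the outer products of the image gradients and b sums gradient * error.
    """
    p = [0.0, 0.0]
    size = len(template)
    for _ in range(iters):
        h11 = h12 = h22 = b1 = b2 = 0.0
        for j in range(size):
            for i in range(size):
                x, y = x0 + i + p[0], y0 + j + p[1]
                gx = 0.5 * (sample(img, x + 1, y) - sample(img, x - 1, y))
                gy = 0.5 * (sample(img, x, y + 1) - sample(img, x, y - 1))
                err = template[j][i] - sample(img, x, y)
                h11 += gx * gx
                h12 += gx * gy
                h22 += gy * gy
                b1 += gx * err
                b2 += gy * err
        det = h11 * h22 - h12 * h12  # near zero for untextured patches
        dp0 = (h22 * b1 - h12 * b2) / det
        dp1 = (h11 * b2 - h12 * b1) / det
        p[0] += dp0
        p[1] += dp1
        if abs(dp0) + abs(dp1) < 1e-6:
            break
    return p
```

The division by `det` is where the conditioning of the patch matters: a uniform patch or a straight edge makes the 2x2 matrix singular and the update undefined.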
KLT Tracker
For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
Slide credit: Patrick Perez
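The three cases above can be checked numerically. This is a small sketch, assuming the usual criterion of looking at the smaller eigenvalue of the 2x2 matrix of summed gradient products over the patch; the function name and the toy patches are made up for the illustration.

```python
import math


def min_eigenvalue(patch):
    """Smaller eigenvalue of the 2x2 gradient matrix summed over the patch interior."""
    h, w = len(patch), len(patch[0])
    sxx = sxy = syy = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = 0.5 * (patch[y][x + 1] - patch[y][x - 1])  # central differences
            gy = 0.5 * (patch[y + 1][x] - patch[y - 1][x])
            sxx += gx * gx
            sxy += gx * gy
            syy += gy * gy
    trace = sxx + syy
    det = sxx * syy - sxy * sxy
    return 0.5 * (trace - math.sqrt(max(trace * trace - 4.0 * det, 0.0)))


# Toy 8x8 patches: flat, vertical step edge, and a corner.
uniform = [[1.0] * 8 for _ in range(8)]
edge = [[1.0 if x >= 4 else 0.0 for x in range(8)] for _ in range(8)]
corner = [[1.0 if x >= 4 and y >= 4 else 0.0 for x in range(8)] for y in range(8)]
```

For the uniform patch and for the edge the smaller eigenvalue is zero, so the displacement cannot be recovered (fully, or along the edge); only the corner gives a strictly positive value.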
Aperture Problem Example
Image from Gary Bradski slides
If we look through a small hole…
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
– Arbitrary displacement
• Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
– Translation is usually sufficient for small fragments, but:
• Perspective transforms and occlusions cause drift and loss
– Two complementary options:
• Kill tracklets when the minimum SSD is too large
• Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
– fast
– easily extended to image-to-image transformations with multiple parameters
What about deep learning or CNN for tracking?
AlexNet (original): Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, 2012
CaffeNet: Jia et al., “Caffe: Convolutional Architecture for Fast Feature Embedding”, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016
Recap: CaffeNet architecture
Recent trackers development
https://github.com/foolwood/benchmark_results
CNN, KCF
Discriminative Tracking (T. by Detection)
t = 0: samples with labels +1 +1 +1 −1 −1 −1 → Classifier
t > 0: … classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
“Cross-correlation is equivalent to an element-wise product in Fourier domain”
$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$
• where:
– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
– $\times$ is the element-wise product
– $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
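The identity can be checked on a short 1-D signal. The sketch below uses a naive O(n²) DFT from the standard library purely to stay self-contained (a real tracker would call an FFT), and assumes the convention y[k] = Σ_t x[t] · w[(t + k) mod n] for the cyclic cross-correlation; the function names are made up for the example.

```python
import cmath


def dft(v):
    """Naive discrete Fourier transform of a sequence."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]


def cross_correlation_direct(x, w):
    """Cyclic cross-correlation computed in the signal domain."""
    n = len(x)
    return [sum(x[t] * w[(t + k) % n] for t in range(n)) for k in range(n)]


def cross_correlation_fourier(x, w):
    """Same result via conj(DFT(x)) * DFT(w), element-wise, then inverse DFT."""
    xf, wf = dft(x), dft(w)
    product = [xf[f].conjugate() * wf[f] for f in range(len(x))]
    return [value.real for value in idft(product)]
```

Both functions wrap around at the ends of the signal, which is the cyclic behaviour noted in the last bullet.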
Kernelized Correlation Filters
• Circulant matrices are a very general tool which allows replacing standard operations with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression:
$$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$$
with $K$ the kernel matrix, $K_{ij} = \kappa(x_i, x_j)$, and the dual space representation
• For many kernels, circulant data $\Rightarrow$ circulant $K$ matrix:
$$K = C(\mathbf{k}^{\mathbf{xx}})$$
where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields:
$$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$$
⇒ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
$$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$$
• For the Gaussian kernel it yields:
$$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$$
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features: multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• Use HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)
Training and detection (Matlab):
  function alphaf = train(x, y, sigma, lambda)
    k = kernel_correlation(x, x, sigma);
    alphaf = fft2(y) ./ (fft2(k) + lambda);
  end
  function y = detect(alphaf, x, z, sigma)
    k = kernel_correlation(z, x, sigma);
    y = real(ifft2(alphaf .* fft2(k)));
  end
  function k = kernel_correlation(x1, x2, sigma)
    % sum over the channel dimension (3) in the kernel computation
    c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
    d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
    k = exp(-1 / sigma^2 * abs(d) / numel(d));
  end
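To make the training and detection steps concrete without Matlab, here is a minimal single-channel 1-D sketch in Python that mirrors the three functions above. It is an illustration under stated assumptions, not the authors' implementation: the DFT is a naive O(n²) one from the standard library, and `estimate_shift` is a hypothetical helper added for the example.

```python
import cmath
import math


def dft(v):
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x1 and every cyclic shift of x2 (single channel)."""
    n = len(x1)
    x1f, x2f = dft(x1), dft(x2)
    c = [v.real for v in idft([x1f[f].conjugate() * x2f[f] for f in range(n)])]
    energy = sum(a * a for a in x1) + sum(b * b for b in x2)
    # Squared distance to each shift, normalised by the number of elements.
    return [math.exp(-max(energy - 2.0 * ci, 0.0) / (sigma ** 2 * n)) for ci in c]


def train(x, y, sigma, lam):
    """Kernel ridge regression solved per frequency: alphaf = yf / (kf + lambda)."""
    kf = dft(kernel_correlation(x, x, sigma))
    yf = dft(y)
    return [yf[f] / (kf[f] + lam) for f in range(len(x))]


def detect(alphaf, x, z, sigma):
    """Response of the learned filter for all cyclic shifts of the new window z."""
    kf = dft(kernel_correlation(z, x, sigma))
    return [v.real for v in idft([alphaf[f] * kf[f] for f in range(len(x))])]


def estimate_shift(alphaf, x, z, sigma):
    """Shift s such that z[t] = x[t - s]. With the argument order used in detect,
    the response peak appears at index (-s) mod n, hence the negation."""
    response = detect(alphaf, x, z, sigma)
    n = len(response)
    peak = max(range(n), key=lambda i: response[i])
    return (-peak) % n
```

A possible usage: build a signal `x`, a Gaussian-shaped regression target `y` peaked at index 0, call `alphaf = train(x, y, 0.5, 1e-4)`, then pass a cyclically shifted copy of `x` as `z` to `estimate_shift`. The sign convention of the peak depends on which argument is conjugated in `kernel_correlation`, so it is worth checking against a known shift before relying on it.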
From KCF to Discriminative CF trackers
Basic
• Henriques et al. – CSK
– raw grayscale pixel values as features
• Henriques et al. – KCF
– HoG multi-channel features
Further work
• Danelljan et al. – DSST
– PCA-HoG + grayscale pixel features
– filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
– HoG, color-naming and grayscale pixel features
– quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. – SRDCF
– spatial regularization in the learning process
→ limits the boundary effect
→ penalizes filter coefficients depending on their spatial location
– allows the use of a much larger search region
– more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
– features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
– for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. – MDNet, T-CNN
– CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? “The zero-order tracker”
VOT community evolution
[Figure: timeline with axis ticks at 1500 and 3000 and the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked]
• ICCV2013: 51 coauthors, 14 pages
• ECCV2014: 57 coauthors, 27 pages
• ICCV2015: 128 coauthors, 24 pages
• ECCV2016: 141 coauthors, 44 pages
• + VOT-TIR papers (69 and 70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
          Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013   ranks, A, R        16, s. manual    manual       per frame   27
VOT2014   ranks, A, R, EFO   25, s. manual    manual       per frame   38
VOT2015   EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016   EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR
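In the table, A and R stand for accuracy and robustness. As a rough illustration of what these measure, the sketch below computes bounding-box overlap (intersection over union), averages it over the frames where the tracker still overlaps the target, and counts frames with zero overlap as failures. This is a simplified assumption for the example: it omits the re-initialization after a failure and the rank/EAO aggregation, and the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Average overlap on frames with non-zero overlap, and number of zero-overlap frames."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures
```

Keeping the two numbers separate is the point: a tracker can be accurate while it holds the target yet lose it often, or rarely fail but sit loosely on the object.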
Class of trackers tested
• Single-object, single-camera
• Short-term:
– Trackers performing without re-detection
• Causality:
– Tracker is not allowed to use any future frames
• No prior knowledge about the target:
– Only a single training example – BBox in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
ALOV (315 seq) [Smeulders et al. 2013]
Other Tracking Problems
helliphellip multiple object tracking hellip
32150
Multi-object Tracking
Other Tracking Problems
Cell division
httpwwwyoutubecomwatchv=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
httpwwwyoutubecomwatchv=YFKA647w4Jg
splitting and merging events hellip
34150
So I want to track
35150
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
What is here in libraries OpenCV
bull KLT tracker (1981)
bull CAMshift and Meanshift (1998)
bull TLD (2011)
bull MedianFLow (2010)
bull Boosting (2006)
bull MIL (2009)
and KCF (2012) in opencv_contrib38150
What is here in libraries BoofCV
bull SparseFlow (KLT tracker) (1991)
bull MeanShift (1998)
bull TLD (2011)
bull KCF (2012)
39150
httpsgithubcomlessthanoptimalBoofCV
Computer vision librar
Bad news OpenCV
40150httpwwwvotchallengenet
Reference implementation
41150
But authors often publish their own implementation on githubhellip
git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit
Good news
42150
Good news Not so
43150
Good news Not so
44150
Good newsEasy to contribute to open source
Just implement some modern tracking algorithm
45150
So we need to understand how tracking works
Classic
KLT tracker
46150
The KLT Tracker
47150Slide credit Kris Kitani
KLT Tracker
slide creditTomas Svoboda
48150
Importance in Computer Vision
bull Firstly published in 1981 as an image registration method [3]
bull Improved many times most importantly by Carlo Tomasi [54]
bull Free implementaton(s) available1
bull After more than two decades a project2 at CMU dedicated to this
bull single algorithm and results published in a premium journal [1]
bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html
Image alignment
slide creditTomas Svoboda 49150
Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image
Original Lucas-Kanade algorithm I
slide creditTomas Svoboda 50150
Original Lucas-Kanade algorithm II
slide creditTomas Svoboda 51150
Original Lucas-Kanade algorithm III
slide creditTomas Svoboda 52150
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
bull cost function sum of squared intensity differences
between template and window
bull optimization technique gradient descent
bull model learning no update last frame convex
combination
bull attractive properties
ndashfast
ndasheasily extended to image-to-image transformations with
multiple parameters
59150
What about deep
learning or
CNN for tracking
60150
61150
AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo
Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016
Recap CaffeNet architecture
63
Recent trackers development
74150httpsgithubcomfoolwoodbenchmark_results
CNN KCF
Discriminative Tracking (T by Detection)
t=0
samples
labels+1 +1 +1 minus1 minus1 minus1
Classifier
tgt0
hellipClassify subwindows to find target
75
Connection to Correlation
The Convolution Theorem
ldquoCross-correlation is equivalent to an
element-wise product in Fourier domainrdquo
bull where
ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858
(likewise for ො119857 and ෝ119856)
ndashtimes is element-wise product
ndash lowast is complex-conjugate (ie negate imaginary part)
119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺
bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)
76
Kernelized Correlation Filters
bull Circulant matrices are a very general tool which allows to replace
standard operations with fast Fourier operations
bull The same idea can by applied eg to the Kernel Ridge Regression
with K kernel matrix Kij = (xi xj) and dual space representation
bull For many kernels circulant data circulant K matrix
bull Diagonalizing with the DFT for learning the classifier yields
120630 = 119870 + 120582119868 minus1119858
119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)
ෝ120630 =ො119858
መ119844119857119857+ 120582
Fast solution in 119978 119899 log 119899 Typical kernel algorithms are
119978 1198992 or higher
rArr
77
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
bull very few
hyperparameters
bull code fits on one slide
of the presentation
bull Use HoG features
(32 channels)
bull ~300 FPS
bull Open-Source
(MatlabPythonJavaC)
function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)
end
function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))
end
function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))
end
Training and detection (Matlab)
Sum over channel dimensionin kernel computation
79
From KCF to Discriminative CF trackers
Basic
bull Henriques et al ndash CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
bull Danelljan et al ndashSRDCF
ndash spatial regularization in the learning process
rarr limits boundary effect
rarr penalize filter coefficients depending on their spatial location
ndash allows to use much larger search region
ndash more discriminative to background (more training data)
CNN-based Correlation Trackers
bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT
2016)
bull Ma et al
ndash features VGG-Net pretrained on ImageNet dataset extracted from
third fourth and fifth convolution layer
ndash for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
bull Nam et al ndash MDNet T-CNN
ndash CNN classification (3 convolution layers and 2 fully connected layers)
81
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT challenge evolution
bull Gradual increase of dataset size
bull Gradual refinement of dataset construction
bull Gradual refinement of performance measures
bull Gradual increase of tested trackers
Perf Measures Dataset size Target box Property Trackers tested
VOT2013 ranks A R 16 s manual manual per frame 27
VOT2014 ranks A R EFO 25 s manual manual per frame 38
VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR
VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR
8739
Class of trackers tested
bull Single-object single-camera
bull Short-term
ndashTrackers performing without re-detection
bull Causality
ndashTracker is not allowed to use any future frames
bull No prior knowledge about the target
ndashOnly a single training example ndash BBox in the first frame
bull Object state encoded by a bounding box
88150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (13) Sequence candidates
ALOV (315 seq)[Smeulders et al2013]
Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences
Example Poorly defined target Example Artificially created
356 sequences
PTR (~50 seq)[Vojir et al2013]
+OTB (~50 seq)
[Wu et al2013]
+
gt30 new sequencesfrom VOT2015
committee
+
443sequences
VOT Automatic Dataset Construction Protocol
cluster + sample
8939
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (23) Clustering
bull Approximately annotate targets
bull 11 global attributes estimated
automatically for 356 sequences (eg blur camera motion object motion)
bull Cluster into K = 28 groups (automatic selection of K)
Feature encoding
11 dim
Affinity Propagation [Frey Dueck 2007]
Cluster similar sequences
9039
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (33) Sampling
bull Requirement
bull Diverse visual attributes
bull Challenging subset
bull Global visual attributes computed
bull Tracking difficulty attribute Applied FoT ASMS KCF trackers
bull Developed a sampling strategy that sampled
challenging sequences while keeping the global
attributes diverse
9139
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT201516 dataset 60 sequences
9239
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Object annotation
bull Automatic bounding box placement
1 Segment the target (semi-automatic)
2 Automatically fit a bounding box by optimizing a cost function
9339
Kristan et al VOT2016 results
Sequence ranking
bull Among the most challenging sequences
bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)
9442
Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)
Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT Results Realtime
bull Flow-based Mean Shift-based Correlation filters
bull Engineering use of basic features
2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)
2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)
ASMSBDF
FoT
2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)
PLTFoT
CCMS
9639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Results
bull C-COT slightly ahead of TCNN
bull Most accurate SSAT
bull Most robust C-COT and MLDF
Overlap curves
9742
(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple
C-COT
TCNNSSAT
MLDF
AR-raw plot
Acc
ura
cy
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Tracking speed
bull Top-performers slowest bull Plausible cause CNN
bull Real-time bound Staple+
bull Decent accuracy
bull Decent robustness
Note the speed in some Matlab trackers has been significantly underestimated by the toolkit since it was measuring also the Matlabrestart time The EFOs of Matlab trackers are in fact higher than stated in this figure
9842
C-COT TCNNSSAT MLDF
Staple+
Staple+
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT public resources
bull Resources publicly available VOT page
bull Raw results of all tested trackers
bull Relevant methodology papers
bull 2016 Submitted trackers codebinaries
bull All fully annotated datasets (2013-2016)
bull Documentation tutorials forum
9939
Summary
bull ldquoVisual Trackingrdquo may refer to quite different problems
bull The area is just starting to be affected by CNNs
bull Robustness at all levels is the road to reliable performance
bull Key components of trackers
ndash target learning (modelling ldquotemplate updaterdquo)
ndash integration of detection and temporal smoothness assumptions
ndash representation of the image and target
bull Be careful when evaluating tracking results
100150
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague
established 1707
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Multi-object Tracking
Other Tracking Problems
Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg
splitting and merging events …
So I want to track
Just link to some computer vision lib
import cv2
or
#include <opencv2/core.hpp>
or
import org.opencv.core.Core;
import org.opencv.core.Mat;
What is here in libraries: OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
• and KCF (2012) in opencv_contrib
What is here in libraries: BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV
Computer vision librar
Bad news: OpenCV
http://www.votchallenge.net
Reference implementation
But authors often publish their own implementation on GitHub…
git clone https://github.com/martin-danelljan/Continuous-ConvOp.git
Good news
Good news? Not so
Good news: easy to contribute to open source
Just implement some modern tracking algorithm
So we need to understand how tracking works
Classic: KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU was dedicated to this single algorithm, and the results were published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt/
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) may also be a small subwindow within an image.
Slide credit: Tomas Svoboda
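To make the alignment goal concrete, here is a minimal sketch of translation-only Lucas-Kanade as Gauss-Newton iterations on the sum of squared differences between the warped image and the template. The synthetic image `scene`, the function name `lucas_kanade_translation` and the choice of an analytically defined image (so that no interpolation is needed) are assumptions made for illustration, not taken from the slides.

```python
import numpy as np

def scene(x, y):
    """Smooth synthetic image, defined analytically so it can be sampled at sub-pixel positions."""
    return np.sin(0.2 * x) + np.cos(0.15 * y) + 0.5 * np.sin(0.1 * x + 0.12 * y)

def lucas_kanade_translation(template, sample_image, xs, ys, n_iters=20):
    """Estimate p = (dx, dy) minimising sum over x of [I(x + p) - T(x)]^2."""
    p = np.zeros(2)
    inner = (slice(1, -1), slice(1, -1))  # skip borders, where np.gradient is one-sided
    for _ in range(n_iters):
        warped = sample_image(xs + p[0], ys + p[1])      # I(W(x; p))
        gy, gx = np.gradient(warped)                     # image gradients (axis 0 is y)
        J = np.stack([gx[inner].ravel(), gy[inner].ravel()], axis=1)
        r = (warped - template)[inner].ravel()           # residual I(W(x; p)) - T(x)
        delta = -np.linalg.solve(J.T @ J, J.T @ r)       # Gauss-Newton update
        p += delta
        if np.linalg.norm(delta) < 1e-8:
            break
    return p

ys, xs = np.mgrid[0:48, 0:48].astype(float)
true_shift = np.array([1.5, -1.0])
template = scene(xs, ys)
frame = lambda x, y: scene(x - true_shift[0], y - true_shift[1])  # template content moved by true_shift
estimate = lucas_kanade_translation(template, frame, xs, ys)
```

The 2x2 matrix `J.T @ J` is the same matrix whose conditioning decides whether a patch can be tracked at all.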
Original Lucas-Kanade algorithm I
Slide credit: Tomas Svoboda
Original Lucas-Kanade algorithm II
Slide credit: Tomas Svoboda
Original Lucas-Kanade algorithm III
Slide credit: Tomas Svoboda
Original Lucas-Kanade algorithm IV
Slide credit: Tomas Svoboda
Original Lucas-Kanade summary
Slide credit: Tomas Svoboda
KLT Tracker
For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
Slide credit: Patrick Perez
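The three cases above can be checked numerically through the eigenvalues of the 2x2 gradient matrix summed over the patch. The sketch below is an illustration under assumed toy patches (16x16, step edges); the helper name `structure_eigenvalues` is hypothetical.

```python
import numpy as np

def structure_eigenvalues(patch):
    """Eigenvalues (ascending) of G = sum over the patch of [gx, gy]^T [gx, gy]."""
    gy, gx = np.gradient(np.asarray(patch, dtype=float))
    G = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return np.linalg.eigvalsh(G)

uniform = np.ones((16, 16))                        # no gradient at all
edge = np.zeros((16, 16)); edge[:, 8:] = 1.0       # vertical contour: gradient in x only
corner = np.zeros((16, 16)); corner[8:, 8:] = 1.0  # gradients in both directions

eig_uniform = structure_eigenvalues(uniform)  # both eigenvalues zero: no information
eig_edge = structure_eigenvalues(edge)        # one zero eigenvalue: aperture problem
eig_corner = structure_eigenvalues(corner)    # both clearly positive: well conditioned
```

Thresholding the smaller eigenvalue is the usual way to pick patches that are worth tracking.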
Aperture Problem Example
Image from Gary Bradski's slides
If we look through a small hole…
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
– Arbitrary displacement
• Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
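A rough sketch of the coarse-to-fine schedule follows. It assumes a box filter in place of proper Gaussian smoothing, and a user-supplied `refine` callback (hypothetical) that performs the per-level Lucas-Kanade update; only the pyramid construction and the doubling of the displacement between levels are shown.

```python
import numpy as np

def downsample(image):
    """Halve the resolution by averaging 2x2 blocks (a box filter stands in for Gaussian smoothing)."""
    h, w = image.shape
    img = image[: h // 2 * 2, : w // 2 * 2]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2])

def build_pyramid(image, levels):
    """Return [full resolution, 1/2, 1/4, ...] with `levels` entries."""
    pyramid = [np.asarray(image, dtype=float)]
    for _ in range(levels - 1):
        pyramid.append(downsample(pyramid[-1]))
    return pyramid

def coarse_to_fine(pyramid_a, pyramid_b, refine):
    """Estimate a displacement from the coarsest level down to full resolution.

    `refine(level_a, level_b, p)` is a placeholder for one per-level alignment step.
    """
    p = np.zeros(2)
    for level in reversed(range(len(pyramid_a))):
        p = refine(pyramid_a[level], pyramid_b[level], p)
        if level > 0:
            p = 2.0 * p  # one pixel at this level is two pixels at the next finer level
    return p
```

A displacement of d pixels at full resolution is only d / 2^L pixels at level L, which keeps each per-level update within the small-motion assumption.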
Monitoring quality
– Translation is usually sufficient for small fragments, but:
• Perspective transforms and occlusions cause drift and loss
– Two complementary options:
• Kill tracklets when the minimum SSD is too large
• Also compare with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
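The first option can be sketched in a few lines. The per-pixel normalisation and the threshold value `max_mean_ssd` are assumptions chosen for illustration; in practice the threshold is tuned to the image range and patch size.

```python
import numpy as np

def ssd(template, patch):
    """Sum of squared intensity differences between template and tracked patch."""
    diff = np.asarray(patch, dtype=float) - np.asarray(template, dtype=float)
    return float(np.sum(diff ** 2))

def keep_track(template, patch, max_mean_ssd=0.01):
    """Return False (kill the tracklet) when the per-pixel SSD exceeds the threshold."""
    return ssd(template, patch) / np.asarray(template).size <= max_mean_ssd
```

The second option uses the same test, but against the initial patch warped with the estimated affine transform rather than against the previous frame.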
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  – fast
  – easily extended to image-to-image transformations with multiple parameters
What about deep learning or CNNs for tracking?
AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012
CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016
Recap: CaffeNet architecture
Recent trackers development
https://github.com/foolwood/benchmark_results
Figure labels: CNN, KCF
Discriminative Tracking (Tracking by Detection)
t = 0: samples with labels (+1 +1 +1 −1 −1 −1) → Classifier
t > 0: … classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
"Cross-correlation is equivalent to an element-wise product in the Fourier domain"
$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$
• where
  – $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  – $\times$ is the element-wise product
  – $*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
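The theorem is easy to verify numerically in one dimension. The sketch below compares a direct cyclic cross-correlation with the Fourier-domain product; the function names are chosen here for illustration, and real-valued signals are assumed.

```python
import numpy as np

def circular_cross_correlation(x, w):
    """Direct O(n^2) cyclic cross-correlation: y[k] = sum_i x[i] * w[(i + k) mod n]."""
    n = len(x)
    return np.array([sum(x[i] * w[(i + k) % n] for i in range(n)) for k in range(n)])

def fft_cross_correlation(x, w):
    """Same result in O(n log n): y_hat = conj(x_hat) * w_hat, element-wise."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w = rng.standard_normal(8)
direct = circular_cross_correlation(x, w)
fast = fft_cross_correlation(x, w)
```

The index `(i + k) % n` is where the cyclic wrap-around at the window edges comes from.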
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied, e.g., to Kernel Ridge Regression with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual-space representation $\boldsymbol{\alpha}$
• For many kernels: circulant data ⇒ circulant K matrix
• Diagonalizing with the DFT for learning the classifier yields
$$\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y} \quad\Rightarrow\quad \hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$$
where $K = C(\mathbf{k}^{\mathbf{xx}})$ and $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher
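As a sanity check of the diagonalization, the sketch below solves the same kernel ridge regression twice for a 1-D signal: once with the dense n x n kernel matrix built from all cyclic shifts, once with the element-wise Fourier formula. The Gaussian kernel without extra normalisation, the parameter values and all function names are assumptions for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

def dense_kernel_ridge(x, y, sigma, lam):
    """alpha = (K + lam I)^-1 y with K_ij = kappa(P^i x, P^j x), solved in O(n^3)."""
    n = len(x)
    shifts = [np.roll(x, i) for i in range(n)]
    K = np.array([[gaussian_kernel(shifts[i], shifts[j], sigma) for j in range(n)]
                  for i in range(n)])
    return np.linalg.solve(K + lam * np.eye(n), y)

def fourier_kernel_ridge(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat + lam), where k is the first row of the circulant K."""
    n = len(x)
    k = np.array([gaussian_kernel(x, np.roll(x, j), sigma) for j in range(n)])
    alpha_hat = np.fft.fft(y) / (np.fft.fft(k) + lam)
    return np.real(np.fft.ifft(alpha_hat))

rng = np.random.default_rng(1)
x = rng.standard_normal(16)
offsets = np.minimum(np.arange(16), 16 - np.arange(16))
y = np.exp(-offsets ** 2 / 8.0)           # regression target peaked at shift 0
alpha_dense = dense_kernel_ridge(x, y, sigma=3.0, lam=1e-2)
alpha_fourier = fourier_kernel_ridge(x, y, sigma=3.0, lam=1e-2)
```

Only the first row of K is ever formed in the Fourier version, which is where the memory and time savings come from.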
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
$$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$$
• For the Gaussian kernel it yields
$$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$$
  – multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the last term
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features
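The Gaussian kernel correlation formula can be checked against its definition. In this 1-D sketch, entry i of the fast version is compared with the kernel evaluated between x' and x cyclically shifted by i samples (zero-based, using `np.roll`); the function names and test signals are assumptions for illustration.

```python
import numpy as np

def kernel_correlation_fft(x, xp, sigma):
    """Gaussian kernel correlation of x and x' for all cyclic shifts, in O(n log n)."""
    c = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(xp)))
    d = np.dot(x, x) + np.dot(xp, xp) - 2.0 * c   # squared distances for every shift
    return np.exp(-np.maximum(d, 0.0) / sigma ** 2)  # clamp tiny negative rounding errors

def kernel_correlation_naive(x, xp, sigma):
    """Definition: k[i] = kappa(x', P^i x), evaluated shift by shift in O(n^2)."""
    return np.array([np.exp(-np.sum((np.roll(x, i) - xp) ** 2) / sigma ** 2)
                     for i in range(len(x))])

rng = np.random.default_rng(2)
x = rng.standard_normal(12)
xp = rng.standard_normal(12)
k_fast = kernel_correlation_fft(x, xp, sigma=2.0)
k_slow = kernel_correlation_naive(x, xp, sigma=2.0)
```

For multi-channel features the same code applies per channel, with the product `conj(x_hat) * xp_hat` summed over channels before the inverse transform.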
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• Open-source (Matlab/Python/Java/C)
Training and detection (Matlab)
% Train: kernel ridge regression solved element-wise in the Fourier domain
function alphaf = train(x, y, sigma, lambda)
    k = kernel_correlation(x, x, sigma);
    alphaf = fft2(y) ./ (fft2(k) + lambda);
end
% Detect: response map over all cyclic shifts of the search patch z
function y = detect(alphaf, x, z, sigma)
    k = kernel_correlation(z, x, sigma);
    y = real(ifft2(alphaf .* fft2(k)));
end
% Gaussian kernel correlation; sum(..., 3) is the sum over the channel dimension
function k = kernel_correlation(x1, x2, sigma)
    c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
    d = x1(:)' * x1(:) + x2(:)' * x2(:) - 2 * c;
    k = exp(-1 / sigma^2 * abs(d) / numel(d));
end
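For readers without Matlab, here is a NumPy sketch of the same three functions, followed by a small synthetic check. The random multi-channel "features", the label width and the parameter values are assumptions for illustration. With the conjugate placed on the first argument, as in the code above, a cyclic shift of +s applied to the search patch shows up as a response peak at -s (modulo the patch size).

```python
import numpy as np

def fft2(a):
    return np.fft.fft2(a, axes=(0, 1))   # transform the two spatial axes only

def ifft2(a):
    return np.fft.ifft2(a, axes=(0, 1))

def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of two (H, W, C) feature patches."""
    c = np.real(ifft2(np.sum(np.conj(fft2(x1)) * fft2(x2), axis=2)))  # sum over channels
    d = np.sum(x1 ** 2) + np.sum(x2 ** 2) - 2.0 * c
    return np.exp(-np.abs(d) / (sigma ** 2 * d.size))

def train(x, y, sigma, lam):
    k = kernel_correlation(x, x, sigma)
    return fft2(y) / (fft2(k) + lam)

def detect(alphaf, x, z, sigma):
    k = kernel_correlation(z, x, sigma)
    return np.real(ifft2(alphaf * fft2(k)))

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))                 # model patch (random stand-in features)
offsets = np.minimum(np.arange(32), 32 - np.arange(32))
y = np.exp(-(offsets[:, None] ** 2 + offsets[None, :] ** 2) / (2 * 2.0 ** 2))  # peak at (0, 0)
alphaf = train(x, y, sigma=0.5, lam=1e-4)

z = np.roll(x, shift=(3, 5), axis=(0, 1))            # search patch: model shifted cyclically
response = detect(alphaf, x, z, sigma=0.5)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(response), response.shape))
```

A full tracker would additionally extract HoG features, apply a cosine window, and update `alphaf` and the model `x` by linear interpolation after each frame.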
From KCF to Discriminative CF trackers
Basic
• Henriques et al. – CSK
  – raw grayscale pixel values as features
• Henriques et al. – KCF
  – HoG multi-channel features
Further work
• Danelljan et al. – DSST
  – PCA-HoG + grayscale pixel features
  – filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
  – HoG, color-naming and grayscale pixel features
  – quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. – SRDCF
  – spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  – allows the use of a much larger search region
  – more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  – features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  – for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. – MDNet, T-CNN
  – CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? "The zero-order tracker"
VOT community evolution
• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• Plus VOT-TIR papers (69 and 70 coauthors)
Figure: timeline plot (axis ticks at 1500 and 3000) with the VOT2013 to VOT2016 submission deadlines marked
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

          Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013   ranks, A, R        16, s. manual    manual       per frame   27
VOT2014   ranks, A, R, EFO   25, s. manual    manual       per frame   38
VOT2015   EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016   EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR
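The A (accuracy) and R (robustness) measures in the table are built on per-frame bounding-box overlap. The sketch below is a simplified illustration only: it computes the overlap (intersection over union) of axis-aligned boxes, averages it over frames where the target is still covered, and counts zero-overlap frames as failures. The real VOT protocol also re-initialises the tracker after each failure, which is not modelled here; the function names and the (x, y, width, height) box format are assumptions.

```python
def overlap(a, b):
    """Intersection over union of two axis-aligned boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Average overlap over frames with non-zero overlap, plus the number of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures

ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
```

Reporting accuracy and failures separately keeps a tracker that drifts slowly distinguishable from one that is precise but loses the target often.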
Class of trackers tested
• Single-object, single-camera
• Short-term
  – Trackers performing without re-detection
• Causality
  – Tracker is not allowed to use any future frames
• No prior knowledge about the target
  – Only a single training example: BBox in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
Sources: ALOV (315 seq.) [Smeulders et al. 2013] + PTR (~50 seq.) [Vojir et al. 2013] + OTB (~50 seq.) [Wu et al. 2013] + >30 new sequences from the VOT2015 committee = 443 sequences
Filtered out:
• Grayscale sequences
• <400 pixel targets
• Poorly defined targets
• Artificially created sequences
Examples shown: poorly defined target; artificially created sequence
Result: 356 sequences → VOT Automatic Dataset Construction Protocol: cluster + sample
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Pipeline: feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences
Construction (3/3): Sampling
• Requirement:
  – Diverse visual attributes
  – Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
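The slides do not state the cost function, so the sketch below uses a hypothetical one purely to illustrate the idea of step 2: penalise target pixels left outside the box and, with a smaller weight, background pixels enclosed by it. The brute-force search over axis-aligned boxes, the name `fit_box` and the weight `background_weight` are all assumptions, not the VOT procedure.

```python
import numpy as np

def fit_box(mask, background_weight=0.5):
    """Return (y0, x0, y1, x1), half-open, minimising
    missed foreground + background_weight * enclosed background (hypothetical cost)."""
    mask = np.asarray(mask, dtype=float)
    h, w = mask.shape
    integral = np.zeros((h + 1, w + 1))              # summed-area table for O(1) box sums
    integral[1:, 1:] = mask.cumsum(axis=0).cumsum(axis=1)
    total_fg = integral[h, w]
    best_box, best_cost = None, np.inf
    for y0 in range(h):
        for y1 in range(y0 + 1, h + 1):
            for x0 in range(w):
                for x1 in range(x0 + 1, w + 1):
                    fg_inside = (integral[y1, x1] - integral[y0, x1]
                                 - integral[y1, x0] + integral[y0, x0])
                    bg_inside = (y1 - y0) * (x1 - x0) - fg_inside
                    cost = (total_fg - fg_inside) + background_weight * bg_inside
                    if cost < best_cost:
                        best_box, best_cost = (y0, x0, y1, x1), cost
    return best_box

mask = np.zeros((8, 10))
mask[2:5, 3:7] = 1.0     # the segmented target
mask[7, 9] = 1.0         # a stray segmentation pixel
box = fit_box(mask)
```

Unlike the tight box around all foreground pixels, a cost of this kind ignores the stray pixel, because enclosing it would pull in far more background than it saves.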
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Kristan et al., VOT2016 results
So I want to track
35150
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
What is here in libraries OpenCV
bull KLT tracker (1981)
bull CAMshift and Meanshift (1998)
bull TLD (2011)
bull MedianFLow (2010)
bull Boosting (2006)
bull MIL (2009)
and KCF (2012) in opencv_contrib38150
What is here in libraries BoofCV
bull SparseFlow (KLT tracker) (1991)
bull MeanShift (1998)
bull TLD (2011)
bull KCF (2012)
39150
httpsgithubcomlessthanoptimalBoofCV
Computer vision librar
Bad news OpenCV
40150httpwwwvotchallengenet
Reference implementation
41150
But authors often publish their own implementation on githubhellip
git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit
Good news
42150
Good news Not so
43150
Good news Not so
44150
Good newsEasy to contribute to open source
Just implement some modern tracking algorithm
45150
So we need to understand how tracking works
Classic
KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU was dedicated to this single algorithm, and the results were published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
Slide credit: Tomas Svoboda
The goal is to align a template image $T(\mathbf{x})$ to an input image $I(\mathbf{x})$, where $\mathbf{x}$ is a column vector containing the image coordinates $[x, y]^\top$. $I(\mathbf{x})$ could also be a small subwindow within an image.
Original Lucas-Kanade algorithm I
Slide credit: Tomas Svoboda
Original Lucas-Kanade algorithm II
Original Lucas-Kanade algorithm III
Original Lucas-Kanade algorithm IV
Original Lucas-Kanade summary
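The Lucas-Kanade slides survive here only as titles, so the following is a minimal sketch of the translation-only case rather than the algorithm as presented on them. It assumes the warp is a pure shift p, minimises the sum of squared differences between T(x) and I(x + p) with Gauss-Newton steps, and uses an analytic Gaussian blob as the image so that warping needs no interpolation. The names `blob`, `lucas_kanade_translation` and `true_shift` are hypothetical.

```python
import numpy as np

def blob(xs, ys, cx, cy, sigma=4.0):
    """Analytic Gaussian blob, so it can be sampled at non-integer positions."""
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def lucas_kanade_translation(template, sample_image, xs, ys, n_iter=20):
    """Estimate p = (px, py) minimising sum_x [T(x) - I(x + p)]^2."""
    p = np.zeros(2)
    for _ in range(n_iter):
        warped = sample_image(xs + p[0], ys + p[1])   # I(x + p)
        gy, gx = np.gradient(warped)                  # gradients along rows (y) and columns (x)
        error = template - warped                     # T(x) - I(x + p)
        # 2x2 Gauss-Newton normal equations H dp = b
        H = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                      [np.sum(gx * gy), np.sum(gy * gy)]])
        b = np.array([np.sum(gx * error), np.sum(gy * error)])
        dp = np.linalg.solve(H, b)
        p += dp
        if np.linalg.norm(dp) < 1e-8:
            break
    return p

ys, xs = np.mgrid[0:31, 0:31].astype(float)
true_shift = np.array([1.5, -1.0])                    # assumed motion (dx, dy) for the demo
template = blob(xs, ys, 15.0, 15.0)
image = lambda x, y: blob(x, y, 15.0 + true_shift[0], 15.0 + true_shift[1])
estimated_shift = lucas_kanade_translation(template, image, xs, ys)
```

A real tracker would resample pixel data (e.g. bilinearly) instead of evaluating an analytic image, and would run this inside the image pyramid discussed later.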
KLT Tracker
For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]
Slide credit: Patrick Perez
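The three cases above can be checked numerically. The sketch below assumes the usual conditioning test, the eigenvalues of the 2x2 matrix of summed gradient products (the matrix inverted in the Lucas-Kanade step), and applies it to three synthetic 8x8 patches. The helper name `gradient_matrix_eigenvalues` and the patches are illustrative only.

```python
import numpy as np

def gradient_matrix_eigenvalues(patch):
    """Eigenvalues (ascending) of the 2x2 matrix of summed gradient products."""
    gy, gx = np.gradient(patch.astype(float))
    G = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return np.linalg.eigvalsh(G)

uniform = np.ones((8, 8))                           # no texture at all
edge = np.zeros((8, 8)); edge[:, 4:] = 1.0          # vertical step edge
corner = np.zeros((8, 8)); corner[4:, 4:] = 1.0     # bright lower-right quadrant

eig_uniform = gradient_matrix_eigenvalues(uniform)  # both eigenvalues vanish
eig_edge = gradient_matrix_eigenvalues(edge)        # one vanishes: aperture problem
eig_corner = gradient_matrix_eigenvalues(corner)    # both clearly positive
```

Only the corner patch gives a well-conditioned system; on the edge patch the motion along the edge cannot be recovered.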
Aperture Problem Example
Image from Gary Bradski slides
If we look through a small hole…
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
– Arbitrary displacement
• Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
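As a rough illustration of why the pyramid helps with large displacements, the sketch below builds a pyramid by plain 2x2 averaging (an assumption: real implementations usually smooth with a Gaussian filter first) and shows how a displacement shrinks by a factor of two per level. The function names are hypothetical.

```python
import numpy as np

def build_pyramid(image, levels):
    """Return [full-res, half-res, ...] obtained by averaging 2x2 blocks."""
    pyramid = [image.astype(float)]
    for _ in range(levels - 1):
        prev = pyramid[-1]
        h, w = prev.shape[0] // 2 * 2, prev.shape[1] // 2 * 2   # crop to even size
        prev = prev[:h, :w]
        pyramid.append(0.25 * (prev[0::2, 0::2] + prev[1::2, 0::2] +
                               prev[0::2, 1::2] + prev[1::2, 1::2]))
    return pyramid

def displacement_per_level(displacement, levels):
    """Displacement in pixels as seen at each pyramid level (level 0 = full resolution)."""
    return [displacement / 2 ** level for level in range(levels)]

pyramid = build_pyramid(np.zeros((64, 48)), levels=4)
shapes = [level.shape for level in pyramid]
per_level = displacement_per_level(16.0, levels=4)
```

A 16-pixel motion is only 2 pixels at the coarsest of four levels, which is within reach of the linearised step. The estimate is then doubled and refined at each finer level.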
Monitoring quality
– Translation is usually sufficient for small fragments, but
• Perspective transforms and occlusions cause drift and loss
– Two complementary options
• Kill tracklets when the minimum SSD is too large
• Also compare with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
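The two checks can be sketched as follows. This is a simplified illustration: the comparison with the initial patch is done without the affine warp mentioned above, and `max_ssd` is a hypothetical threshold that would have to be tuned per application.

```python
import numpy as np

def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches."""
    diff = patch_a.astype(float) - patch_b.astype(float)
    return float(np.sum(diff * diff))

def keep_tracklet(initial_patch, previous_patch, current_patch, max_ssd):
    """Keep the tracklet only if the current patch still matches both references."""
    frame_to_frame_ok = ssd(previous_patch, current_patch) <= max_ssd
    initial_ok = ssd(initial_patch, current_patch) <= max_ssd   # no affine warp in this sketch
    return frame_to_frame_ok and initial_ok

initial = np.zeros((5, 5))
previous = initial + 0.1
slow_change = initial + 0.2      # small appearance change: keep
occluded = initial + 1.0         # large change with respect to both references: kill
kept = keep_tracklet(initial, previous, slow_change, max_ssd=2.0)
killed = not keep_tracklet(initial, previous, occluded, max_ssd=2.0)
```

The comparison against the initial patch is what catches slow drift, which the frame-to-frame check alone would accept.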
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
– fast
– easily extended to image-to-image transformations with multiple parameters
What about deep learning or CNNs for tracking?
AlexNet (original): Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, 2012
CaffeNet: Jia et al., “Caffe: Convolutional Architecture for Fast Feature Embedding”, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016
Recap: CaffeNet architecture
Recent trackers development
https://github.com/foolwood/benchmark_results
CNN, KCF
Discriminative Tracking (T. by Detection)
t = 0: samples with labels +1 +1 +1 −1 −1 −1 → Classifier
t > 0: … classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
“Cross-correlation is equivalent to an element-wise product in the Fourier domain”
$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$
• where
– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
– $\times$ is the element-wise product
– $*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
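The identity can be verified numerically in one dimension. The sketch assumes the cyclic cross-correlation is defined as y[k] = sum over n of x[n] w[(n + k) mod N], which is the convention that matches conjugating the transform of x under NumPy's DFT. The function names are illustrative.

```python
import numpy as np

def cyclic_cross_correlation_direct(x, w):
    """y[k] = sum_n x[n] * w[(n + k) mod N], computed in O(N^2)."""
    n = len(x)
    return np.array([np.dot(x, np.roll(w, -k)) for k in range(n)])

def cyclic_cross_correlation_fft(x, w):
    """Same quantity via the DFT: y_hat = conj(x_hat) * w_hat, in O(N log N)."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
w = rng.standard_normal(16)
direct = cyclic_cross_correlation_direct(x, w)
via_fft = cyclic_cross_correlation_fft(x, w)
```

Both routes agree to floating-point precision. The wrap-around in `np.roll` is exactly the cyclic behaviour noted above.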
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied, e.g., to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and the dual-space representation
$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$
• For many kernels, circulant data ⇒ circulant K matrix:
$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$ (element-wise division)
Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher
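A small 1-D check of the closed form: the sketch builds the full kernel matrix from all cyclic shifts of a random signal, solves the ridge system directly, and compares with the Fourier-domain division. The Gaussian kernel without any extra normalisation, the random targets `y`, and the parameter values are assumptions made for the demo.

```python
import numpy as np

def gaussian_kernel(u, v, sigma):
    return np.exp(-np.sum((u - v) ** 2) / sigma ** 2)

rng = np.random.default_rng(1)
n, sigma, lam = 16, 2.0, 0.1
x = rng.standard_normal(n)
y = rng.standard_normal(n)                        # arbitrary regression targets

# Training set = all cyclic shifts of x, so K is circulant.
shifts = np.array([np.roll(x, i) for i in range(n)])
K = np.array([[gaussian_kernel(shifts[i], shifts[j], sigma) for j in range(n)]
              for i in range(n)])

alpha_direct = np.linalg.solve(K + lam * np.eye(n), y)       # dense solve, O(n^3)
k_xx = K[0]                                                  # first row of K
alpha_fft = np.real(np.fft.ifft(np.fft.fft(y) / (np.fft.fft(k_xx) + lam)))
```

In practice the matrix `K` is never formed. Only its first row is needed, and that can itself be computed with FFTs.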
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{x}\mathbf{x}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$, with elements
$k_i^{\mathbf{x}\mathbf{x}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$, where $P$ is the cyclic shift operator
• For the Gaussian kernel it yields
$\mathbf{k}^{\mathbf{x}\mathbf{x}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$
Multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the last term
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{x}\mathbf{z}})$
2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{x}\mathbf{z}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features
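The Gaussian kernel correlation formula can be checked against a brute-force evaluation over all cyclic shifts. This is a 1-D, single-channel sketch. The direction of the shift (rolling the second argument by −i) is an assumption chosen to match NumPy's DFT sign convention, and `gaussian_kernel_correlation` is a hypothetical name.

```python
import numpy as np

def gaussian_kernel_correlation(x, x2, sigma):
    """Gaussian kernel between x and every cyclic shift of x2, via the FFT."""
    cross = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(x2)))
    sq_dist = np.dot(x, x) + np.dot(x2, x2) - 2.0 * cross
    return np.exp(-np.maximum(sq_dist, 0.0) / sigma ** 2)   # clamp tiny negative round-off

rng = np.random.default_rng(2)
n, sigma = 16, 3.0
x = rng.standard_normal(n)
x2 = rng.standard_normal(n)

k_fft = gaussian_kernel_correlation(x, x2, sigma)
# Brute force: one explicit kernel evaluation per cyclic shift.
k_brute = np.array([np.exp(-np.sum((x - np.roll(x2, -i)) ** 2) / sigma ** 2)
                    for i in range(n)])
```

The FFT route costs O(n log n) for all n shifts, whereas the brute-force loop costs O(n^2).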
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)
Training and detection (Matlab)
function alphaf = train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);   % element-wise division in the Fourier domain
end
function y = detect(alphaf, x, z, sigma)
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));       % response map over all cyclic shifts
end
function k = kernel_correlation(x1, x2, sigma)
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));   % sum over the channel dimension
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end
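For readers without Matlab, here is a hedged single-channel NumPy port of the three functions on the slide, followed by a toy experiment: train on a random image and detect on a cyclically shifted copy. The label generator `gaussian_labels`, the random test image and the parameter values are assumptions for the demo, not part of the original code.

```python
import numpy as np

def kernel_correlation(x1, x2, sigma):
    c = np.real(np.fft.ifft2(np.conj(np.fft.fft2(x1)) * np.fft.fft2(x2)))
    d = np.sum(x1 * x1) + np.sum(x2 * x2) - 2.0 * c
    return np.exp(-np.abs(d) / (sigma ** 2 * d.size))

def train(x, y, sigma, lam):
    k = kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alphaf, x, z, sigma):
    k = kernel_correlation(z, x, sigma)
    return np.real(np.fft.ifft2(alphaf * np.fft.fft2(k)))

def gaussian_labels(shape, width=2.0):
    """Regression target peaked at index (0, 0), wrapping cyclically."""
    rows = np.fft.fftfreq(shape[0]) * shape[0]     # 0, 1, ..., -2, -1
    cols = np.fft.fftfreq(shape[1]) * shape[1]
    rr, cc = np.meshgrid(rows, cols, indexing="ij")
    return np.exp(-0.5 * (rr ** 2 + cc ** 2) / width ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32))                  # stand-in for a feature patch
shift = (3, -5)                                    # assumed cyclic motion (rows, cols)
z = np.roll(x, shift, axis=(0, 1))

alphaf = train(x, gaussian_labels(x.shape), sigma=0.5, lam=1e-4)
response = detect(alphaf, x, z, sigma=0.5)
peak = tuple(int(i) for i in np.unravel_index(np.argmax(response), response.shape))
expected_peak = ((-shift[0]) % 32, (-shift[1]) % 32)
```

With this conjugation convention the response peaks at the negated cyclic shift, so a tracker built on the sketch would subtract the peak offset (taken modulo the patch size) from the previous position. A multi-channel version would add the sum over channels inside `kernel_correlation`, as in the Matlab listing.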
From KCF to Discriminative CF trackers
Basic
• Henriques et al. – CSK
– raw grayscale pixel values as features
• Henriques et al. – KCF
– HoG multi-channel features
Further work
• Danelljan et al. – DSST
– PCA-HoG + grayscale pixel features
– filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
– HoG, color-naming and grayscale pixel features
– quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. – SRDCF
– spatial regularization in the learning process
→ limits the boundary effect
→ penalizes filter coefficients depending on their spatial location
– allows a much larger search region to be used
– more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
– features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
– for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. – MDNet, T-CNN
– CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? “The zero-order tracker”
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
• VOT2013 (ICCV 2013): 51 coauthors, 14 pages
• VOT2014 (ECCV 2014): 57 coauthors, 27 pages
• VOT2015 (ICCV 2015): 128 coauthors, 24 pages, + VOT-TIR paper (69 coauthors)
• VOT2016 (ECCV 2016): 141 coauthors, 44 pages, + VOT-TIR paper (70 coauthors)
[Figure: timeline with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked]
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
Challenge | Perf. measures | Dataset size | Target box | Property | Trackers tested
VOT2013 | ranks, A, R | 16, s. manual | manual | per frame | 27
VOT2014 | ranks, A, R, EFO | 25, s. manual | manual | per frame | 38
VOT2015 | EAO, A, R, EFO | 60, fully auto | manual | per frame | 62 VOT, 24 VOT-TIR
VOT2016 | EAO, A, R, EFO | 60, fully auto | auto | per frame | 70 VOT, 24 VOT-TIR
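In the table, A stands for accuracy and R for robustness. In VOT, accuracy is based on the region overlap between the predicted and ground-truth boxes, and robustness on the number of tracking failures. The sketch below shows only the overlap part, under the simplifying assumption of axis-aligned boxes given as (x, y, width, height); the function names are hypothetical and this is not the VOT toolkit code.

```python
def overlap(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def average_overlap(predicted, ground_truth):
    """Mean per-frame overlap over a sequence of predicted and ground-truth boxes."""
    scores = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(scores) / len(scores)

ground_truth = [(0, 0, 10, 10), (0, 0, 10, 10)]
predicted = [(0, 0, 10, 10), (5, 0, 10, 10)]      # second frame is off by half a box
accuracy = average_overlap(predicted, ground_truth)
```

Shifting a box by half its width already drops the overlap to one third, which is why overlap-based accuracy is a fairly strict measure.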
Class of trackers tested
• Single-object, single-camera
• Short-term
– Trackers performing without re-detection
• Causality
– Tracker is not allowed to use any future frames
• No prior knowledge about the target
– Only a single training example – BBox in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
VOT Automatic Dataset Construction Protocol: cluster + sample
ALOV (315 seq.) [Smeulders et al., 2013] + PTR (~50 seq.) [Vojir et al., 2013] + OTB (~50 seq.) [Wu et al., 2013] + >30 new sequences from the VOT2015 committee = 443 sequences
Filtered out:
• Grayscale sequences
• <400 pixels targets
• Poorly-defined targets
• Artificially created sequences
[Examples shown: poorly defined target; artificially created sequence]
→ 356 sequences
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Pipeline: feature encoding (11 dim.) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences
Construction (3/3): Sampling
• Requirement:
– Diverse visual attributes
– Challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
1. Segment the target (semi-automatic)
2. Automatically fit a bounding box by optimizing a cost function
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty – better ground truth:
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (accuracy axis) with C-COT, TCNN, SSAT and MLDF marked]
VOT 2016 Tracking speed
• Top performers are the slowest
• Plausible cause: CNN
• Real-time bound: Staple+
– Decent accuracy
– Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure labels: C-COT, TCNN, SSAT, MLDF, Staple+]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
– target learning (modelling, “template update”)
– integration of detection and temporal smoothness assumptions
– representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague
established 1707
Area of expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: a significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please?
Just link to some computer vision lib
from cv2 import tracker
or
include ltopencv2corehppgt
Or
import orgopencvcoreCore
import orgopencvcoreMat
What is here in libraries OpenCV
bull KLT tracker (1981)
bull CAMshift and Meanshift (1998)
bull TLD (2011)
bull MedianFLow (2010)
bull Boosting (2006)
bull MIL (2009)
and KCF (2012) in opencv_contrib38150
What is here in libraries BoofCV
bull SparseFlow (KLT tracker) (1991)
bull MeanShift (1998)
bull TLD (2011)
bull KCF (2012)
39150
httpsgithubcomlessthanoptimalBoofCV
Computer vision librar
Bad news OpenCV
40150httpwwwvotchallengenet
Reference implementation
41150
But authors often publish their own implementation on githubhellip
git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit
Good news
42150
Good news Not so
43150
Good news Not so
44150
Good newsEasy to contribute to open source
Just implement some modern tracking algorithm
45150
So we need to understand how tracking works
Classic
KLT tracker
46150
The KLT Tracker
47150Slide credit Kris Kitani
KLT Tracker
slide creditTomas Svoboda
48150
Importance in Computer Vision
bull Firstly published in 1981 as an image registration method [3]
bull Improved many times most importantly by Carlo Tomasi [54]
bull Free implementaton(s) available1
bull After more than two decades a project2 at CMU dedicated to this
bull single algorithm and results published in a premium journal [1]
bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html
Image alignment
slide creditTomas Svoboda 49150
Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image
Original Lucas-Kanade algorithm I
slide creditTomas Svoboda 50150
Original Lucas-Kanade algorithm II
slide creditTomas Svoboda 51150
Original Lucas-Kanade algorithm III
slide creditTomas Svoboda 52150
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
bull cost function sum of squared intensity differences
between template and window
bull optimization technique gradient descent
bull model learning no update last frame convex
combination
bull attractive properties
ndashfast
ndasheasily extended to image-to-image transformations with
multiple parameters
59150
What about deep
learning or
CNN for tracking
60150
61150
AlexNet (original): Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, 2012. CaffeNet: Jia et al., “Caffe: Convolutional Architecture for Fast Feature Embedding”, 2014. Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016
Recap: CaffeNet architecture
Recent trackers development
https://github.com/foolwood/benchmark_results
CNN, KCF
Discriminative Tracking (Tracking by Detection)
t=0: samples with labels +1 +1 +1 −1 −1 −1 → Classifier
t>0: … classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
“Cross-correlation is equivalent to an element-wise product in Fourier domain”
$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$
• where
– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
– $\times$ is element-wise product
– $^{*}$ is complex-conjugate (i.e. negate imaginary part)
• Note that cross-correlation, and the DFT, are cyclic (the window wraps at the image edges)
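The theorem is easy to verify on a short 1-D signal. The sketch below assumes the cyclic definition y[i] = sum over t of x[t] * w[(t + i) mod n] and uses a naive DFT written out with `cmath` so that it is self-contained; a real tracker would use an FFT library instead. The signal values are arbitrary.

```python
import cmath

def dft(v, inverse=False):
    """Naive O(n^2) discrete Fourier transform (illustration only)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out

def cyclic_cross_correlation(x, w):
    """Direct evaluation: y[i] = sum_t x[t] * w[(t + i) mod n]."""
    n = len(x)
    return [sum(x[t] * w[(t + i) % n] for t in range(n)) for i in range(n)]

x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]
w = [0.0, 1.0, 0.5, 2.0, -1.0, 1.5]

direct = cyclic_cross_correlation(x, w)

# Fourier route: conjugate of x-hat times w-hat, element-wise, then inverse DFT.
xf, wf = dft(x), dft(w)
via_fourier = [c.real for c in dft([a.conjugate() * b for a, b in zip(xf, wf)], inverse=True)]

print(max(abs(a - b) for a, b in zip(direct, via_fourier)))
```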
Kernelized Correlation Filters
• Circulant matrices are a very general tool which allows replacing standard operations with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual space representation
$\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y}$
• For many kernels, circulant data ⇒ circulant K matrix
$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$
Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
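A small numerical check of the closed-form solution, as a sketch under stated assumptions: a 1-D signal, a linear kernel (so the kernel auto-correlation is the plain cyclic auto-correlation), and a naive DFT standing in for the FFT. The signal, labels and lambda are made-up values.

```python
import cmath

def dft(v, inverse=False):
    """Naive O(n^2) DFT; an FFT would give the O(n log n) cost quoted above."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out

def linear_kernel_autocorrelation(x):
    """k_xx[i] = <x, x cyclically shifted by i> (linear kernel, assumed for simplicity)."""
    n = len(x)
    return [sum(x[t] * x[(t + i) % n] for t in range(n)) for i in range(n)]

def train(x, y, lam):
    """Solve the ridge regression per frequency: alpha_hat = y_hat / (k_hat + lambda)."""
    k = linear_kernel_autocorrelation(x)
    alpha_f = [yf / (kf + lam) for yf, kf in zip(dft(y), dft(k))]
    return k, [a.real for a in dft(alpha_f, inverse=True)]

x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]
y = [1.0, 0.6, 0.1, 0.0, 0.1, 0.6]   # regression target peaked at shift 0
lam = 0.01
k, alpha = train(x, y, lam)

# Build the full circulant kernel matrix only to verify (K + lambda I) alpha = y.
n = len(x)
K = [[k[(j - i) % n] for j in range(n)] for i in range(n)]
residual = max(abs(sum(K[i][j] * alpha[j] for j in range(n)) + lam * alpha[i] - y[i])
               for i in range(n))
print(residual)
```

The n-by-n matrix `K` is never needed in practice; it is built here only to confirm that the element-wise division in the Fourier domain solves the same linear system.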
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx'}}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x'}$
• For the Gaussian kernel it yields
$\mathbf{k}^{\mathbf{xx'}} = \exp\left(-\dfrac{1}{\sigma^{2}}\left(\|\mathbf{x}\|^{2} + \|\mathbf{x'}\|^{2} - 2\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}}\right)$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features
$k_{i}^{\mathbf{xx'}} = \kappa(\mathbf{x'}, P^{i-1}\mathbf{x})$
multiple channels can be concatenated to the vector $\mathbf{x}$ and then summed over in this term
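The training and evaluation steps can be sketched end to end in 1-D. Assumptions: a single-channel signal, a naive DFT instead of an FFT, the kernel argument divided by the number of elements (as KCF implementations commonly do), and hypothetical values for the signal, sigma and lambda. With the conjugate placed on the first argument of the correlation, a target moved by +s samples produces a response peak at index -s (mod n).

```python
import cmath
import math

def dft(v, inverse=False):
    """Naive O(n^2) DFT (stand-in for an FFT)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out

def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x2 and all cyclic shifts of x1."""
    n = len(x1)
    f1, f2 = dft(x1), dft(x2)
    c = [v.real for v in dft([a.conjugate() * b for a, b in zip(f1, f2)], inverse=True)]
    norms = sum(a * a for a in x1) + sum(b * b for b in x2)
    return [math.exp(-abs(norms - 2.0 * ci) / (sigma ** 2 * n)) for ci in c]

def train(x, y, sigma, lam):
    kf = dft(kernel_correlation(x, x, sigma))
    return [yf / (k + lam) for yf, k in zip(dft(y), kf)]

def detect(alpha_f, x, z, sigma):
    kf = dft(kernel_correlation(z, x, sigma))
    return [v.real for v in dft([a * k for a, k in zip(alpha_f, kf)], inverse=True)]

n = 16
# Hypothetical 1-D "appearance" signal and a Gaussian-shaped target peaked at shift 0.
x = [math.sin(0.9 * i) + 0.5 * math.cos(2.3 * i) + (1.0 if i == 5 else 0.0) for i in range(n)]
y = [math.exp(-0.5 * (min(i, n - i) / 1.5) ** 2) for i in range(n)]
alpha_f = train(x, y, sigma=0.5, lam=1e-4)

shift = 3
z = [x[(i - shift) % n] for i in range(n)]        # target moved by +3 samples
response = detect(alpha_f, x, z, sigma=0.5)
peak = max(range(n), key=lambda i: response[i])
estimated_shift = (-peak) % n                      # undo the sign convention
print(estimated_shift)
```

After detection, a tracker would re-train on the patch at the new location and blend the new model and classifier with the old ones by linear interpolation.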
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• Use HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)
Training and detection (Matlab)
function alphaf = train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);  % ridge regression, solved per frequency
end
function y = detect(alphaf, x, z, sigma)
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));  % response for all cyclic shifts of z
end
function k = kernel_correlation(x1, x2, sigma)
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));  % sum over channel dimension in kernel computation
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));  % Gaussian kernel
end
From KCF to Discriminative CF trackers
Basic
• Henriques et al. – CSK
– raw grayscale pixel values as features
• Henriques et al. – KCF
– HoG multi-channel features
Further work
• Danelljan et al. – DSST
– PCA-HoG + grayscale pixel features
– filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
– HoG, color-naming and grayscale pixel features
– quantize scale space and normalize each scale to one size by bilinear interpolation → only one filter on normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. – SRDCF
– spatial regularization in the learning process
→ limits boundary effect
→ penalizes filter coefficients depending on their spatial location
– allows the use of a much larger search region
– more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
– features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
– for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. – MDNet, T-CNN
– CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? “The zero-order tracker”
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
[Figure: timeline of the VOT challenges, with the VOT2013 to VOT2016 submission deadlines marked. Results papers: ICCV 2013 (51 coauthors, 14 pages), ECCV 2014 (57 coauthors, 27 pages), ICCV 2015 (128 coauthors, 24 pages), ECCV 2016 (141 coauthors, 44 pages), plus two VOT-TIR papers (69 and 70 coauthors).]
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
          Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013   ranks, A, R        16, s. manual    manual       per frame   27
VOT2014   ranks, A, R, EFO   25, s. manual    manual       per frame   38
VOT2015   EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016   EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR
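In the table, A and R stand for accuracy and robustness. A rough sketch of what they measure, assuming axis-aligned boxes given as (x, y, width, height), accuracy as the mean overlap over frames where the target is still tracked, and a failure counted whenever the overlap drops to zero. The real VOT protocol also re-initializes the tracker after each failure and discards some frames after re-initialization, which this sketch leaves out. The boxes are made-up examples.

```python
def overlap(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)

def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap on tracked frames, and the number of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures

ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (30, 30, 5, 5)]  # perfect, half off, lost
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
print(accuracy, failures)
```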
Class of trackers tested
• Single-object, single-camera
• Short-term
– Trackers performing without re-detection
• Causality
– Tracker is not allowed to use any future frames
• No prior knowledge about the target
– Only a single training example – BBox in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
ALOV (315 seq.) [Smeulders et al. 2013] + PTR (~50 seq.) [Vojir et al. 2013] + OTB (~50 seq.) [Wu et al. 2013] + >30 new sequences from the VOT2015 committee = 443 sequences
Filtered out:
• Grayscale sequences
• <400 pixels targets
• Poorly-defined targets
• Artificially created sequences
Example: Poorly defined target. Example: Artificially created.
→ 356 sequences
VOT Automatic Dataset Construction Protocol: cluster + sample
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → Cluster similar sequences
Construction (3/3): Sampling
• Requirement:
  • Diverse visual attributes
  • Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
1. Segment the target (semi-automatic)
2. Automatically fit a bounding box by optimizing a cost function
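Step 2 can be illustrated with a toy example. The cost used here (foreground pixels left outside the box, weighted by `alpha`, plus background pixels inside the box) is an assumption, as are the axis-aligned boxes, the brute-force search and the mask; the slide does not give the actual cost function.

```python
def fit_box(mask, alpha):
    """Brute-force the axis-aligned box (left, top, right, bottom) with minimal cost."""
    h, w = len(mask), len(mask[0])
    total_fg = sum(sum(row) for row in mask)
    best = None
    for top in range(h):
        for bottom in range(top, h):
            for left in range(w):
                for right in range(left, w):
                    fg_in = sum(mask[r][c]
                                for r in range(top, bottom + 1)
                                for c in range(left, right + 1))
                    area = (bottom - top + 1) * (right - left + 1)
                    # foreground missed by the box + background swallowed by it
                    cost = alpha * (total_fg - fg_in) + (area - fg_in)
                    if best is None or cost < best[0]:
                        best = (cost, (left, top, right, bottom))
    return best[1]

# Hypothetical segmentation mask: a 3x3 object with one stray pixel sticking out.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]

print(fit_box(mask, alpha=4.0))  # high penalty for lost foreground: tight box
print(fit_box(mask, alpha=0.5))  # low penalty: the stray pixel is dropped
```

The weight trades off covering every object pixel against including background, which matters for thin protrusions such as limbs or tails.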
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty – better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot]
VOT 2016: Tracking speed
• Top-performers slowest
  • Plausible cause: CNN
• Real-time bound: Staple+
  • Decent accuracy
  • Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: Submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
– target learning (modelling, “template update”)
– integration of detection and temporal smoothness assumptions
– representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague
established 1707
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 Academics (3 full, 2 associate, 7 assistant professors), 15 Researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant part of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please?
What is here in libraries? OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib
What is here in libraries? BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV
Computer vision librar
Bad news: OpenCV
http://www.votchallenge.net
Reference implementation
But authors often publish their own implementation on GitHub…
git clone https://github.com/martin-danelljan/Continuous-ConvOp.git
Good news
Good news? Not so
Good news: Easy to contribute to open source
Just implement some modern tracking algorithm
So we need to understand how tracking works
Classic: KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [54]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
slide credit: Tomas Svoboda
The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.
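Written out, the alignment goal is a least-squares problem. The formulation below is the standard Lucas-Kanade one, given as a sketch; the warp $W(\mathbf{x};\mathbf{p})$ and the parameter vector $\mathbf{p}$ are notation assumed here, since the equations of the original slides did not survive extraction.

```latex
% SSD cost between the warped input image and the template
\min_{\mathbf{p}} \; \sum_{\mathbf{x}} \bigl[ I(W(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \bigr]^{2}

% Pure translation: W(x; p) = x + p. Linearising I around the current p
% gives the Gauss-Newton update
\Delta\mathbf{p} = H^{-1} \sum_{\mathbf{x}} \nabla I^{\top} \bigl[ T(\mathbf{x}) - I(\mathbf{x}+\mathbf{p}) \bigr],
\qquad
H = \sum_{\mathbf{x}} \nabla I^{\top} \nabla I,
\qquad
\mathbf{p} \leftarrow \mathbf{p} + \Delta\mathbf{p}
```

Here $\nabla I$ is the image gradient (a row vector) evaluated at $\mathbf{x}+\mathbf{p}$, and $H$ is the 2x2 matrix that must be well conditioned for the update to be reliable.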
What is here in libraries BoofCV
bull SparseFlow (KLT tracker) (1991)
bull MeanShift (1998)
bull TLD (2011)
bull KCF (2012)
39150
httpsgithubcomlessthanoptimalBoofCV
Computer vision librar
Bad news OpenCV
40150httpwwwvotchallengenet
Reference implementation
41150
But authors often publish their own implementation on githubhellip
git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit
Good news
42150
Good news Not so
43150
Good news Not so
44150
Good newsEasy to contribute to open source
Just implement some modern tracking algorithm
45150
So we need to understand how tracking works
Classic
KLT tracker
46150
The KLT Tracker
47150Slide credit Kris Kitani
KLT Tracker
slide creditTomas Svoboda
48150
Importance in Computer Vision
bull Firstly published in 1981 as an image registration method [3]
bull Improved many times most importantly by Carlo Tomasi [54]
bull Free implementaton(s) available1
bull After more than two decades a project2 at CMU dedicated to this
bull single algorithm and results published in a premium journal [1]
bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html
Image alignment
slide creditTomas Svoboda 49150
Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image
Original Lucas-Kanade algorithm I
slide creditTomas Svoboda 50150
Original Lucas-Kanade algorithm II
slide creditTomas Svoboda 51150
Original Lucas-Kanade algorithm III
slide creditTomas Svoboda 52150
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
bull cost function sum of squared intensity differences
between template and window
bull optimization technique gradient descent
bull model learning no update last frame convex
combination
bull attractive properties
ndashfast
ndasheasily extended to image-to-image transformations with
multiple parameters
59150
What about deep
learning or
CNN for tracking
60150
61150
AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo
Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016
Recap CaffeNet architecture
63
Recent trackers development
74150httpsgithubcomfoolwoodbenchmark_results
CNN KCF
Discriminative Tracking (T by Detection)
t=0
samples
labels+1 +1 +1 minus1 minus1 minus1
Classifier
tgt0
hellipClassify subwindows to find target
75
Connection to Correlation
The Convolution Theorem
ldquoCross-correlation is equivalent to an
element-wise product in Fourier domainrdquo
bull where
ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858
(likewise for ො119857 and ෝ119856)
ndashtimes is element-wise product
ndash lowast is complex-conjugate (ie negate imaginary part)
119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺
bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)
76
Kernelized Correlation Filters
bull Circulant matrices are a very general tool which allows to replace
standard operations with fast Fourier operations
bull The same idea can by applied eg to the Kernel Ridge Regression
with K kernel matrix Kij = (xi xj) and dual space representation
bull For many kernels circulant data circulant K matrix
bull Diagonalizing with the DFT for learning the classifier yields
120630 = 119870 + 120582119868 minus1119858
119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)
ෝ120630 =ො119858
መ119844119857119857+ 120582
Fast solution in 119978 119899 log 119899 Typical kernel algorithms are
119978 1198992 or higher
rArr
77
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
bull very few
hyperparameters
bull code fits on one slide
of the presentation
bull Use HoG features
(32 channels)
bull ~300 FPS
bull Open-Source
(MatlabPythonJavaC)
function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)
end
function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))
end
function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))
end
Training and detection (Matlab)
Sum over channel dimensionin kernel computation
79
From KCF to Discriminative CF trackers
Basic
• Henriques et al. - CSK
  - raw grayscale pixel values as features
• Henriques et al. - KCF
  - HoG multi-channel features
Further work
• Danelljan et al. - DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al. - SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows the use of a much larger search region
  - more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? "The zero-order tracker"
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
[Figure: timeline marking the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines and the resulting papers]
• ICCV2013: 51 coauthors, 14 pages
• ECCV2014: 57 coauthors, 27 pages
• ICCV2015: 128 coauthors, 24 pages
• ECCV2016: 141 coauthors, 44 pages
• + VOT-TIR papers (69 and 70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
Challenge | Perf. measures  | Dataset size   | Target box | Property  | Trackers tested
VOT2013   | ranks, A, R     | 16, s. manual  | manual     | per frame | 27
VOT2014   | ranks, A, R, EFO | 25, s. manual | manual     | per frame | 38
VOT2015   | EAO, A, R, EFO  | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016   | EAO, A, R, EFO  | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
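The A (accuracy) and R (robustness) measures in the table build on bounding-box overlap with the ground truth and on counting tracking failures. A simplified Python sketch, assuming axis-aligned boxes given as (x, y, w, h) and hypothetical helper names; the actual VOT toolkit additionally re-initializes the tracker after a failure and handles more general regions, which is not modelled here:

```python
def overlap(a, b):
    """Intersection over union of two axis-aligned boxes (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over tracked frames, and number of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


# Made-up boxes for three frames: perfect, partial and lost.
pred = [(0, 0, 2, 2), (0, 0, 2, 2), (5, 5, 1, 1)]
gt = [(0, 0, 2, 2), (1, 1, 2, 2), (0, 0, 2, 2)]
acc, fails = accuracy_and_failures(pred, gt)
```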
Class of trackers tested
• Single-object, single-camera
• Short-term
  - Trackers performing without re-detection
• Causality
  - Tracker is not allowed to use any future frames
• No prior knowledge about the target
  - Only a single training example: BBox in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
VOT Automatic Dataset Construction Protocol: cluster + sample
• Candidate pool: ALOV (315 seq.) [Smeulders et al. 2013] + PTR (~50 seq.) [Vojir et al. 2013] + OTB (~50 seq.) [Wu et al. 2013] + >30 new sequences from the VOT2015 committee = 443 sequences
• Filtered out:
  - Grayscale sequences
  - <400 pixels targets
  - Poorly-defined targets
  - Artificially created sequences
[Figure: example of a poorly defined target; example of an artificially created sequence]
• Remaining: 356 sequences
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Feature encoding: 11 dim
• Cluster similar sequences into K = 28 groups (automatic selection of K) with Affinity Propagation [Frey, Dueck 2007]
Construction (3/3): Sampling
• Requirement:
  - Diverse visual attributes
  - Challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
1. Segment the target (semi-automatic)
2. Automatically fit a bounding box by optimizing a cost function
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
(1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (vertical axis: accuracy)]
VOT 2016: Tracking speed
• Top performers are the slowest
  - Plausible cause: CNN
• Real-time bound: Staple+
  - Decent accuracy
  - Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: tracking speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant % of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Bad news: OpenCV
http://www.votchallenge.net
Reference implementation
But authors often publish their own implementation on GitHub…
git clone https://github.com/martin-danelljan/Continuous-ConvOp.git
Good news
Good news? Not so
Good news: easy to contribute to open source
Just implement some modern tracking algorithm
So we need to understand how tracking works
Classic: KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [54]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
Slide credit: Tomas Svoboda
The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing image coordinates [x, y]. I(x) could also be a small subwindow within an image.
Original Lucas-Kanade algorithm I-IV
Original Lucas-Kanade summary
Slide credit: Tomas Svoboda
KLT Tracker
For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
Slide credit: Patrick Perez
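The three cases can be told apart by the eigenvalues of the 2x2 gradient (structure) matrix of the patch: both near zero for a uniform patch, one near zero for an edge, both clearly positive for a corner. A minimal Python sketch on made-up 6x6 patches, assuming central-difference gradients and a hypothetical helper name:

```python
import math


def structure_tensor_eigenvalues(patch):
    """Eigenvalues (small, large) of the 2x2 gradient matrix of a patch."""
    rows, cols = len(patch), len(patch[0])
    gxx = gxy = gyy = 0.0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            ix = (patch[r][c + 1] - patch[r][c - 1]) / 2.0  # horizontal gradient
            iy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0  # vertical gradient
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
    # Closed-form eigenvalues of the symmetric matrix [[gxx, gxy], [gxy, gyy]].
    mean = (gxx + gyy) / 2.0
    root = math.sqrt(((gxx - gyy) / 2.0) ** 2 + gxy ** 2)
    return mean - root, mean + root


uniform = [[1.0] * 6 for _ in range(6)]
edge = [[1.0 if c >= 3 else 0.0 for c in range(6)] for r in range(6)]
corner = [[1.0 if r >= 3 and c >= 3 else 0.0 for c in range(6)] for r in range(6)]

eig_uniform = structure_tensor_eigenvalues(uniform)
eig_edge = structure_tensor_eigenvalues(edge)
eig_corner = structure_tensor_eigenvalues(corner)
```

A patch is a reasonable candidate for tracking when the smaller eigenvalue exceeds some threshold; the threshold itself depends on image scale and noise and is left open here.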
Aperture Problem: Example
Image from Gary Bradski slides
If we look through a small hole…
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
- Arbitrary displacement
• Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
- Translation is usually sufficient for small fragments, but:
• Perspective transforms and occlusions cause drift and loss
- Two complementary options:
• Kill tracklets when the minimum SSD is too large
• Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters
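To make the cost function and its optimization concrete, here is a minimal 1-D Python sketch of translation-only alignment. It is an illustration under stated assumptions, not the original algorithm: the "image" is a synthetic analytic blob (so gradients are exact), the function names are hypothetical, and the update is the Gauss-Newton step commonly used for Lucas-Kanade rather than plain gradient descent.

```python
import math


def image(x, centre):
    """Synthetic 1-D 'image': a smooth blob centred at `centre`."""
    return math.exp(-((x - centre) ** 2) / (2.0 * 3.0 ** 2))


def image_grad(x, centre):
    """Analytic derivative of `image` with respect to x."""
    return -(x - centre) / 3.0 ** 2 * image(x, centre)


def lucas_kanade_1d(template_centre, input_centre, window, iterations=10):
    """Estimate the translation p minimising sum_x (T(x) - I(x + p))^2."""
    p = 0.0
    for _ in range(iterations):
        num = den = 0.0
        for x in window:
            g = image_grad(x + p, input_centre)  # gradient at warped position
            r = image(x, template_centre) - image(x + p, input_centre)  # residual
            num += g * r
            den += g * g
        if den == 0.0:  # textureless window: the step is undefined
            break
        p += num / den  # Gauss-Newton update of the translation
    return p


true_shift = 1.5
p_hat = lucas_kanade_1d(10.0, 10.0 + true_shift, window=range(21))
```

The denominator is the 1-D analogue of the gradient matrix: when the window has no texture it vanishes and no displacement can be estimated.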
What about deep learning or CNN for tracking?
Recap: CaffeNet architecture
AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012
CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016
Recent trackers development
https://github.com/foolwood/benchmark_results
[Figure labels: CNN, KCF]
Discriminative Tracking (Tracking by Detection)
• t = 0: samples with labels (+1 +1 +1 −1 −1 −1) train the classifier
• t > 0: classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
"Cross-correlation is equivalent to an element-wise product in the Fourier domain"
$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^* \times \hat{\mathbf{w}}$$
• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $^*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
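The identity is easy to check numerically on a short signal. A small Python sketch, assuming real-valued inputs and a naive O(n²) DFT helper (`dft`, hypothetical, used only to stay dependency-free; in practice one would call an FFT routine):

```python
import cmath


def dft(v, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [
        sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [c / n for c in out] if inverse else out


def xcorr_direct(x, w):
    """Cyclic cross-correlation: y[i] = sum_j x[j] * w[(j + i) mod n]."""
    n = len(x)
    return [sum(x[j] * w[(j + i) % n] for j in range(n)) for i in range(n)]


def xcorr_fourier(x, w):
    """Same result via y_hat = conj(x_hat) * w_hat, element-wise."""
    prod = [a.conjugate() * b for a, b in zip(dft(x), dft(w))]
    return [c.real for c in dft(prod, inverse=True)]


x = [1.0, 2.0, 3.0, 4.0]
w = [0.0, 1.0, 0.5, 0.0]
y_direct = xcorr_direct(x, w)
y_fourier = xcorr_fourier(x, w)
```

The index arithmetic `(j + i) % n` is where the cyclic wrap-around at the window edges shows up.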
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied, e.g., to Kernel Ridge Regression with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual space representation
$$\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y}$$
• For many kernels, circulant data ⇒ circulant $K$ matrix:
$K = C(\mathbf{k}^{\mathbf{x}\mathbf{x}})$, where $\mathbf{k}^{\mathbf{x}\mathbf{x}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
$$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{x}\mathbf{x}} + \lambda}$$
⇒ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
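The Fourier-domain solution can be checked against the defining linear system without inverting any matrix: compute the classifier from the element-wise division, then verify that (K + λI)α reproduces y. A Python sketch under stated assumptions: the first row `k` is a made-up symmetric vector (as a kernel auto-correlation is), the helper names are hypothetical, and the naive `dft` only stands in for an FFT.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [
        sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]
    return [c / n for c in out] if inverse else out


def solve_circulant_ridge(k, y, lam):
    """alpha = (C(k) + lam*I)^-1 y via alpha_hat = y_hat / (k_hat + lam).

    Assumes k is symmetric, i.e. k[i] == k[(n - i) % n].
    """
    ratio = [b / (a + lam) for a, b in zip(dft(k), dft(y))]
    return [v.real for v in dft(ratio, inverse=True)]


def circulant_times(k, v):
    """Explicit product of the circulant matrix with first row k and v."""
    n = len(k)
    return [sum(k[(j - i) % n] * v[j] for j in range(n)) for i in range(n)]


n, lam = 8, 0.1
k = [math.exp(-min(i, n - i) ** 2 / 2.0) for i in range(n)]  # symmetric first row
y = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5]
alpha = solve_circulant_ridge(k, y, lam)
residual = max(
    abs(ka + lam * a - yi)
    for ka, a, yi in zip(circulant_times(k, alpha), alpha, y)
)
```

The explicit product costs O(n²) and is only used here for verification; the solve itself needs three transforms and one element-wise division.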
ndash representation of the image and target
bull Be careful when evaluating tracking results
100150
Computer vision courses online
bull httpscwfelcvutczwikicoursesucuws17start UCU Winter
school course
bull httpcsbrowneducoursescs143
bull httpswwwudacitycomcourseintroduction-to-computer-vision-
-ud810
bull httpcs231nstanfordedu
bull httpvisionstanfordeduteachingcs131_fall1415indexhtml
101150
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University Prague
established 1707
Area of Expertise Computer Vision Image Processing Pattern Recognition
People12 Academics 3 full 2 associate 7 assistant professors15 Researchers15 Phd students 15 MSc and BSc students
Impact amp Excellencesignificant of funding (gt75) from EU and private companies collaboration with high-tech companies (Microsoft Google Toyota VW Daimler Samsung Boeing Hitachi Honeywell) numerous science prizes hundreds of ISI citations to our publications per year competitive results in contests
Visual Recognition Group (headJiri Matas)
We are looking for 1-3
good students for PhD
THANK YOU
Questions please
105150
Good news
Good news? Not so
Good news: easy to contribute to open source
Just implement some modern tracking algorithm
So we need to understand how tracking works
Classic: KLT tracker
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU was dedicated to this single algorithm, and the results were published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
Slide credit: Tomas Svoboda
The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]^T. I(x) could also be a small subwindow within an image.
Original Lucas-Kanade algorithm I
Original Lucas-Kanade algorithm II
Original Lucas-Kanade algorithm III
Original Lucas-Kanade algorithm IV
Original Lucas-Kanade summary
Slide credit: Tomas Svoboda
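The equations of the Lucas-Kanade slides did not survive in this transcript, so the following is only a minimal sketch of the idea under stated assumptions: a 1-D signal instead of an image, a pure translation p as the only warp parameter, and Gauss-Newton steps on the sum of squared differences between T(x) and I(x + p). The helper names (`sample`, `lucas_kanade_shift`, `bump`) and the test signal are hypothetical, chosen for illustration.

```python
import math


def sample(signal, pos):
    """Linearly interpolate `signal` at a real-valued position (clamped at the borders)."""
    pos = min(max(pos, 0.0), len(signal) - 1.0)
    i = min(int(math.floor(pos)), len(signal) - 2)
    frac = pos - i
    return (1.0 - frac) * signal[i] + frac * signal[i + 1]


def lucas_kanade_shift(template, image, iterations=20, tol=1e-6):
    """Estimate p such that image(x + p) ~= template(x), by Gauss-Newton on the SSD."""
    p = 0.0
    for _ in range(iterations):
        num = 0.0
        den = 0.0
        for x in range(len(template)):
            warped = sample(image, x + p)
            # Central-difference gradient of the warped image.
            grad = 0.5 * (sample(image, x + p + 1) - sample(image, x + p - 1))
            error = template[x] - warped
            num += grad * error
            den += grad * grad
        if den == 0.0:  # uniform signal: no information, stop
            break
        dp = num / den
        p += dp
        if abs(dp) < tol:
            break
    return p


def bump(center, n=64, sigma=5.0):
    """Smooth synthetic signal: a Gaussian bump centred at `center`."""
    return [math.exp(-0.5 * ((x - center) / sigma) ** 2) for x in range(n)]


template = bump(30.0)
image = bump(32.3)  # the same bump, displaced by 2.3 samples
shift = lucas_kanade_shift(template, image)
print(round(shift, 2))
```

The estimate should land close to the true displacement of 2.3 samples. In 2-D the scalar `den` becomes a 2x2 matrix of gradient products, which is where the conditioning requirement on the patch comes from.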
KLT Tracker
For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
Slide credit: Patrick Perez
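The three cases above can be illustrated with the smaller eigenvalue of the 2x2 matrix of summed gradient products, which is the usual "good features to track" criterion. This is a sketch under assumptions: central-difference gradients, tiny synthetic 8x8 patches, and a hypothetical function name `min_eigenvalue`.

```python
import math


def min_eigenvalue(patch):
    """Smaller eigenvalue of the 2x2 gradient matrix summed over the patch interior."""
    rows, cols = len(patch), len(patch[0])
    gxx = gxy = gyy = 0.0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            ix = 0.5 * (patch[r][c + 1] - patch[r][c - 1])
            iy = 0.5 * (patch[r + 1][c] - patch[r - 1][c])
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
    # Closed-form eigenvalues of the symmetric matrix [[gxx, gxy], [gxy, gyy]].
    mean = 0.5 * (gxx + gyy)
    spread = math.sqrt((0.5 * (gxx - gyy)) ** 2 + gxy ** 2)
    return mean - spread


SIZE = 8
uniform = [[1.0] * SIZE for _ in range(SIZE)]
edge = [[1.0 if c >= 4 else 0.0 for c in range(SIZE)] for r in range(SIZE)]
corner = [[1.0 if (r >= 4 and c >= 4) else 0.0 for c in range(SIZE)]
          for r in range(SIZE)]

scores = {name: min_eigenvalue(p)
          for name, p in [("uniform", uniform), ("edge", edge), ("corner", corner)]}
print(scores)
```

The uniform patch and the straight edge both score zero (no information, or information along one direction only), while the corner scores clearly above zero.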
Aperture Problem Example
Image from Gary Bradski slides
If we look through a small hole…
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
- Translation is usually sufficient for small fragments, but
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
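The two checks can be sketched as follows. This is a simplified illustration, not the method of the cited slides: the comparison with the initial patch is done without the affine warp, the threshold `max_ssd` is a hypothetical tuning parameter, and the function names are made up for the example.

```python
def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(patch_a, patch_b)
               for a, b in zip(row_a, row_b))


def keep_tracklet(initial_patch, previous_patch, current_patch, max_ssd):
    """Return True only if the current patch still matches both references."""
    frame_to_frame_ok = ssd(previous_patch, current_patch) <= max_ssd
    # Assumption: no affine warp is applied before comparing with the initial patch.
    drift_ok = ssd(initial_patch, current_patch) <= max_ssd
    return frame_to_frame_ok and drift_ok


initial = [[10, 10], [10, 10]]
previous = [[12, 12], [12, 12]]
current = [[13, 13], [13, 13]]

# Frame-to-frame SSD is 4, but SSD against the initial patch is 36: slow drift.
print(keep_tracklet(initial, previous, current, max_ssd=20))
```

The example shows why the second check matters: each frame-to-frame step looks fine, yet the accumulated change relative to the first frame exceeds the threshold.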
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters
What about deep learning or CNN for tracking?
AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012
CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016
Recap: CaffeNet architecture
Recent trackers development
https://github.com/foolwood/benchmark_results
CNN, KCF
Discriminative Tracking (Tracking by Detection)
t = 0: samples; labels +1 +1 +1 -1 -1 -1; classifier
t > 0: … classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
"Cross-correlation is equivalent to an element-wise product in the Fourier domain"
$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$
• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
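The identity can be checked numerically on a short 1-D signal. The sketch below assumes the convention y[k] = sum_n x[n] w[(n + k) mod N] for cyclic cross-correlation and uses a naive O(n^2) DFT written out by hand so that it stays self-contained; a real tracker would call an FFT routine instead. The function names are hypothetical.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT routine)."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for f in range(n):
        acc = 0j
        for t, v in enumerate(values):
            acc += v * cmath.exp(sign * 2j * cmath.pi * f * t / n)
        out.append(acc / n if inverse else acc)
    return out


def cross_correlation_direct(x, w):
    """y[k] = sum_n x[n] * w[(n + k) mod N], with cyclic wrap-around."""
    n = len(x)
    return [sum(x[i] * w[(i + k) % n] for i in range(n)) for k in range(n)]


def cross_correlation_fourier(x, w):
    """Same result via y_hat = conj(x_hat) * w_hat (element-wise product)."""
    x_hat, w_hat = dft(x), dft(w)
    y_hat = [xf.conjugate() * wf for xf, wf in zip(x_hat, w_hat)]
    return [v.real for v in dft(y_hat, inverse=True)]


x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]
w = [0.0, 1.0, 0.5, 0.0, 0.0, -2.0]
direct = cross_correlation_direct(x, w)
fourier = cross_correlation_fourier(x, w)
max_diff = max(abs(a - b) for a, b in zip(direct, fourier))
print(max_diff)
```

Both routes agree up to floating-point rounding, including the wrap-around terms, which is the cyclic behaviour the note on the slide refers to.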
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows replacing standard operations with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual space representation
  $$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$$
• For many kernels: circulant data ⇒ circulant $K$ matrix
  $$K = C(\mathbf{k}^{\mathbf{xx}})$$
  where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
  $$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$$
  ⇒ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
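To make the diagonalization concrete, the sketch below solves the dual problem in the Fourier domain for a 1-D signal and then checks the result against the dense system (K + lambda I) alpha = y. Assumptions: a linear kernel (so k^xx is the plain cyclic auto-correlation), a hand-written naive DFT in place of an FFT (so this illustrates correctness, not the O(n log n) speed), and hypothetical names and data.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT routine)."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for f in range(n):
        acc = 0j
        for t, v in enumerate(values):
            acc += v * cmath.exp(sign * 2j * cmath.pi * f * t / n)
        out.append(acc / n if inverse else acc)
    return out


def solve_dual_fourier(x, y, lam):
    """Return (kxx, alpha) with alpha_hat = y_hat / (kxx_hat + lam)."""
    n = len(x)
    # Linear-kernel auto-correlation: kxx[i] = <x, x cyclically shifted by i>.
    kxx = [sum(x[t] * x[(t + i) % n] for t in range(n)) for i in range(n)]
    k_hat, y_hat = dft(kxx), dft(y)
    alpha_hat = [yf / (kf + lam) for yf, kf in zip(y_hat, k_hat)]
    alpha = [v.real for v in dft(alpha_hat, inverse=True)]
    return kxx, alpha


x = [0.2, 1.0, -0.5, 0.3, 0.8, -1.2, 0.1, 0.4]
y = [1.0, 0.6, 0.1, 0.0, 0.0, 0.0, 0.1, 0.6]
lam = 0.01
kxx, alpha = solve_dual_fourier(x, y, lam)

# Dense check: K = C(kxx) is the circulant matrix whose first row is kxx.
n = len(x)
K = [[kxx[(j - i) % n] for j in range(n)] for i in range(n)]
residual = max(
    abs(sum(K[i][j] * alpha[j] for j in range(n)) + lam * alpha[i] - y[i])
    for i in range(n)
)
print(residual)
```

The residual is at the level of rounding error, so the element-wise division in the Fourier domain gives the same alpha as inverting the full kernel matrix.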
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
  $$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$$
• For the Gaussian kernel it yields
  $$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$$
  (multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term)
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}}\right)$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features
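The update by linear interpolation amounts to a running average of the old and the newly estimated quantities. A minimal sketch, where the name `interpolate_model` and the parameter `rate` (the interpolation factor) are hypothetical and the values are toy data:

```python
def interpolate_model(old, new, rate):
    """Element-wise (1 - rate) * old + rate * new; also works for complex entries."""
    return [(1.0 - rate) * o + rate * n for o, n in zip(old, new)]


model_x = [0.0, 1.0, 2.0]   # current target model
new_x = [1.0, 1.0, 0.0]     # patch extracted at the location of maximum response
updated = interpolate_model(model_x, new_x, rate=0.25)
print(updated)  # [0.25, 1.0, 1.5]
```

The same function can be applied to the Fourier-domain classifier coefficients, since it only relies on addition and scaling.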
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)
Training and detection (Matlab):
    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      % sum(..., 3): sum over the channel dimension in the kernel computation
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
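As a rough cross-check of the training and detection steps, here is a 1-D, single-channel Python port. It is a sketch under assumptions, not the reference implementation: a naive hand-written DFT replaces `fft2`, the signal, the Gaussian-shaped regression target and all parameter values are made up, and the test patch is simply a cyclic shift of the training patch.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT routine)."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for f in range(n):
        acc = 0j
        for t, v in enumerate(values):
            acc += v * cmath.exp(sign * 2j * cmath.pi * f * t / n)
        out.append(acc / n if inverse else acc)
    return out


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of x1 with every cyclic shift of x2."""
    n = len(x1)
    prod = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    c = [v.real for v in dft(prod, inverse=True)]
    energy = sum(v * v for v in x1) + sum(v * v for v in x2)
    return [math.exp(-abs(energy - 2.0 * ci) / (sigma ** 2 * n)) for ci in c]


def train(x, y, sigma, lam):
    k_hat = dft(kernel_correlation(x, x, sigma))
    return [yf / (kf + lam) for yf, kf in zip(dft(y), k_hat)]


def detect(alpha_hat, x, z, sigma):
    k_hat = dft(kernel_correlation(z, x, sigma))
    prod = [a * k for a, k in zip(alpha_hat, k_hat)]
    return [v.real for v in dft(prod, inverse=True)]


N = 32
x = [math.sin(0.9 * t) + 0.5 * math.cos(2.3 * t + 1.0) for t in range(N)]
# Regression target: Gaussian peak at index 0, wrapped cyclically.
y = [math.exp(-0.5 * (min(t, N - t) / 2.0) ** 2) for t in range(N)]
alpha_hat = train(x, y, sigma=0.5, lam=1e-3)

shift = 5
z = [x[(t - shift) % N] for t in range(N)]  # training signal moved by 5 samples
response = detect(alpha_hat, x, z, sigma=0.5)
peak = max(range(N), key=lambda t: response[t])
print(peak)
```

With the argument order used here, the response peak moves to index (N - shift) mod N, so the displacement is read off modulo the window size; detecting on the unshifted signal puts the peak at index 0.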
From KCF to Discriminative CF trackers
Basic
• Henriques et al. - CSK
  - raw grayscale pixel values as features
• Henriques et al. - KCF
  - HoG multi-channel features
Further work
• Danelljan et al. - DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al. - SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows using a much larger search region
  - more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? "The zero-order tracker"
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
[Figure: timeline with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked; axis ticks at 1500 and 3000]
• ICCV2013: 51 coauthors, 14 pages
• ECCV2014: 57 coauthors, 27 pages
• ICCV2015: 128 coauthors, 24 pages
• ECCV2016: 141 coauthors, 44 pages
• + VOT-TIR paper (69 coauthors)
• + VOT-TIR paper (70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

Challenge   Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013     ranks, A, R        16, s. manual    manual       per frame   27
VOT2014     ranks, A, R, EFO   25, s. manual    manual       per frame   38
VOT2015     EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016     EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR
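The "A" (accuracy) measure in the table is based on the overlap between the predicted and the ground-truth region. A minimal sketch of that overlap, assuming axis-aligned boxes given as (x, y, width, height) and a plain per-sequence average; the actual VOT protocol has further details (for example, how frames around a tracker re-initialization are handled) that are not modelled here, and the function names are hypothetical.

```python
def overlap(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy(predicted, ground_truth):
    """Average overlap over the frames of one sequence."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(overlaps) / len(overlaps)


ground_truth = [(0, 0, 2, 2), (1, 0, 2, 2)]
predicted = [(0, 0, 2, 2), (2, 1, 2, 2)]
score = accuracy(predicted, ground_truth)
print(score)  # (1 + 1/7) / 2
```

The first frame is a perfect match (overlap 1) and the second shares one unit of area out of a union of seven, so the average is 4/7.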
Class of trackers tested
• Single-object, single-camera
• Short-term
  - Trackers performing without re-detection
• Causality
  - Tracker is not allowed to use any future frames
• No prior knowledge about the target
  - Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
• ALOV (315 seq.) [Smeulders et al., 2013] + PTR (~50 seq.) [Vojir et al., 2013] + OTB (~50 seq.) [Wu et al., 2013] + >30 new sequences from the VOT2015 committee: 443 sequences
• Filtered out:
  - Grayscale sequences
  - <400 pixels targets
  - Poorly defined targets
  - Artificially created sequences
• 356 sequences
[Figure: example of a poorly defined target; example of an artificially created sequence]
VOT Automatic Dataset Construction Protocol: cluster + sample
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Feature encoding (11-dim) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences
Construction (3/3): Sampling
• Requirement:
  - Diverse visual attributes
  - Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
Sequence ranking (Kristan et al., VOT2016 results)
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (accuracy axis) with C-COT, TCNN, SSAT and MLDF marked]
VOT 2016 Tracking speed
• Top performers are the slowest
  - Plausible cause: CNN
• Real-time bound: Staple+
  - Decent accuracy
  - Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: tracking speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics, Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Good news Not so
44150
Good newsEasy to contribute to open source
Just implement some modern tracking algorithm
45150
So we need to understand how tracking works
Classic
KLT tracker
46150
The KLT Tracker
47150Slide credit Kris Kitani
KLT Tracker
slide creditTomas Svoboda
48150
Importance in Computer Vision
bull Firstly published in 1981 as an image registration method [3]
bull Improved many times most importantly by Carlo Tomasi [54]
bull Free implementaton(s) available1
bull After more than two decades a project2 at CMU dedicated to this
bull single algorithm and results published in a premium journal [1]
bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html
Image alignment
slide creditTomas Svoboda 49150
Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image
Original Lucas-Kanade algorithm I to IV
Slide credit: Tomas Svoboda
Original Lucas-Kanade summary
Slide credit: Tomas Svoboda
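To make the iteration concrete, here is a minimal one-dimensional sketch of translation-only Lucas-Kanade. It is an illustration under stated assumptions, not the code from the slides: the "image" is an analytic Gaussian bump (so no interpolation is needed), and the names `bump`, `bump_grad` and `lucas_kanade_1d` are hypothetical.

```python
import math

def bump(x, center=20.0, width=5.0):
    """Smooth 1-D intensity profile standing in for an image row (assumption)."""
    return math.exp(-((x - center) ** 2) / (2.0 * width ** 2))

def bump_grad(x, center=20.0, width=5.0):
    """Analytic derivative of bump() with respect to x."""
    return -(x - center) / width ** 2 * bump(x, center, width)

def lucas_kanade_1d(true_shift, n=41, iterations=20):
    """Estimate p such that I(x + p) matches T(x), where I(x) = T(x - true_shift)."""
    xs = range(n)
    template = [bump(x) for x in xs]  # T(x)
    p = 0.0
    for _ in range(iterations):
        # Warped image I(x + p) and its gradient, evaluated analytically.
        warped = [bump(x + p - true_shift) for x in xs]
        grad = [bump_grad(x + p - true_shift) for x in xs]
        # Gauss-Newton step: dp = sum(g * (T - I)) / sum(g * g)
        num = sum(g * (t - w) for g, t, w in zip(grad, template, warped))
        den = sum(g * g for g in grad)
        p += num / den
    return p

estimate = lucas_kanade_1d(true_shift=1.5)
```

The denominator is the one-dimensional counterpart of the matrix H, which is why a patch without gradient cannot be tracked.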
KLT Tracker
For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
Slide credit: Patrick Perez
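The three cases above can be told apart by the eigenvalues of the 2x2 matrix of summed gradient products over the patch. The sketch below assumes central-difference gradients on tiny synthetic 7x7 patches; the function name and the patches are made up for illustration.

```python
import math

def structure_tensor_eigenvalues(patch):
    """Return (lambda_min, lambda_max) of the gradient matrix summed over the patch interior."""
    rows, cols = len(patch), len(patch[0])
    sxx = sxy = syy = 0.0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            ix = (patch[r][c + 1] - patch[r][c - 1]) / 2.0  # horizontal gradient
            iy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0  # vertical gradient
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
    # Closed-form eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]].
    mean = (sxx + syy) / 2.0
    spread = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    return mean - spread, mean + spread

size = 7
uniform = [[1.0] * size for _ in range(size)]
edge = [[1.0 if c >= 3 else 0.0 for c in range(size)] for r in range(size)]
corner = [[1.0 if (r >= 3 and c >= 3) else 0.0 for c in range(size)] for r in range(size)]

uniform_eigs = structure_tensor_eigenvalues(uniform)  # both eigenvalues zero
edge_eigs = structure_tensor_eigenvalues(edge)        # one zero, one positive
corner_eigs = structure_tensor_eigenvalues(corner)    # both positive
```

A small minimum eigenvalue signals that the translation estimate is ill-conditioned in at least one direction, which is the aperture problem in algebraic form.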
Aperture Problem: Example
Image from Gary Bradski's slides
If we look through a small hole...
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
- Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Also compare with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
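The first option can be sketched in a few lines. This is only an illustration: patches are flattened to lists, and the threshold `max_ssd` is a hypothetical tuning parameter that the slides do not specify.

```python
def ssd(template, window):
    """Sum of squared intensity differences between two equally sized patches."""
    return sum((t - w) ** 2 for t, w in zip(template, window))

def keep_tracklet(template, candidate_windows, max_ssd):
    """Return (keep, best_ssd); keep is False when even the best match is too poor."""
    best = min(ssd(template, w) for w in candidate_windows)
    return best <= max_ssd, best

template = [1.0, 2.0, 3.0]
candidates = [[1.0, 2.0, 4.0], [5.0, 5.0, 5.0]]
kept = keep_tracklet(template, candidates, max_ssd=2.0)
killed = keep_tracklet(template, candidates, max_ssd=0.5)
```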
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters
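The three model-learning options differ only in how much of the new patch enters the template. A minimal sketch, assuming flattened patches and a hypothetical learning rate `rate`:

```python
def update_template(model, patch, rate):
    """Convex combination of the old model and the newly tracked patch.

    rate = 0 keeps the initial template (no update),
    rate = 1 replaces it with the last frame,
    0 < rate < 1 blends the two.
    """
    return [(1.0 - rate) * m + rate * p for m, p in zip(model, patch)]

model = [0.0, 0.0]
patch = [1.0, 2.0]
no_update = update_template(model, patch, 0.0)
last_frame = update_template(model, patch, 1.0)
blended = update_template(model, patch, 0.25)
```

A small rate adapts slowly but resists drift; a rate of 1 follows appearance changes quickly and accumulates alignment errors.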
What about deep learning or CNNs for tracking?
Recap: CaffeNet architecture
AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012
CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016
Recent trackers development
https://github.com/foolwood/benchmark_results
CNN, KCF
Discriminative Tracking (Tracking by Detection)
t = 0: samples with labels +1 +1 +1 -1 -1 -1 are used to train a classifier
t > 0: classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
"Cross-correlation is equivalent to an element-wise product in the Fourier domain"
$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$
• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
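The identity can be checked numerically on a short signal. The sketch below uses a naive O(n²) DFT from the standard library so that it runs without extra packages; a real tracker would use an FFT. The convention assumed here is y[m] = Σ_t x[t] w[(t + m) mod n] for real-valued x.

```python
import cmath

def dft(v):
    """Naive discrete Fourier transform."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def cyclic_cross_correlation(x, w):
    """Direct evaluation: y[m] = sum_t x[t] * w[(t + m) mod n]."""
    n = len(x)
    return [sum(x[t] * w[(t + m) % n] for t in range(n)) for m in range(n)]

def cross_correlation_via_dft(x, w):
    """Same quantity through y_hat = conj(x_hat) * w_hat (element-wise)."""
    y_hat = [a.conjugate() * b for a, b in zip(dft(x), dft(w))]
    return [value.real for value in idft(y_hat)]

x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]
w = [0.0, 1.0, 0.5, 0.0, 0.0, 2.0]
direct = cyclic_cross_correlation(x, w)
fourier = cross_correlation_via_dft(x, w)
max_error = max(abs(a - b) for a, b in zip(direct, fourier))
```

The modulo in the direct version is the cyclic wrap-around mentioned in the note above.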
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied, e.g., to Kernel Ridge Regression with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual space representation
  $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$
• For many kernels: circulant data ⇒ circulant K matrix
  $K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
  $\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$
⇒ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
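As a sanity check, the Fourier-domain solution can be compared with the explicit linear system (K + λI)α = y. The sketch below is one-dimensional, uses a naive DFT, and evaluates the Gaussian kernel auto-correlation directly over all cyclic shifts; the signal, labels, σ and λ are arbitrary illustrative values.

```python
import cmath
import math

def dft(v):
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def gaussian_autocorrelation(x, sigma):
    """k[m] = exp(-||x - shift(x, m)||^2 / sigma^2) for every cyclic shift m."""
    n = len(x)
    return [math.exp(-sum((x[t] - x[(t - m) % n]) ** 2 for t in range(n)) / sigma ** 2)
            for m in range(n)]

def train(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat + lambda), returned in the signal domain."""
    k_hat = dft(gaussian_autocorrelation(x, sigma))
    alpha_hat = [yh / (kh + lam) for yh, kh in zip(dft(y), k_hat)]
    return [a.real for a in idft(alpha_hat)]

x = [0.0, 1.0, 0.5, 0.0, 0.2, 0.0, 0.0, 0.3]
y = [1.0, 0.6, 0.1, 0.0, 0.0, 0.0, 0.1, 0.6]
sigma, lam = 1.0, 0.1
n = len(x)
alpha = train(x, y, sigma, lam)

# Explicit circulant kernel matrix K[i][j] = k[(j - i) mod n]; check (K + lam*I) alpha = y.
k = gaussian_autocorrelation(x, sigma)
residual = max(abs(sum(k[(j - i) % n] * alpha[j] for j in range(n)) + lam * alpha[i] - y[i])
               for i in range(n))
```

The explicit matrix is built here only for verification; the point of the method is that it never has to be formed.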
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
  $k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$
• For the Gaussian kernel it yields
  $\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$
  Multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')$ term
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features
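The Gaussian kernel correlation formula can be verified against a direct evaluation of the kernel for every cyclic shift. This is a single-channel, one-dimensional sketch with a naive DFT; the test signals and σ are arbitrary, and the shift convention (element m compares x with x' advanced by m samples) is the one implied by this DFT sign convention.

```python
import cmath
import math

def dft(v):
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def gaussian_correlation_dft(x, x2, sigma):
    """exp(-(||x||^2 + ||x2||^2 - 2 * F^-1(conj(x_hat) * x2_hat)) / sigma^2)."""
    cross = [v.real for v in idft([a.conjugate() * b for a, b in zip(dft(x), dft(x2))])]
    norm = sum(v * v for v in x) + sum(v * v for v in x2)
    # Clamp at zero to guard against tiny negative values from rounding.
    return [math.exp(-max(norm - 2.0 * c, 0.0) / sigma ** 2) for c in cross]

def gaussian_correlation_direct(x, x2, sigma):
    """Gaussian kernel between x and every cyclic shift of x2, computed one shift at a time."""
    n = len(x)
    return [math.exp(-sum((x[t] - x2[(t + m) % n]) ** 2 for t in range(n)) / sigma ** 2)
            for m in range(n)]

x = [0.0, 1.0, 0.5, 0.0, 0.2, 0.0]
x2 = [0.3, 0.0, 0.0, 1.0, 0.4, 0.0]
fast = gaussian_correlation_dft(x, x2, sigma=0.8)
slow = gaussian_correlation_direct(x, x2, sigma=0.8)
max_error = max(abs(a - b) for a, b in zip(fast, slow))
```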
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)
Training and detection (Matlab)
function alphaf = train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);
end
function y = detect(alphaf, x, z, sigma)
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));
end
function k = kernel_correlation(x1, x2, sigma)
  % sum over the channel dimension (3) in the kernel computation
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
  d = x1(:)' * x1(:) + x2(:)' * x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end
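For readers without Matlab, the same three functions can be sketched in plain Python for a one-dimensional, single-channel signal. This is an illustrative port under assumptions, not the authors' implementation: it uses a naive DFT instead of `fft2`, and the signal, the Gaussian label width, σ and λ are arbitrary values chosen for the demo.

```python
import cmath
import math

def dft(v):
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]

def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation, normalised by the number of elements as on the slide."""
    n = len(x1)
    cross = idft([a.conjugate() * b for a, b in zip(dft(x1), dft(x2))])
    norm = sum(v * v for v in x1) + sum(v * v for v in x2)
    return [math.exp(-abs(norm - 2.0 * c) / (sigma ** 2 * n)) for c in cross]

def train(x, y, sigma, lam):
    """Returns alpha in the Fourier domain."""
    k_hat = dft(kernel_correlation(x, x, sigma))
    return [yh / (kh + lam) for yh, kh in zip(dft(y), k_hat)]

def detect(alpha_hat, x, z, sigma):
    """Response for every cyclic shift of the test signal z."""
    k_hat = dft(kernel_correlation(z, x, sigma))
    return [v.real for v in idft([a * k for a, k in zip(alpha_hat, k_hat)])]

n = 16
x = [0.0, 0.2, 1.0, 0.5, 0.1, 0.0, 0.0, 0.3, 0.0, 0.0, 0.6, 0.0, 0.0, 0.0, 0.0, 0.0]
# Regression target: Gaussian peak at index 0, measured with cyclic distance.
y = [math.exp(-min(t, n - t) ** 2 / (2.0 * 1.5 ** 2)) for t in range(n)]
alpha_hat = train(x, y, sigma=0.5, lam=1e-4)

shift = 3
z = [x[(t - shift) % n] for t in range(n)]  # target moved right by 3 samples
response_same = detect(alpha_hat, x, x, sigma=0.5)
response_shifted = detect(alpha_hat, x, z, sigma=0.5)
peak_same = max(range(n), key=lambda t: response_same[t])
peak_shifted = max(range(n), key=lambda t: response_shifted[t])
```

With the argument order used in `detect` and this DFT sign convention, a cyclic shift of s samples appears as a response peak at index (n - s) mod n; a tracker has to map the peak index back to a signed displacement accordingly.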
From KCF to Discriminative CF trackers
Basic
• Henriques et al. - CSK
  - raw grayscale pixel values as features
• Henriques et al. - KCF
  - HoG multi-channel features
Further work
• Danelljan et al. - DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al. - SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows a much larger search region to be used
  - more discriminative against background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? "The zero-order tracker"
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
[Figure: timeline with axis ticks at 1500 and 3000 and the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked]
• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages, plus VOT-TIR paper (69 coauthors)
• ECCV 2016: 141 coauthors, 44 pages, plus VOT-TIR paper (70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

Challenge | Perf. measures  | Dataset size   | Target box | Property  | Trackers tested
VOT2013   | ranks, A, R      | 16, s. manual  | manual     | per frame | 27
VOT2014   | ranks, A, R, EFO | 25, s. manual  | manual     | per frame | 38
VOT2015   | EAO, A, R, EFO   | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016   | EAO, A, R, EFO   | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
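The A and R columns refer to accuracy and robustness. A simplified sketch of how such measures can be computed from bounding boxes is given below, assuming accuracy is the mean overlap (intersection over union) on frames where the target is still tracked and robustness is counted as the number of zero-overlap frames. The re-initialization protocol of the VOT toolkit is deliberately not modelled, and the function names are hypothetical.

```python
def overlap(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over tracked frames and the count of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures

predicted = [(0, 0, 2, 2), (0, 0, 2, 2), (10, 10, 1, 1)]
ground_truth = [(0, 0, 2, 2), (1, 1, 2, 2), (0, 0, 2, 2)]
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
```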
Class of trackers tested
• Single-object, single-camera
• Short-term
  - Trackers performing without re-detection
• Causality
  - The tracker is not allowed to use any future frames
• No prior knowledge about the target
  - Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
VOT Automatic Dataset Construction Protocol: cluster + sample
Candidate pool: ALOV (315 seq.) [Smeulders et al. 2013] + PTR (~50 seq.) [Vojir et al. 2013] + OTB (~50 seq.) [Wu et al. 2013] + >30 new sequences from the VOT2015 committee = 443 sequences
Filtered out:
• Grayscale sequences
• Targets of <400 pixels
• Poorly defined targets
• Artificially created sequences
[Figure: examples of a poorly defined target and of an artificially created sequence]
Result: 356 sequences
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for the 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Pipeline: feature encoding (11-dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences
Construction (3/3): Sampling
• Requirements:
  • Diverse visual attributes
  • Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that samples challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• Bounding boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
(1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (vertical axis: accuracy) with C-COT, TCNN, SSAT and MLDF labelled]
VOT 2016: Tracking speed
• Top performers are the slowest
  • Plausible cause: CNN
• Real-time bound: Staple+
  • Decent accuracy
  • Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ labelled]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics, Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)
Area of expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Classic
KLT tracker
46150
The KLT Tracker
47150Slide credit Kris Kitani
KLT Tracker
slide creditTomas Svoboda
48150
Importance in Computer Vision
bull Firstly published in 1981 as an image registration method [3]
bull Improved many times most importantly by Carlo Tomasi [54]
bull Free implementaton(s) available1
bull After more than two decades a project2 at CMU dedicated to this
bull single algorithm and results published in a premium journal [1]
bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html
Image alignment
slide creditTomas Svoboda 49150
Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image
Original Lucas-Kanade algorithm I
slide creditTomas Svoboda 50150
Original Lucas-Kanade algorithm II
slide creditTomas Svoboda 51150
Original Lucas-Kanade algorithm III
slide creditTomas Svoboda 52150
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
bull cost function sum of squared intensity differences
between template and window
bull optimization technique gradient descent
bull model learning no update last frame convex
combination
bull attractive properties
ndashfast
ndasheasily extended to image-to-image transformations with
multiple parameters
59150
What about deep
learning or
CNN for tracking
60150
61150
AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo
Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016
Recap CaffeNet architecture
63
Recent trackers development
74150httpsgithubcomfoolwoodbenchmark_results
CNN KCF
Discriminative Tracking (T by Detection)
t=0
samples
labels+1 +1 +1 minus1 minus1 minus1
Classifier
tgt0
hellipClassify subwindows to find target
75
Connection to Correlation
The Convolution Theorem
ldquoCross-correlation is equivalent to an
element-wise product in Fourier domainrdquo
bull where
ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858
(likewise for ො119857 and ෝ119856)
ndashtimes is element-wise product
ndash lowast is complex-conjugate (ie negate imaginary part)
119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺
bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)
76
Kernelized Correlation Filters
bull Circulant matrices are a very general tool which allows to replace
standard operations with fast Fourier operations
bull The same idea can by applied eg to the Kernel Ridge Regression
with K kernel matrix Kij = (xi xj) and dual space representation
bull For many kernels circulant data circulant K matrix
bull Diagonalizing with the DFT for learning the classifier yields
120630 = 119870 + 120582119868 minus1119858
119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)
ෝ120630 =ො119858
መ119844119857119857+ 120582
Fast solution in 119978 119899 log 119899 Typical kernel algorithms are
119978 1198992 or higher
rArr
77
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)
Training and detection (Matlab):
  function alphaf = train(x, y, sigma, lambda)
    k = kernel_correlation(x, x, sigma);
    alphaf = fft2(y) ./ (fft2(k) + lambda);
  end
  function y = detect(alphaf, x, z, sigma)
    k = kernel_correlation(z, x, sigma);
    y = real(ifft2(alphaf .* fft2(k)));
  end
  function k = kernel_correlation(x1, x2, sigma)
    % sum over the channel dimension (3) in the kernel computation
    c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
    d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
    k = exp(-1 / sigma^2 * abs(d) / numel(d));
  end
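For readers without Matlab, the three functions can be mirrored in a 1-D, single-channel Python sketch. Assumptions: a naive DFT replaces `fft2`, the signal `x` and the Gaussian-shaped target `y` are made-up test data, and the regularizer value is illustrative. With the conjugation order used on the slide, a target shifted by d samples produces a response peak at index (-d) mod n.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive O(n^2) DFT; a real tracker would use an FFT library instead."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n)) for k in range(n)]
    return [c / n for c in out] if inverse else out

def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of x1 with every cyclic shift of x2."""
    n = len(x1)
    cross = dft([a.conjugate() * b for a, b in zip(dft(x1), dft(x2))], inverse=True)
    energy = sum(v * v for v in x1) + sum(v * v for v in x2)
    return [math.exp(-abs(energy - 2.0 * c.real) / (sigma ** 2 * n)) for c in cross]

def train(x, y, sigma, lam):
    k_hat = dft(kernel_correlation(x, x, sigma))
    return [yh / (kh + lam) for yh, kh in zip(dft(y), k_hat)]

def detect(alpha_hat, x, z, sigma):
    k_hat = dft(kernel_correlation(z, x, sigma))
    return [c.real for c in dft([a * k for a, k in zip(alpha_hat, k_hat)], inverse=True)]

n = 16
x = [0.0, 1.0, 4.0, 2.0, 0.0, -1.0, 3.0, 5.0,
     1.0, 0.0, -2.0, 1.0, 0.0, 2.0, 1.0, -1.0]
# Desired response: a Gaussian bump centred on cyclic index 0.
y = [math.exp(-0.5 * (min(i, n - i) / 1.5) ** 2) for i in range(n)]
alpha_hat = train(x, y, sigma=1.0, lam=1e-4)

shift = 3
z = [x[(t - shift) % n] for t in range(n)]   # target moved 3 samples to the right
response = detect(alpha_hat, x, z, sigma=1.0)
peak = max(range(n), key=lambda i: response[i])
```

Here `peak` equals `(n - shift) % n`, so the displacement is recovered by negating the wrapped peak index.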
From KCF to Discriminative CF trackers
Basic
• Henriques et al. – CSK
  – raw grayscale pixel values as features
• Henriques et al. – KCF
  – HoG multi-channel features
Further work
• Danelljan et al. – DSST
  – PCA-HoG + grayscale pixel features
  – filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
  – HoG, color-naming and grayscale pixel features
  – quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter, on the normalized size
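The scale handling shared by these trackers can be sketched as a small search over a quantized scale set. This is only an illustration: the function names, the number of scales and the step factor are assumptions, and the peak responses are made-up numbers standing in for the filter output at each scale.

```python
def scale_factors(num_scales=7, step=1.05):
    """Geometric scale set step**k, centred on 1.0 (k = -half .. +half)."""
    half = (num_scales - 1) // 2
    return [step ** k for k in range(-half, half + 1)]

def candidate_sizes(width, height, factors):
    """Window sizes to crop; each crop is then resized to one normalized size."""
    return [(round(width * s), round(height * s)) for s in factors]

def best_scale(factors, peak_responses):
    """Pick the scale whose filter response has the highest peak."""
    best = max(range(len(factors)), key=lambda i: peak_responses[i])
    return factors[best]

factors = scale_factors(num_scales=3, step=2.0)
sizes = candidate_sizes(40, 20, factors)
chosen = best_scale(factors, [0.2, 0.5, 0.9])
```

Normalizing every crop to one size is what lets a single filter serve all scales.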
Discriminative Correlation Filters Trackers
• Danelljan et al. – SRDCF
  – spatial regularization in the learning process
    → limits boundary effects
    → penalizes filter coefficients depending on their spatial location
  – allows the use of a much larger search region
  – more discriminative against background (more training data)
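The idea of a location-dependent penalty can be sketched with a quadratic weight map that is small over the target and grows towards the window border. The functional form and the constants `base` and `growth` are assumptions chosen for illustration; the slide does not specify them.

```python
def spatial_penalty(rows, cols, target_h, target_w, base=0.1, growth=3.0):
    """Weight map: low at the window centre, increasing quadratically outwards."""
    cy, cx = (rows - 1) / 2.0, (cols - 1) / 2.0
    return [[base + growth * (((r - cy) / target_h) ** 2 + ((c - cx) / target_w) ** 2)
             for c in range(cols)] for r in range(rows)]

def regularization_cost(filter_coeffs, weights):
    """Sum of squared, spatially weighted filter coefficients."""
    return sum((w * f) ** 2
               for wrow, frow in zip(weights, filter_coeffs)
               for w, f in zip(wrow, frow))

weights = spatial_penalty(5, 5, target_h=2.0, target_w=2.0)

# The same unit coefficient costs far more at a corner than at the centre.
centre_only = [[1.0 if (r, c) == (2, 2) else 0.0 for c in range(5)] for r in range(5)]
corner_only = [[1.0 if (r, c) == (0, 0) else 0.0 for c in range(5)] for r in range(5)]
centre_cost = regularization_cost(centre_only, weights)
corner_cost = regularization_cost(corner_only, weights)
```

Because border coefficients are expensive, the learned filter concentrates on the target even when the training window is much larger.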
CNN-based Correlation Trackers
• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  – features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  – for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. – MDNet, T-CNN
  – CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? “The zero-order tracker”
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
[Figure: timeline of the VOT2013–VOT2016 submission deadlines and the resulting challenge papers]
• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• + VOT-TIR papers (69 and 70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

Challenge | Perf. measures   | Dataset size   | Target box | Property  | Trackers tested
VOT2013   | ranks, A, R      | 16, s. manual  | manual     | per frame | 27
VOT2014   | ranks, A, R, EFO | 25, s. manual  | manual     | per frame | 38
VOT2015   | EAO, A, R, EFO   | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016   | EAO, A, R, EFO   | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
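The A (accuracy) and R (robustness) measures in the table are built on the overlap between predicted and ground-truth boxes. The sketch below is a simplified illustration: it assumes axis-aligned boxes given as (x, y, width, height), counts a failure whenever the overlap drops to zero, and leaves out the re-initialization logic of the real toolkit. The function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_robustness(ground_truth, predicted):
    """Accuracy: mean overlap on tracked frames. Robustness: number of failures."""
    overlaps = [iou(g, p) for g, p in zip(ground_truth, predicted)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures

gt = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
pred = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
accuracy, failures = accuracy_robustness(gt, pred)
```

In this toy sequence the overlaps are 1, 1/3 and 0, so the accuracy is 2/3 with one failure.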
Class of trackers tested
• Single-object, single-camera
• Short-term
  – Trackers performing without re-detection
• Causality
  – Tracker is not allowed to use any future frames
• No prior knowledge about the target
  – Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
• ALOV (315 seq.) [Smeulders et al., 2013] + PTR (~50 seq.) [Vojir et al., 2013] + OTB (~50 seq.) [Wu et al., 2013] + >30 new sequences from the VOT2015 committee → 443 sequences
• Filtered out:
  – grayscale sequences
  – targets of <400 pixels
  – poorly defined targets
  – artificially created sequences
  [Examples shown: poorly defined target; artificially created sequence]
• 356 sequences remain → VOT Automatic Dataset Construction Protocol: cluster + sample
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g., blur, camera motion, object motion)
• Feature encoding: 11-dimensional vector per sequence
• Cluster similar sequences with Affinity Propagation [Frey, Dueck 2007]
• Cluster into K = 28 groups (automatic selection of K)
Construction (3/3): Sampling
• Requirements:
  – diverse visual attributes
  – challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that samples challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
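Step 2 can be illustrated with a brute-force search on a tiny mask. The slide does not state the cost function, so the one used here is an assumption: foreground pixels left outside the box (weighted by `alpha`) plus background pixels inside the box. The function name `fit_box` and the mask are made up for the example.

```python
def fit_box(mask, alpha=1.0):
    """Return (top, left, bottom, right) minimizing the assumed cost on a binary mask."""
    rows, cols = len(mask), len(mask[0])
    total_fg = sum(sum(row) for row in mask)
    best = None
    for top in range(rows):
        for bottom in range(top, rows):
            for left in range(cols):
                for right in range(left, cols):
                    fg_in = sum(mask[r][c]
                                for r in range(top, bottom + 1)
                                for c in range(left, right + 1))
                    area = (bottom - top + 1) * (right - left + 1)
                    # alpha * (foreground outside) + (background inside)
                    cost = alpha * (total_fg - fg_in) + (area - fg_in)
                    if best is None or cost < best[0]:
                        best = (cost, (top, left, bottom, right))
    return best[1]

# A 3x4 blob plus one stray foreground pixel in the bottom-right corner.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
]
balanced_box = fit_box(mask, alpha=1.0)    # ignores the stray pixel
inclusive_box = fit_box(mask, alpha=20.0)  # pays for background to cover it
```

With a small `alpha` the box hugs the main blob; a large `alpha` forces it to grow until every foreground pixel is covered.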
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty – better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
• Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (accuracy axis)]
VOT 2016: Tracking speed
• Top performers are the slowest
  – Plausible cause: CNN
• Real-time bound: Staple+
  – Decent accuracy
  – Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013–2016)
• Documentation, tutorials, forum
Summary
• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  – target learning (modelling, “template update”)
  – integration of detection and temporal smoothness assumptions
  – representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1–3 good students for PhD
THANK YOU
Questions, please?
The KLT Tracker
Slide credit: Kris Kitani
KLT Tracker
Slide credit: Tomas Svoboda
Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU was dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html
Image alignment
Slide credit: Tomas Svoboda
The goal is to align a template image $T(\mathbf{x})$ to an input image $I(\mathbf{x})$, where $\mathbf{x}$ is a column vector containing the image coordinates $[x, y]^\top$. $I(\mathbf{x})$ could also be a small subwindow within an image.
Original Lucas-Kanade algorithm I–IV
Slide credit: Tomas Svoboda
Original Lucas-Kanade summary
Slide credit: Tomas Svoboda
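Since the derivation slides are figures, a compact sketch of the alignment loop may help. The assumptions are strong: a 1-D signal, a pure translation warp with a single parameter p, and continuous analytic "images" so that no interpolation is needed. The names `template`, `image` and `lucas_kanade_1d` are illustrative only.

```python
import math

TRUE_SHIFT = 1.5

def template(x):
    """Template T(x): a smooth bump centred at x = 10."""
    return math.exp(-((x - 10.0) ** 2) / (2.0 * 3.0 ** 2))

def image(x):
    """Input image I(x): the template displaced by TRUE_SHIFT."""
    return template(x - TRUE_SHIFT)

def lucas_kanade_1d(tmpl, img, xs, p=0.0, iters=20, eps=1e-3):
    """Gauss-Newton estimate of p minimizing sum over x of (I(x + p) - T(x))^2."""
    for _ in range(iters):
        num = den = 0.0
        for x in xs:
            # Image gradient at the warped position (central difference).
            grad = (img(x + p + eps) - img(x + p - eps)) / (2.0 * eps)
            err = tmpl(x) - img(x + p)
            num += grad * err
            den += grad * grad
        if den == 0.0:
            break  # no gradient information at all (uniform signal)
        dp = num / den
        p += dp
        if abs(dp) < 1e-9:
            break
    return p

xs = [float(i) for i in range(21)]
estimated = lucas_kanade_1d(template, image, xs)
```

The denominator is the 1-D counterpart of the matrix whose conditioning decides whether a patch can be tracked at all.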
KLT Tracker
For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]
Slide credit: Patrick Perez
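The three cases can be told apart by the smaller eigenvalue of the 2x2 gradient matrix accumulated over the patch. The sketch below assumes central-difference gradients on interior pixels and tiny synthetic patches; `min_eigenvalue` is an illustrative name.

```python
import math

def min_eigenvalue(patch):
    """Smaller eigenvalue of the sum of [Ix^2, Ix*Iy; Ix*Iy, Iy^2] over the patch."""
    a = b = c = 0.0
    for r in range(1, len(patch) - 1):
        for col in range(1, len(patch[0]) - 1):
            ix = (patch[r][col + 1] - patch[r][col - 1]) / 2.0
            iy = (patch[r + 1][col] - patch[r - 1][col]) / 2.0
            a += ix * ix
            b += ix * iy
            c += iy * iy
    return (a + c) / 2.0 - math.sqrt(((a - c) / 2.0) ** 2 + b * b)

uniform = [[5.0] * 6 for _ in range(6)]
edge = [[0.0] * 3 + [1.0] * 3 for _ in range(6)]            # vertical step edge
corner = [[1.0 if (r >= 3 and c >= 3) else 0.0 for c in range(6)]
          for r in range(6)]

lam_uniform = min_eigenvalue(uniform)   # no gradient at all
lam_edge = min_eigenvalue(edge)         # gradient in one direction only
lam_corner = min_eigenvalue(corner)     # gradients in both directions
```

Only the corner patch has a clearly positive smaller eigenvalue, which is why feature selection for KLT thresholds this quantity.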
Aperture Problem: Example
Image from Gary Bradski's slides
If we look through a small hole…
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
– Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
– Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
– Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Also compare with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
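The first option amounts to a threshold test on the residual after alignment. A minimal sketch, assuming patches stored as nested lists and a hand-picked threshold; the names `ssd` and `keep_tracklet` and the threshold value are hypothetical.

```python
def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(patch_a, patch_b)
               for a, b in zip(row_a, row_b))

def keep_tracklet(template, window, max_ssd):
    """Keep tracking only while the aligned window still resembles the template."""
    return ssd(template, window) <= max_ssd

template = [[1.0, 2.0], [3.0, 4.0]]
good_window = [[1.0, 2.0], [3.0, 5.0]]      # SSD = 1
drifted_window = [[1.0, 2.0], [3.0, 8.0]]   # SSD = 16

keep_good = keep_tracklet(template, good_window, max_ssd=4.0)
keep_drifted = keep_tracklet(template, drifted_window, max_ssd=4.0)
```

Comparing against the initial patch as well, as in the second option, guards against slow drift that a frame-to-frame test would miss.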
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  – fast
  – easily extended to image-to-image transformations with multiple parameters
What about deep learning or CNN for tracking?
Recap: CaffeNet architecture
AlexNet (original): Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, 2012
CaffeNet: Jia et al., “Caffe: Convolutional Architecture for Fast Feature Embedding”, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016
Recent trackers development
https://github.com/foolwood/benchmark_results
[Figure labels: CNN, KCF]
Discriminative Tracking (Tracking by Detection)
• t = 0: samples with labels (+1 +1 +1 −1 −1 −1) → train the classifier
• t > 0: classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
“Cross-correlation is equivalent to an element-wise product in the Fourier domain”
$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^* \times \hat{\mathbf{w}}$$
• where
  – $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  – $\times$ is the element-wise product
  – $*$ is the complex conjugate (i.e., negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
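The identity can be verified on a short signal. This sketch assumes real 1-D inputs and uses a naive DFT so it runs without extra libraries; the cyclic cross-correlation is taken as y[s] = sum over t of x[t] * w[(t + s) mod n], which matches the conjugate on the x term. All function names are illustrative.

```python
import cmath

def dft(values, inverse=False):
    """Naive O(n^2) DFT, enough for a small numerical check."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n)) for k in range(n)]
    return [c / n for c in out] if inverse else out

def circular_cross_correlation(x, w):
    """Direct evaluation; indices wrap around, as noted on the slide."""
    n = len(x)
    return [sum(x[t] * w[(t + s) % n] for t in range(n)) for s in range(n)]

def cross_correlation_fourier(x, w):
    """Same result via conj(x_hat) times w_hat, followed by the inverse DFT."""
    y_hat = [a.conjugate() * b for a, b in zip(dft(x), dft(w))]
    return [c.real for c in dft(y_hat, inverse=True)]

x = [1.0, 2.0, 0.0, -1.0, 3.0]
w = [0.5, -1.0, 2.0, 0.0, 1.0]
direct = circular_cross_correlation(x, w)
fast = cross_correlation_fourier(x, w)
max_err = max(abs(a - b) for a, b in zip(direct, fast))
```

With an FFT in place of the naive transform, the Fourier route costs O(n log n) instead of O(n^2) for the direct sum.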
KLT Tracker
slide creditTomas Svoboda
48150
Importance in Computer Vision
bull Firstly published in 1981 as an image registration method [3]
bull Improved many times most importantly by Carlo Tomasi [54]
bull Free implementaton(s) available1
bull After more than two decades a project2 at CMU dedicated to this
bull single algorithm and results published in a premium journal [1]
bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html
Image alignment
slide creditTomas Svoboda 49150
Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image
Original Lucas-Kanade algorithm I
slide creditTomas Svoboda 50150
Original Lucas-Kanade algorithm II
slide creditTomas Svoboda 51150
Original Lucas-Kanade algorithm III
slide creditTomas Svoboda 52150
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
bull cost function sum of squared intensity differences
between template and window
bull optimization technique gradient descent
bull model learning no update last frame convex
combination
bull attractive properties
ndashfast
ndasheasily extended to image-to-image transformations with
multiple parameters
59150
What about deep
learning or
CNN for tracking
60150
61150
AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo
Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016
Recap CaffeNet architecture
63
Recent trackers development
74150httpsgithubcomfoolwoodbenchmark_results
CNN KCF
Discriminative Tracking (T by Detection)
t=0
samples
labels+1 +1 +1 minus1 minus1 minus1
Classifier
tgt0
hellipClassify subwindows to find target
75
Connection to Correlation
The Convolution Theorem
ldquoCross-correlation is equivalent to an
element-wise product in Fourier domainrdquo
bull where
ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858
(likewise for ො119857 and ෝ119856)
ndashtimes is element-wise product
ndash lowast is complex-conjugate (ie negate imaginary part)
119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺
bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)
76
Kernelized Correlation Filters
bull Circulant matrices are a very general tool which allows to replace
standard operations with fast Fourier operations
bull The same idea can by applied eg to the Kernel Ridge Regression
with K kernel matrix Kij = (xi xj) and dual space representation
bull For many kernels circulant data circulant K matrix
bull Diagonalizing with the DFT for learning the classifier yields
120630 = 119870 + 120582119868 minus1119858
119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)
ෝ120630 =ො119858
መ119844119857119857+ 120582
Fast solution in 119978 119899 log 119899 Typical kernel algorithms are
119978 1198992 or higher
rArr
77
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
bull very few
hyperparameters
bull code fits on one slide
of the presentation
bull Use HoG features
(32 channels)
bull ~300 FPS
bull Open-Source
(MatlabPythonJavaC)
function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)
end
function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))
end
function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))
end
Training and detection (Matlab)
Sum over channel dimensionin kernel computation
79
From KCF to Discriminative CF trackers
Basic
bull Henriques et al ndash CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
bull Danelljan et al ndashSRDCF
ndash spatial regularization in the learning process
rarr limits boundary effect
rarr penalize filter coefficients depending on their spatial location
ndash allows to use much larger search region
ndash more discriminative to background (more training data)
CNN-based Correlation Trackers
bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT
2016)
bull Ma et al
ndash features VGG-Net pretrained on ImageNet dataset extracted from
third fourth and fifth convolution layer
ndash for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
bull Nam et al ndash MDNet T-CNN
ndash CNN classification (3 convolution layers and 2 fully connected layers)
81
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT challenge evolution
bull Gradual increase of dataset size
bull Gradual refinement of dataset construction
bull Gradual refinement of performance measures
bull Gradual increase of tested trackers
Perf Measures Dataset size Target box Property Trackers tested
VOT2013 ranks A R 16 s manual manual per frame 27
VOT2014 ranks A R EFO 25 s manual manual per frame 38
VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR
VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR
8739
Class of trackers tested
bull Single-object single-camera
bull Short-term
ndashTrackers performing without re-detection
bull Causality
ndashTracker is not allowed to use any future frames
bull No prior knowledge about the target
ndashOnly a single training example ndash BBox in the first frame
bull Object state encoded by a bounding box
88150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (13) Sequence candidates
ALOV (315 seq)[Smeulders et al2013]
Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences
Example Poorly defined target Example Artificially created
356 sequences
PTR (~50 seq)[Vojir et al2013]
+OTB (~50 seq)
[Wu et al2013]
+
gt30 new sequencesfrom VOT2015
committee
+
443sequences
VOT Automatic Dataset Construction Protocol
cluster + sample
8939
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (23) Clustering
bull Approximately annotate targets
bull 11 global attributes estimated
automatically for 356 sequences (eg blur camera motion object motion)
bull Cluster into K = 28 groups (automatic selection of K)
Feature encoding
11 dim
Affinity Propagation [Frey Dueck 2007]
Cluster similar sequences
9039
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (33) Sampling
bull Requirement
bull Diverse visual attributes
bull Challenging subset
bull Global visual attributes computed
bull Tracking difficulty attribute Applied FoT ASMS KCF trackers
bull Developed a sampling strategy that sampled
challenging sequences while keeping the global
attributes diverse
9139
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT201516 dataset 60 sequences
9239
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Object annotation
bull Automatic bounding box placement
1 Segment the target (semi-automatic)
2 Automatically fit a bounding box by optimizing a cost function
9339
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• Bounding boxes automatically generated from the segmentation
VOT results: real-time trackers
• Flow-based, Mean Shift-based, correlation filters
• Engineering use of basic features
2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot with accuracy on the vertical axis; labelled trackers: C-COT, TCNN, SSAT, MLDF]
VOT 2016 Tracking speed
• Top performers are the slowest
  - Plausible cause: CNN
• Real-time bound: Staple+
  - Decent accuracy
  - Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it also measured the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: speed plot; labelled trackers: C-COT, TCNN, SSAT, MLDF, Staple+]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers (code/binaries)
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter School course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University Prague
established 1707
Area of expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & excellence: significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Image alignment
Slide credit: Tomas Svoboda
The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) can also be a small subwindow within an image.
Original Lucas-Kanade algorithm I-IV and summary
[Slides: derivation shown as images only]
Slide credit: Tomas Svoboda
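As an illustration of the alignment idea, here is a minimal sketch for the simplest possible warp, a 1-D translation, solved with Gauss-Newton updates on the sum of squared differences. It is not the derivation from the slides; the helper names (`sample`, `lucas_kanade_1d`, `gaussian_bump`) and the synthetic signal are assumptions made for the example.

```python
import math


def sample(signal, x):
    """Linearly interpolate a 1-D signal at a real-valued position (clamped)."""
    x = min(max(x, 0.0), len(signal) - 1.0)
    i = min(int(x), len(signal) - 2)
    t = x - i
    return (1 - t) * signal[i] + t * signal[i + 1]


def lucas_kanade_1d(template, image, x0, p=0.0, iters=20, eps=1e-6):
    """Estimate the shift p such that image[x0 + x + p] ~= template[x]."""
    for _ in range(iters):
        num = den = 0.0
        for x, t_val in enumerate(template):
            pos = x0 + x + p
            warped = sample(image, pos)
            # Central-difference gradient of the warped image.
            grad = 0.5 * (sample(image, pos + 1) - sample(image, pos - 1))
            num += grad * (t_val - warped)
            den += grad * grad
        if den < 1e-12:      # untextured window: the update is ill-conditioned
            break
        dp = num / den       # Gauss-Newton step for a pure translation
        p += dp
        if abs(dp) < eps:
            break
    return p


def gaussian_bump(x, center=32.0, width=4.0):
    return math.exp(-0.5 * ((x - center) / width) ** 2)


image = [gaussian_bump(x) for x in range(64)]
true_shift = 2.3
x0 = 20
# Template = the image content as seen 2.3 samples to the right of x0.
template = [gaussian_bump(x0 + x + true_shift) for x in range(21)]
estimate = lucas_kanade_1d(template, image, x0)
print(round(estimate, 2))  # should be close to true_shift
```

The denominator `den` is the 1-D counterpart of the matrix that must be well conditioned in 2-D, which is why a flat window stops the iteration.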
KLT Tracker
For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
Slide credit: Patrick Perez
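The three cases can be checked numerically with the smaller eigenvalue of the 2x2 gradient (structure) tensor, which is one common way to score how well conditioned a patch is. This is a sketch with synthetic 7x7 patches; the function name and the patches are illustrative assumptions.

```python
import math


def min_eigenvalue(patch):
    """Smaller eigenvalue of the 2x2 structure tensor summed over a patch."""
    gxx = gxy = gyy = 0.0
    for r in range(1, len(patch) - 1):
        for c in range(1, len(patch[0]) - 1):
            gx = 0.5 * (patch[r][c + 1] - patch[r][c - 1])
            gy = 0.5 * (patch[r + 1][c] - patch[r - 1][c])
            gxx += gx * gx
            gxy += gx * gy
            gyy += gy * gy
    trace = gxx + gyy
    det = gxx * gyy - gxy * gxy
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    return 0.5 * (trace - math.sqrt(max(trace * trace - 4.0 * det, 0.0)))


size = 7
uniform = [[1.0] * size for _ in range(size)]
edge = [[1.0 if c >= 3 else 0.0 for c in range(size)] for _ in range(size)]
corner = [[1.0 if (r >= 3 and c >= 3) else 0.0 for c in range(size)]
          for r in range(size)]

for name, patch in [("uniform", uniform), ("edge", edge), ("corner", corner)]:
    print(name, min_eigenvalue(patch))
```

The uniform patch and the straight edge both give a zero smaller eigenvalue (no information, or information in one direction only), while the corner gives a clearly positive value.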
Aperture problem example
Image from Gary Bradski's slides
If we look through a small hole...
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
- Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
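A minimal sketch of the first option, assuming each tracklet keeps its initial patch and its current (already aligned) patch. The dictionary layout, the function names and the threshold `max_ssd` are hypothetical, and the affine warp of the second option is not implemented here.

```python
def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(patch_a, patch_b)
               for a, b in zip(row_a, row_b))


def prune_tracklets(tracklets, max_ssd):
    """Keep only tracklets whose current patch still matches the initial one."""
    return [t for t in tracklets
            if ssd(t["initial_patch"], t["current_patch"]) <= max_ssd]


tracklets = [
    {"id": "a",
     "initial_patch": [[10, 20], [30, 40]],
     "current_patch": [[11, 19], [30, 41]]},   # small residual: keep
    {"id": "b",
     "initial_patch": [[10, 20], [30, 40]],
     "current_patch": [[90, 80], [5, 0]]},     # drifted or occluded: kill
]
kept = prune_tracklets(tracklets, max_ssd=50)
print([t["id"] for t in kept])
```

Comparing against the initial patch rather than only the previous frame is what catches slow drift, since frame-to-frame residuals can stay small while the tracklet wanders off the original point.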
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters
What about deep learning or CNNs for tracking?
Recap: CaffeNet architecture
AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012
CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016
Recent trackers development
https://github.com/foolwood/benchmark_results
[Figure: overview of recent trackers; labelled families: CNN, KCF]
Discriminative Tracking (Tracking by Detection)
t = 0: samples with labels (+1, +1, +1, −1, −1, −1) → train the classifier
t > 0: classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
"Cross-correlation is equivalent to an element-wise product in the Fourier domain":
$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$
• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
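The identity can be checked numerically in 1-D. The sketch below uses a naive O(n²) DFT from the standard library so that it stays self-contained; a real tracker would call an FFT routine. The function names and test vectors are illustrative.

```python
import cmath


def dft(v, inverse=False):
    """Naive discrete Fourier transform (the inverse includes the 1/n factor)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n)
               for m in range(n))
           for k in range(n)]
    return [o / n for o in out] if inverse else out


def cyclic_cross_correlation(x, w):
    """Direct definition: y[i] = sum_m x[m] * w[(i + m) mod n]."""
    n = len(x)
    return [sum(x[m] * w[(i + m) % n] for m in range(n)) for i in range(n)]


def correlation_via_dft(x, w):
    """Same result from y_hat = conj(x_hat) * w_hat (element-wise)."""
    xf, wf = dft(x), dft(w)
    yf = [a.conjugate() * b for a, b in zip(xf, wf)]
    return [val.real for val in dft(yf, inverse=True)]


x = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
w = [2.0, 7.0, 1.0, 8.0, 2.0, 8.0, 1.0, 8.0]
direct = cyclic_cross_correlation(x, w)
fourier = correlation_via_dft(x, w)
max_diff = max(abs(a - b) for a, b in zip(direct, fourier))
print(max_diff)  # tiny: only floating-point rounding remains
```

The index `(i + m) % n` in the direct version is the cyclic wrap-around mentioned in the note.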
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied, e.g., to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual-space representation
$$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$$
• For many kernels: circulant data ⇒ circulant K matrix,
$$K = C(\mathbf{k}^{\mathbf{xx}})$$
  where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
$$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$$
⇒ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
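A small 1-D check of the closed form: the Fourier-domain solution should satisfy the original linear system $(K + \lambda I)\boldsymbol{\alpha} = \mathbf{y}$ with $K$ built explicitly from all cyclic shifts. This is a sketch, not the authors' code; it uses a naive DFT (so it is not $\mathcal{O}(n \log n)$) and assumes a Gaussian kernel of the form $\exp(-\|\mathbf{a}-\mathbf{b}\|^2/\sigma^2)$. All names and values are illustrative.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive discrete Fourier transform (the inverse includes the 1/n factor)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n)
               for m in range(n))
           for k in range(n)]
    return [o / n for o in out] if inverse else out


def shift(x, m):
    """Cyclic shift of x to the right by m positions."""
    return x[-m:] + x[:-m]


def gauss(a, b, sigma):
    return math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)) / sigma ** 2)


def train(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat + lambda), with k the first row of K."""
    n = len(x)
    kxx = [gauss(x, shift(x, m), sigma) for m in range(n)]
    kf, yf = dft(kxx), dft(y)
    alphaf = [b / (a + lam) for a, b in zip(kf, yf)]
    return [v.real for v in dft(alphaf, inverse=True)], kxx


def max_residual(y, alpha, kxx, lam):
    """Largest entry of |(K + lam*I) alpha - y| with K[i][j] = kxx[(j - i) mod n]."""
    n = len(y)
    return max(abs(sum(kxx[(j - i) % n] * alpha[j] for j in range(n))
                   + lam * alpha[i] - y[i])
               for i in range(n))


x = [0.0, 1.0, 3.0, 2.0, 0.5, -1.0, 0.0, 1.5]
n = len(x)
y = [math.exp(-0.5 * min(i, n - i) ** 2) for i in range(n)]  # peak at shift 0
alpha, kxx = train(x, y, sigma=2.0, lam=1e-2)
residual = max_residual(y, alpha, kxx, lam=1e-2)
print(residual)  # tiny: only floating-point rounding remains
```

The explicit matrix is only formed inside `max_residual` for verification; training itself touches nothing larger than a length-n vector, which is the point of the slide.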
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$, with elements
$$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$$
• For the Gaussian kernel it yields
$$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$$
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}}\right)$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features: multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the term $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$
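To make the evaluation step concrete, here is a 1-D, single-channel sketch that trains on a signal and then detects a cyclically shifted copy of it. It mirrors the Fourier-domain operations of the formulas above but uses a naive DFT from the standard library. The test signal and the parameter values are assumptions, and the kernel exponent is additionally divided by the number of elements, as is commonly done in KCF implementations.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive discrete Fourier transform (the inverse includes the 1/n factor)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[m] * cmath.exp(sign * 2j * cmath.pi * m * k / n)
               for m in range(n))
           for k in range(n)]
    return [o / n for o in out] if inverse else out


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x1 and every cyclic shift of x2."""
    n = len(x1)
    prod = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    c = [v.real for v in dft(prod, inverse=True)]
    sq = sum(v * v for v in x1) + sum(v * v for v in x2)
    return [math.exp(-abs(sq - 2.0 * ci) / (sigma ** 2 * n)) for ci in c]


def train(x, y, sigma, lam):
    kf = dft(kernel_correlation(x, x, sigma))
    return [b / (a + lam) for a, b in zip(kf, dft(y))]


def detect(alphaf, x, z, sigma):
    kf = dft(kernel_correlation(z, x, sigma))
    return [v.real for v in dft([a * k for a, k in zip(alphaf, kf)],
                                inverse=True)]


n = 16
x = [math.sin(0.7 * j) + 0.5 * math.cos(1.9 * j) for j in range(n)]
y = [math.exp(-0.5 * (min(i, n - i) / 1.5) ** 2) for i in range(n)]  # peak at 0
alphaf = train(x, y, sigma=0.5, lam=1e-4)

true_shift = 3
z = x[-true_shift:] + x[:-true_shift]      # x moved right by 3 samples (cyclic)
response = detect(alphaf, x, z, sigma=0.5)
peak = max(range(n), key=response.__getitem__)
# With conj() applied to the first argument, a right shift by d puts the
# response peak at index (-d) mod n, so the sign is flipped when reading it.
estimated_shift = (-peak) % n
print(peak, estimated_shift)
```

Which sign the peak index carries depends on which argument is conjugated in `kernel_correlation`; implementations differ on this, so it is worth checking with a synthetic shift like the one above.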
Kernelized Correlation Filters
KCF Tracker
• Very few hyperparameters
• Code fits on one slide of the presentation
• Uses HoG features (32 channels)
• ~300 FPS
• Open source (Matlab/Python/Java/C)
Training and detection (Matlab):
    function alphaf = train(x, y, sigma, lambda)
        % Kernel ridge regression solved element-wise in the Fourier domain.
        k = kernel_correlation(x, x, sigma);
        alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
        % Response map over all cyclic shifts of the candidate patch z.
        k = kernel_correlation(z, x, sigma);
        y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
        % Gaussian kernel correlation; sum(..., 3) sums over the channel dimension.
        c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
        d = x1(:)' * x1(:) + x2(:)' * x2(:) - 2 * c;
        k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
From KCF to Discriminative CF trackers
Basic
• Henriques et al. - CSK
  - raw grayscale pixel values as features
• Henriques et al. - KCF
  - HoG multi-channel features
Further work
• Danelljan et al. - DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al. - SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows the use of a much larger search region
  - more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: which methods work?
What works? "The zero-order tracker"
VOT community evolution
• VOT2013 (ICCV 2013): 51 coauthors, 14 pages
• VOT2014 (ECCV 2014): 57 coauthors, 27 pages
• VOT2015 (ICCV 2015): 128 coauthors, 24 pages, + VOT-TIR paper (69 coauthors)
• VOT2016 (ECCV 2016): 141 coauthors, 44 pages, + VOT-TIR paper (70 coauthors)
[Figure: activity plot with axis ticks at 1500 and 3000; the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines are marked]
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
Challenge | Perf. measures | Dataset size | Target box | Property | Trackers tested
VOT2013 | ranks, A, R | 16, s. manual | manual | per frame | 27
VOT2014 | ranks, A, R, EFO | 25, s. manual | manual | per frame | 38
VOT2015 | EAO, A, R, EFO | 60, fully auto | manual | per frame | 62 VOT, 24 VOT-TIR
VOT2016 | EAO, A, R, EFO | 60, fully auto | auto | per frame | 70 VOT, 24 VOT-TIR
Class of trackers tested
• Single-object, single-camera
• Short-term
  - Trackers performing without re-detection
• Causality
  - Tracker is not allowed to use any future frames
• No prior knowledge about the target
  - Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
ALOV (315 seq) [Smeulders et al. 2013]
Original Lucas-Kanade algorithm II
slide creditTomas Svoboda 51150
Original Lucas-Kanade algorithm III
slide creditTomas Svoboda 52150
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
bull cost function sum of squared intensity differences
between template and window
bull optimization technique gradient descent
bull model learning no update last frame convex
combination
bull attractive properties
ndashfast
ndasheasily extended to image-to-image transformations with
multiple parameters
59150
What about deep
learning or
CNN for tracking
60150
61150
AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo
Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016
Recap CaffeNet architecture
63
Recent trackers development
74150httpsgithubcomfoolwoodbenchmark_results
CNN KCF
Discriminative Tracking (T by Detection)
t=0
samples
labels+1 +1 +1 minus1 minus1 minus1
Classifier
tgt0
hellipClassify subwindows to find target
75
Connection to Correlation
The Convolution Theorem
ldquoCross-correlation is equivalent to an
element-wise product in Fourier domainrdquo
bull where
ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858
(likewise for ො119857 and ෝ119856)
ndashtimes is element-wise product
ndash lowast is complex-conjugate (ie negate imaginary part)
119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺
bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)
76
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied, e.g., to Kernel Ridge Regression, with kernel matrix $K$, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and dual-space representation $\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y}$
• For many kernels, circulant data $\Rightarrow$ circulant $K$ matrix: $K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields $\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$
• Fast solution in $\mathcal{O}(n \log n)$; typical kernel algorithms are $\mathcal{O}(n^2)$ or higher
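The Fourier-domain solution can be compared against the explicit linear system. This is a sketch under stated assumptions: a 1-D signal, a linear kernel (so the kernel auto-correlation is the plain circular auto-correlation), and a naive DFT in place of an FFT. The names `autocorrelation`, `train` and `residual` are illustrative.

```python
import cmath

def dft(v, inverse=False):
    """Naive O(n^2) DFT, enough for a short 1-D demo."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [c / n for c in out] if inverse else out

def autocorrelation(x):
    """Linear-kernel auto-correlation k[m] = sum_t x[t] * x[(t + m) mod n]."""
    n = len(x)
    return [sum(x[t] * x[(t + m) % n] for t in range(n)) for m in range(n)]

def train(x, y, lam):
    """Dual coefficients from alpha_hat = y_hat / (k_hat + lambda)."""
    kf = dft(autocorrelation(x))
    yf = dft(y)
    alphaf = [b / (a + lam) for a, b in zip(kf, yf)]
    return [c.real for c in dft(alphaf, inverse=True)]

x = [0.2, 1.0, 0.4, -0.3, 0.0, 0.7, -0.5, 0.1]
y = [1.0, 0.6, 0.1, 0.0, 0.0, 0.0, 0.1, 0.6]   # regression target peaked at shift 0
lam = 1e-2
alpha = train(x, y, lam)

# Check against the explicit system (K + lambda * I) alpha = y, where
# K[i][j] = k[(j - i) mod n] is the circulant kernel matrix C(k).
k = autocorrelation(x)
n = len(x)
residual = max(
    abs(sum(k[(j - i) % n] * alpha[j] for j in range(n)) + lam * alpha[i] - y[i])
    for i in range(n)
)
print("largest residual of (K + lambda I) alpha - y:", residual)
```

The explicit matrix is built only for the check. Training itself never forms $K$, which is where the $\mathcal{O}(n \log n)$ cost comes from once the naive DFT is swapped for an FFT.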
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$: $k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$
• For the Gaussian kernel it yields $\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features: multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term
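The detection and update steps can be tried on a toy problem. This is a rough sketch, assuming a single-channel 1-D signal, a naive DFT, and a target that moves by a pure cyclic shift. The division by `n` in the kernel follows the `numel(d)` normalisation of the Matlab snippet on the KCF Tracker slide rather than the formula above, and all names and parameter values are illustrative.

```python
import cmath
import math

def dft(v, inverse=False):
    """Naive O(n^2) DFT, enough for a short 1-D demo."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [c / n for c in out] if inverse else out

def gaussian_correlation(x1, x2, sigma):
    """Gaussian kernel between x1 and every cyclic shift of x2."""
    n = len(x1)
    cf = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    c = [v.real for v in dft(cf, inverse=True)]          # circular cross-correlation
    energy = sum(v * v for v in x1) + sum(v * v for v in x2)
    return [math.exp(-abs(energy - 2.0 * cm) / (sigma ** 2 * n)) for cm in c]

def train(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat^{xx} + lambda)."""
    kf = dft(gaussian_correlation(x, x, sigma))
    return [b / (a + lam) for a, b in zip(kf, dft(y))]

def detect(alphaf, x, z, sigma):
    """Response f(z) = IDFT(k_hat^{xz} * alpha_hat) over all cyclic shifts."""
    kf = dft(gaussian_correlation(x, z, sigma))
    return [v.real for v in dft([a * b for a, b in zip(alphaf, kf)], inverse=True)]

def interpolate(old, new, rate):
    """Linear-interpolation model update: (1 - rate) * old + rate * new."""
    return [(1.0 - rate) * o + rate * v for o, v in zip(old, new)]

n, shift = 16, 3
x = [math.sin(0.9 * t) + 0.5 * math.cos(2.3 * t) + (1.0 if t == 5 else 0.0)
     for t in range(n)]
y = [math.exp(-min(t, n - t) ** 2 / (2 * 1.5 ** 2)) for t in range(n)]  # peak at shift 0
z = [x[(t - shift) % n] for t in range(n)]                               # target moved by +3

alphaf = train(x, y, sigma=0.5, lam=1e-3)
response = detect(alphaf, x, z, sigma=0.5)
peak = max(range(n), key=lambda m: response[m])
print("detected shift:", peak)

# Re-centre the new patch on the detected location, then blend old and new models.
z_centred = [z[(t + peak) % n] for t in range(n)]
x_model = interpolate(x, z_centred, rate=0.02)
alphaf_model = interpolate(alphaf, train(z_centred, y, 0.5, 1e-3), rate=0.02)
```

The location of the response maximum gives the displacement of the target, and the interpolation rate controls how quickly the model forgets older appearances.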
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)
Training and detection (Matlab):
    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end
    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end
    function k = kernel_correlation(x1, x2, sigma)
      % sum over the channel dimension (3) in the kernel computation
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
From KCF to Discriminative CF trackers
Basic
• Henriques et al. - CSK
  - raw grayscale pixel values as features
• Henriques et al. - KCF
  - HoG multi-channel features
Further work
• Danelljan et al. - DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al. - SAMF
  - HoG, color-naming and grayscale pixel features
  - quantizes the scale space and normalizes each scale to one size by bilinear interpolation → only one filter, on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows a much larger search region to be used
  - more discriminative against background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? "The zero-order tracker"
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• + VOT-TIR papers: 69 coauthors and 70 coauthors
[Figure: timeline with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked; axis ticks at 1500 and 3000]
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
         | Perf. measures   | Dataset size   | Target box | Property  | Trackers tested
VOT2013  | ranks, A, R      | 16, s. manual  | manual     | per frame | 27
VOT2014  | ranks, A, R, EFO | 25, s. manual  | manual     | per frame | 38
VOT2015  | EAO, A, R, EFO   | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016  | EAO, A, R, EFO   | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
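The table abbreviates the measures without defining them. Assuming the usual VOT reading, where A (accuracy) is the average overlap between the predicted and ground-truth boxes and R (robustness) counts tracking failures, a simplified sketch could look as follows. It ignores the re-initialisation and burn-in handling of the real VOT toolkit, and the box format `(x, y, width, height)` and all function names are assumptions made for illustration.

```python
def overlap(a, b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over frames with non-zero overlap, plus the number of failures."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0]
    failures = len(overlaps) - len(tracked)      # zero overlap counts as a failure
    acc = sum(tracked) / len(tracked) if tracked else 0.0
    return acc, failures

ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
acc, failures = accuracy_and_failures(predicted, ground_truth)
print("accuracy:", acc, "failures:", failures)
```

In this toy run the second frame overlaps by one third and the third frame misses the target entirely, so it is counted as a failure rather than averaged into the accuracy.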
Class of trackers tested
• Single-object, single-camera
• Short-term
  - trackers performing without re-detection
• Causality
  - the tracker is not allowed to use any future frames
• No prior knowledge about the target
  - only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
• Sources: ALOV (315 seq.) [Smeulders et al., 2013] + PTR (~50 seq.) [Vojir et al., 2013] + OTB (~50 seq.) [Wu et al., 2013] + >30 new sequences from the VOT2015 committee = 443 sequences
• Filtered out:
  - grayscale sequences
  - targets of <400 pixels
  - poorly defined targets
  - artificially created sequences
• 356 sequences remain
[Figure: example of a poorly defined target; example of an artificially created sequence]
VOT Automatic Dataset Construction Protocol: cluster + sample
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Feature encoding: 11-dimensional
• Cluster similar sequences into K = 28 groups (automatic selection of K) with Affinity Propagation [Frey, Dueck 2007]
Construction (3/3): Sampling
• Requirements:
  - diverse visual attributes
  - challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that samples challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
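The slide does not state the cost function. The sketch below uses an assumed stand-in: a weighted count of foreground pixels left outside the box plus background pixels captured inside it, minimised by exhaustive search over axis-aligned boxes. The weight `alpha`, the box encoding and the function names are all hypothetical, and this should not be read as the actual VOT annotation procedure.

```python
def box_cost(mask, box, alpha):
    """alpha * (#foreground pixels outside the box) + (#background pixels inside)."""
    top, left, bottom, right = box            # inclusive pixel bounds
    outside_fg = inside_bg = 0
    for r, row in enumerate(mask):
        for c, v in enumerate(row):
            inside = top <= r <= bottom and left <= c <= right
            if v and not inside:
                outside_fg += 1
            elif not v and inside:
                inside_bg += 1
    return alpha * outside_fg + inside_bg

def fit_box(mask, alpha=1.0):
    """Exhaustive search over all axis-aligned boxes (fine for tiny masks only)."""
    h, w = len(mask), len(mask[0])
    candidates = ((t, l, b, r) for t in range(h) for b in range(t, h)
                  for l in range(w) for r in range(l, w))
    return min(candidates, key=lambda box: box_cost(mask, box, alpha))

mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1],   # one stray foreground pixel at the right edge
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
tight = fit_box(mask, alpha=1.0)     # ignores the stray pixel
loose = fit_box(mask, alpha=10.0)    # pays for background to include it
print("alpha = 1:", tight, " alpha = 10:", loose)
```

The weight decides whether stray segmentation pixels are worth enlarging the box for, which is the kind of trade-off a fitting cost has to encode.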
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty - better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
• Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (axis: accuracy)]
VOT 2016: Tracking speed
• Top performers are the slowest
  - plausible cause: CNN
• Real-time bound: Staple+
  - decent accuracy
  - decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked]
VOT public resources
• Resources publicly available: VOT page
  - raw results of all tested trackers
  - relevant methodology papers
  - 2016: submitted trackers' code/binaries
  - all fully annotated datasets (2013-2016)
  - documentation, tutorials, forum
Summary
• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)
• Area of expertise: Computer Vision, Image Processing, Pattern Recognition
• People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
• Impact & excellence: significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Original Lucas-Kanade algorithm III
Original Lucas-Kanade algorithm IV
Original Lucas-Kanade summary
Slide credit: Tomas Svoboda
KLT Tracker
For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
Slide credit: Patrick Perez
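The three cases can be told apart by the eigenvalues of the 2x2 gradient structure tensor summed over the patch, which is the matrix whose conditioning the Lucas-Kanade step depends on. This is a minimal sketch, assuming central-difference gradients and tiny synthetic patches; the function name and the test patches are illustrative only.

```python
import math

def structure_tensor_eigenvalues(patch):
    """Eigenvalues (small, large) of the patch sum of [Ix^2 IxIy; IxIy Iy^2]."""
    h, w = len(patch), len(patch[0])
    sxx = sxy = syy = 0.0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            ix = (patch[r][c + 1] - patch[r][c - 1]) / 2.0   # central differences
            iy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
    mean = (sxx + syy) / 2.0
    radius = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    return mean - radius, mean + radius

size = 7
uniform = [[5.0] * size for _ in range(size)]
edge = [[0.0 if c < size // 2 else 1.0 for c in range(size)] for _ in range(size)]
corner = [[1.0 if (r >= size // 2 and c >= size // 2) else 0.0 for c in range(size)]
          for r in range(size)]

for name, patch in [("uniform", uniform), ("edge", edge), ("corner", corner)]:
    small, large = structure_tensor_eigenvalues(patch)
    print(f"{name}: min eigenvalue = {small:.3f}, max eigenvalue = {large:.3f}")
```

A uniform patch gives two zero eigenvalues, a straight edge gives one zero eigenvalue (the aperture problem), and a corner gives two clearly positive ones, so thresholding the smaller eigenvalue is a common way to select trackable patches.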
Aperture Problem: Example
If we look through a small hole...
Image from Gary Bradski's slides; video by Olha Mishkina
Multi-resolution Lucas-Kanade
- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
- Translation is usually sufficient for small fragments, but:
  • perspective transforms and occlusions cause drift and loss
- Two complementary options:
  • kill tracklets when the minimum SSD is too large
  • compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
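The two checks can be sketched as a single predicate. This is a simplified illustration: the affine warp of the second option is left out (the window is assumed to be already aligned with the initial patch), and the threshold value and all names are hypothetical.

```python
def ssd(a, b):
    """Sum of squared intensity differences between two equally sized patches."""
    return sum((pa - pb) ** 2
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))

def tracklet_alive(best_window, template, initial_patch, max_ssd):
    """Apply both checks to the best-matching window of the current frame."""
    if ssd(best_window, template) > max_ssd:        # option 1: minimum SSD too large
        return False
    if ssd(best_window, initial_patch) > max_ssd:   # option 2: drifted from the initial patch
        return False
    return True

initial = [[10, 10, 10], [10, 50, 10], [10, 10, 10]]
template = [[10, 11, 10], [10, 49, 10], [10, 10, 10]]   # model after a few frames
good = [[10, 11, 10], [10, 48, 10], [11, 10, 10]]
occluded = [[90, 90, 90], [90, 90, 10], [10, 10, 10]]

MAX_SSD = 50   # hypothetical threshold, would be tuned per application
print("good window kept:", tracklet_alive(good, template, initial, MAX_SSD))
print("occluded window kept:", tracklet_alive(occluded, template, initial, MAX_SSD))
```

Comparing against the initial patch as well as the current template guards against slow drift, which the frame-to-frame check alone cannot see.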
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update, last frame, or convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters
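The cost function and its iterative minimisation can be illustrated in one dimension. The slide summarises the optimisation as gradient descent; the sketch below uses the classic closed-form Lucas-Kanade increment, which is a Gauss-Newton step on the same SSD cost. The smooth test signal, the true shift of 1.7 and all names are assumptions made for the demo.

```python
import math

def image(x):
    """Continuous 1-D 'image': a smooth bump centred at 21.7 (assumed test signal)."""
    return math.exp(-((x - 21.7) ** 2) / 32.0)

def image_gradient(x, h=1e-4):
    """Numerical derivative of the image by central differences."""
    return (image(x + h) - image(x - h)) / (2.0 * h)

positions = range(41)
template = [math.exp(-((x - 20.0) ** 2) / 32.0) for x in positions]   # bump at 20

def ssd(d):
    """Cost: sum of squared differences between the template and the shifted window."""
    return sum((image(x + d) - t) ** 2 for x, t in zip(positions, template))

def lucas_kanade_1d(d=0.0, iterations=10):
    """Estimate the translation d that aligns the window with the template."""
    for _ in range(iterations):
        num = den = 0.0
        for x, t in zip(positions, template):
            g = image_gradient(x + d)
            num += g * (t - image(x + d))
            den += g * g
        d += num / den          # Gauss-Newton step on the SSD cost
    return d

d_hat = lucas_kanade_1d()
print("estimated shift:", d_hat, " SSD before:", ssd(0.0), " SSD after:", ssd(d_hat))
```

The denominator is the 1-D counterpart of the structure tensor, so a flat signal would make the step ill-defined. The same structure extends to warps with several parameters by replacing the scalar division with a small linear solve.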
Original Lucas-Kanade algorithm IV
slide creditTomas Svoboda 53150
Original Lucas-Kanade summary
slide credit Tomas Svoboda 54150
KLT Tracker
For good conditioning patch must be
texturedstructured enough
bull Uniform patch no information
bull Contour element aperture problem (one dimensional
information)
bull Corners blobs and texture best estimate
[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]
slide credit
Patrick Perez
55150
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
bull cost function sum of squared intensity differences
between template and window
bull optimization technique gradient descent
bull model learning no update last frame convex
combination
bull attractive properties
ndashfast
ndasheasily extended to image-to-image transformations with
multiple parameters
59150
What about deep
learning or
CNN for tracking
60150
61150
AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo
Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016
Recap CaffeNet architecture
63
Recent trackers development
74150httpsgithubcomfoolwoodbenchmark_results
CNN KCF
Discriminative Tracking (T by Detection)
t=0
samples
labels+1 +1 +1 minus1 minus1 minus1
Classifier
tgt0
hellipClassify subwindows to find target
75
Connection to Correlation
The Convolution Theorem
ldquoCross-correlation is equivalent to an
element-wise product in Fourier domainrdquo
bull where
ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858
(likewise for ො119857 and ෝ119856)
ndashtimes is element-wise product
ndash lowast is complex-conjugate (ie negate imaginary part)
119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺
bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)
76
Kernelized Correlation Filters
bull Circulant matrices are a very general tool which allows to replace
standard operations with fast Fourier operations
bull The same idea can by applied eg to the Kernel Ridge Regression
with K kernel matrix Kij = (xi xj) and dual space representation
bull For many kernels circulant data circulant K matrix
bull Diagonalizing with the DFT for learning the classifier yields
120630 = 119870 + 120582119868 minus1119858
119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)
ෝ120630 =ො119858
መ119844119857119857+ 120582
Fast solution in 119978 119899 log 119899 Typical kernel algorithms are
119978 1198992 or higher
rArr
77
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
bull very few
hyperparameters
bull code fits on one slide
of the presentation
bull Use HoG features
(32 channels)
bull ~300 FPS
bull Open-Source
(MatlabPythonJavaC)
function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)
end
function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))
end
function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))
end
Training and detection (Matlab)
Sum over channel dimensionin kernel computation
79
From KCF to Discriminative CF trackers
Basic
bull Henriques et al ndash CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
bull Danelljan et al ndashSRDCF
ndash spatial regularization in the learning process
rarr limits boundary effect
rarr penalize filter coefficients depending on their spatial location
ndash allows to use much larger search region
ndash more discriminative to background (more training data)
CNN-based Correlation Trackers
bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT
2016)
bull Ma et al
ndash features VGG-Net pretrained on ImageNet dataset extracted from
third fourth and fifth convolution layer
ndash for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
bull Nam et al ndash MDNet T-CNN
ndash CNN classification (3 convolution layers and 2 fully connected layers)
81
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT challenge evolution
bull Gradual increase of dataset size
bull Gradual refinement of dataset construction
bull Gradual refinement of performance measures
bull Gradual increase of tested trackers
Perf Measures Dataset size Target box Property Trackers tested
VOT2013 ranks A R 16 s manual manual per frame 27
VOT2014 ranks A R EFO 25 s manual manual per frame 38
VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR
VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR
8739
Class of trackers tested
bull Single-object single-camera
bull Short-term
ndashTrackers performing without re-detection
bull Causality
ndashTracker is not allowed to use any future frames
bull No prior knowledge about the target
ndashOnly a single training example ndash BBox in the first frame
bull Object state encoded by a bounding box
88150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (13) Sequence candidates
ALOV (315 seq)[Smeulders et al2013]
Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences
Example Poorly defined target Example Artificially created
356 sequences
PTR (~50 seq)[Vojir et al2013]
+OTB (~50 seq)
[Wu et al2013]
+
gt30 new sequencesfrom VOT2015
committee
+
443sequences
VOT Automatic Dataset Construction Protocol
cluster + sample
8939
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (23) Clustering
bull Approximately annotate targets
bull 11 global attributes estimated
automatically for 356 sequences (eg blur camera motion object motion)
bull Cluster into K = 28 groups (automatic selection of K)
Feature encoding
11 dim
Affinity Propagation [Frey Dueck 2007]
Cluster similar sequences
9039
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (33) Sampling
bull Requirement
bull Diverse visual attributes
bull Challenging subset
bull Global visual attributes computed
bull Tracking difficulty attribute Applied FoT ASMS KCF trackers
bull Developed a sampling strategy that sampled
challenging sequences while keeping the global
attributes diverse
9139
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT201516 dataset 60 sequences
9239
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Object annotation
bull Automatic bounding box placement
1 Segment the target (semi-automatic)
2 Automatically fit a bounding box by optimizing a cost function
9339
Kristan et al VOT2016 results
Sequence ranking
bull Among the most challenging sequences
bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)
9442
Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)
Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (axis: Accuracy), with C-COT, TCNN, SSAT and MLDF labelled]
VOT 2016: Tracking speed
• Top performers are the slowest
  - Plausible cause: CNN
• Real-time bound: Staple+
  - Decent accuracy
  - Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: speed plot, with C-COT, TCNN, SSAT, MLDF and Staple+ labelled]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: Submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)
Area of expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Original Lucas-Kanade summary
Slide credit: Tomas Svoboda
KLT Tracker
For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
Slide credit: Patrick Perez
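The conditioning statement can be made concrete with the 2x2 gradient (structure) matrix summed over the patch: both eigenvalues vanish for a uniform patch, one vanishes for a straight edge, and both are clearly positive for a corner. The sketch below is illustrative only; the function name, the central-difference gradients and the tiny synthetic patches are assumptions, not taken from the slides.

```python
import math


def structure_tensor_eigenvalues(patch):
    """Eigenvalues (small, large) of the 2x2 gradient matrix summed over a patch."""
    height, width = len(patch), len(patch[0])
    gxx = gxy = gyy = 0.0
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            ix = (patch[r][c + 1] - patch[r][c - 1]) / 2.0  # central differences
            iy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
    half_trace = (gxx + gyy) / 2.0
    root = math.sqrt(((gxx - gyy) / 2.0) ** 2 + gxy ** 2)
    return half_trace - root, half_trace + root


uniform = [[1.0] * 6 for _ in range(6)]
edge = [[1.0 if c >= 3 else 0.0 for c in range(6)] for r in range(6)]
corner = [[1.0 if (r >= 3 and c >= 3) else 0.0 for c in range(6)] for r in range(6)]

for name, patch in (("uniform", uniform), ("edge", edge), ("corner", corner)):
    print(name, structure_tensor_eigenvalues(patch))
```

A tracker in the spirit of [Tomasi and Shi, CVPR'94] would keep a patch only when the smaller eigenvalue exceeds some threshold; the threshold value is application-dependent and is not given in the slides.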
Aperture Problem Example
Image from Gary Bradski slides
If we look through a small hole…
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
- Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
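The first option can be sketched in a few lines: after the tracker has converged for a tracklet, the remaining SSD between template and current patch is compared with a threshold. The names `ssd` and `surviving_tracklets`, the dictionary layout and the threshold value are hypothetical choices for illustration.

```python
def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches."""
    return sum(
        (a - b) ** 2
        for row_a, row_b in zip(patch_a, patch_b)
        for a, b in zip(row_a, row_b)
    )


def surviving_tracklets(tracklets, max_ssd):
    """Keep only tracklets whose current patch still matches its template."""
    return [
        name
        for name, (template, current) in tracklets.items()
        if ssd(template, current) <= max_ssd
    ]


tracklets = {
    # name: (template patch, patch at the converged position in the current frame)
    "stable": ([[10, 12], [11, 13]], [[10, 13], [11, 12]]),  # SSD = 2
    "drifted": ([[10, 12], [11, 13]], [[40, 2], [30, 60]]),  # SSD = 3570
}
print(surviving_tracklets(tracklets, max_ssd=50))  # ['stable']
```

The second option on the slide would apply the same test against the initial patch after warping it with the estimated affine transform; that step is omitted here.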
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters
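To make the cost function and its optimization concrete, here is a 1-D toy version for a pure translation. It minimizes the sum over x of (I(x + p) - T(x))^2 with Gauss-Newton updates, the scheme named on the multi-resolution slide. The analytic intensity profile `bump` and all function names are assumptions chosen so that the example needs no interpolation; a real tracker works on sampled image patches and does not know the true shift.

```python
import math


def bump(x):
    """Smooth 1-D intensity profile (hypothetical stand-in for an image row)."""
    return math.exp(-((x - 10.0) ** 2) / (2.0 * 3.0 ** 2))


def bump_derivative(x):
    """Analytic derivative of bump, used as the image gradient."""
    return -(x - 10.0) / 3.0 ** 2 * bump(x)


def estimate_shift(true_shift, iterations=10):
    """Estimate p such that I(x + p) matches T(x), where I(x) = T(x - true_shift)."""
    xs = list(range(21))
    template = [bump(x) for x in xs]  # T(x)
    p = 0.0
    for _ in range(iterations):
        warped = [bump(x + p - true_shift) for x in xs]  # I(x + p)
        gradients = [bump_derivative(x + p - true_shift) for x in xs]  # dI/dx at x + p
        numerator = sum(g * (t - w) for g, t, w in zip(gradients, template, warped))
        denominator = sum(g * g for g in gradients)
        p += numerator / denominator  # Gauss-Newton step for the SSD cost
    return p


print(estimate_shift(0.7))
```

In 2-D the scalar `denominator` becomes the 2x2 gradient matrix from the KLT conditioning slide, which is why a poorly textured patch makes the update ill-conditioned.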
What about deep learning or CNN for tracking?
AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012
CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016
Recap: CaffeNet architecture
Recent trackers development
https://github.com/foolwood/benchmark_results
CNN, KCF
Discriminative Tracking (T. by Detection)
t = 0: samples with labels +1 +1 +1 −1 −1 −1 → Classifier
t > 0: … classify subwindows to find target
Connection to Correlation
The Convolution Theorem:
"Cross-correlation is equivalent to an element-wise product in Fourier domain"
$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$
• where:
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is element-wise product
  - $^{*}$ is complex-conjugate (i.e. negate imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
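The theorem is easy to check numerically on short real-valued signals. The sketch below uses a naive O(n^2) DFT so that it stays self-contained; an actual tracker would use an FFT routine. The function names and the index convention y[k] = sum_t x[t] w[(t + k) mod n] are assumptions of this example.

```python
import cmath


def dft(values):
    """Naive discrete Fourier transform."""
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [
        sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
        for t in range(n)
    ]


def cyclic_cross_correlation(x, w):
    """Direct evaluation: y[k] = sum_t x[t] * w[(t + k) mod n]."""
    n = len(x)
    return [sum(x[t] * w[(t + k) % n] for t in range(n)) for k in range(n)]


def cyclic_cross_correlation_dft(x, w):
    """Same result through the Fourier domain: y_hat = conj(x_hat) * w_hat."""
    y_hat = [a.conjugate() * b for a, b in zip(dft(x), dft(w))]
    return [value.real for value in idft(y_hat)]


x = [1.0, 2.0, 0.0, -1.0, 3.0]
w = [0.5, -1.0, 2.0, 0.0, 1.0]
direct = cyclic_cross_correlation(x, w)
via_dft = cyclic_cross_correlation_dft(x, w)
print(max(abs(a - b) for a, b in zip(direct, via_dft)))  # tiny rounding error only
```

The modulo in the direct version is the "window wraps at the image edges" remark from the slide.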
Kernelized Correlation Filters
• Circulant matrices are a very general tool which allows replacing standard operations with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual space representation:
$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$
• For many kernels: circulant data $\Rightarrow$ circulant $K$ matrix
$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields:
$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$
Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
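A small numerical check of the Fourier-domain solution is sketched below. It uses a linear kernel in 1-D and a naive DFT purely to stay self-contained, so it does not show the O(n log n) cost; it only confirms that the closed form satisfies the dense system (K + lambda I) alpha = y. The function names and test signals are assumptions of this example.

```python
import cmath


def dft(values):
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


def idft(spectrum):
    n = len(spectrum)
    return [
        sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
        for t in range(n)
    ]


def linear_kernel_autocorrelation(x):
    """k[m] = <x, x cyclically shifted by m>: the first row of the circulant K."""
    n = len(x)
    return [sum(x[t] * x[(t + m) % n] for t in range(n)) for m in range(n)]


def train(x, y, lam):
    """Dual coefficients via alpha_hat = y_hat / (k_hat + lambda)."""
    k_hat = dft(linear_kernel_autocorrelation(x))
    alpha_hat = [y_f / (k_f + lam) for y_f, k_f in zip(dft(y), k_hat)]
    return [value.real for value in idft(alpha_hat)]


def dense_system_residual(x, y, lam, alpha):
    """Largest entry of |(K + lambda I) alpha - y| with K built explicitly."""
    n = len(x)
    k = linear_kernel_autocorrelation(x)
    worst = 0.0
    for i in range(n):
        row = sum(k[(j - i) % n] * alpha[j] for j in range(n)) + lam * alpha[i]
        worst = max(worst, abs(row - y[i]))
    return worst


x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]
y = [1.0, 0.5, 0.0, 0.0, 0.0, 0.5]
alpha = train(x, y, lam=0.1)
print(dense_system_residual(x, y, 0.1, alpha))  # close to machine precision
```

The dense check costs O(n^2) per product, while the Fourier route needs only element-wise divisions once the transforms are available.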
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$
• For the Gaussian kernel it yields:
$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features
$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$, where $P$ is the cyclic shift operator
(multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in this term)
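The training and detection steps can be tried end to end on a 1-D toy signal. The sketch below follows the argument order and the extra division by the number of elements used in the Matlab code on the next slide, but it is single-channel, uses a naive DFT, and the signals, `sigma` and `lam` values are made up for illustration.

```python
import cmath
import math


def dft(values):
    n = len(values)
    return [
        sum(values[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
        for f in range(n)
    ]


def idft(spectrum):
    n = len(spectrum)
    return [
        sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
        for t in range(n)
    ]


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of two real 1-D signals for all cyclic shifts."""
    n = len(x1)
    cross = idft([a.conjugate() * b for a, b in zip(dft(x1), dft(x2))])
    energy = sum(v * v for v in x1) + sum(v * v for v in x2)
    return [math.exp(-abs(energy - 2.0 * c.real) / (sigma ** 2 * n)) for c in cross]


def train(x, y, sigma, lam):
    """Classifier in the Fourier domain: alpha_hat = y_hat / (k_hat + lambda)."""
    k_hat = dft(kernel_correlation(x, x, sigma))
    return [y_f / (k_f + lam) for y_f, k_f in zip(dft(y), k_hat)]


def detect(alpha_hat, x, z, sigma):
    """Response for every cyclic shift of z."""
    k_hat = dft(kernel_correlation(z, x, sigma))
    return [v.real for v in idft([a * k for a, k in zip(alpha_hat, k_hat)])]


n = 8
x = [0.0, 1.0, 4.0, 2.0, -3.0, 0.5, -1.0, 3.0]  # model patch
y = [math.exp(-min(i, n - i) ** 2 / 2.0) for i in range(n)]  # labels peaked at index 0
alpha_hat = train(x, y, sigma=2.0, lam=0.01)

z = [x[(t + 3) % n] for t in range(n)]  # cyclically shifted copy of the model
response = detect(alpha_hat, x, z, sigma=2.0)
print(response.index(max(response)))  # 3
```

With z built as `x[(t + 3) % n]`, the response peaks at index 3 under this indexing convention; other conventions for the shift direction or the conjugated argument move the peak to the mirrored index.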
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• Use HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)
Training and detection (Matlab):
    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);   % element-wise division in the Fourier domain
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));       % response map over all cyclic shifts
    end

    function k = kernel_correlation(x1, x2, sigma)
      % sum over the channel dimension (3) in the kernel computation
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
From KCF to Discriminative CF trackers
Basic
• Henriques et al. - CSK
  - raw grayscale pixel values as features
• Henriques et al. - KCF
  - HoG multi-channel features
Further work
• Danelljan et al. - DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al. - SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize scale space and normalize each scale to one size by bilinear interpolation → only one filter on normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows the use of a much larger search region
  - more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? "The zero-order tracker"
VOT community evolution
[Figure: timeline with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked; vertical axis ticks at 1500 and 3000]
• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• + VOT-TIR papers (69 coauthors and 70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
         | Perf. measures   | Dataset size   | Target box | Property  | Trackers tested
VOT2013  | ranks, A, R      | 16, s. manual  | manual     | per frame | 27
VOT2014  | ranks, A, R, EFO | 25, s. manual  | manual     | per frame | 38
VOT2015  | EAO, A, R, EFO   | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016  | EAO, A, R, EFO   | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
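In the VOT methodology, A (accuracy) is based on the overlap between predicted and ground-truth boxes, and R (robustness) on the number of tracking failures. The sketch below is a simplified illustration for axis-aligned boxes. It counts a failure whenever the overlap drops to zero and ignores the re-initialization protocol of the actual toolkit; the function names and the (x, y, width, height) box layout are assumptions.

```python
def overlap(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over successfully tracked frames, and the failure count."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for value in overlaps if value == 0.0)
    tracked = [value for value in overlaps if value > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (100, 100, 10, 10)]
print(accuracy_and_failures(predicted, ground_truth))
```

Here the second frame overlaps by one third and the third frame misses the target entirely, so the sketch reports an accuracy of about 0.67 and one failure.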
Class of trackers tested
• Single-object, single-camera
• Short-term:
  - Trackers performing without re-detection
• Causality:
  - Tracker is not allowed to use any future frames
• No prior knowledge about the target:
  - Only a single training example: BBox in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
ALOV (315 seq.) [Smeulders et al. 2013]
+ PTR (~50 seq.) [Vojir et al. 2013]
+ OTB (~50 seq.) [Wu et al. 2013]
+ >30 new sequences from the VOT2015 committee
Total: 443 sequences
Filtered out:
• Grayscale sequences
• Targets smaller than 400 pixels
• Poorly-defined targets
• Artificially created sequences
[Examples shown: poorly defined target; artificially created sequence]
Remaining: 356 sequences
VOT Automatic Dataset Construction Protocol: cluster + sample
Aperture Problem Example
56150Image from Gary Bradski slides
If we look through small holehellip
Video by Olha Mishkina
Multi-resolution Lucas-Kanade
ndash Arbitrary displacement
bull Multi-resolution approach Gauss-Newton like approximation down image
pyramid
57150Slide credit James Hays
Monitoring quality
ndash Translation is usually sufficient for small fragments but
bull Perspective transforms and occlusions cause drift and loss
ndash Two complementary options
bull Kill tracklets when minimum SSD too large
bull Compare as well with initial patch under affine transform (warp) assumption
slide credit
Patrick Perez
58150
Characteristics of KLT
bull cost function sum of squared intensity differences
between template and window
bull optimization technique gradient descent
bull model learning no update last frame convex
combination
bull attractive properties
ndashfast
ndasheasily extended to image-to-image transformations with
multiple parameters
59150
What about deep
learning or
CNN for tracking
60150
61150
AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo
Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016
Recap CaffeNet architecture
63
Recent trackers development
74150httpsgithubcomfoolwoodbenchmark_results
CNN KCF
Discriminative Tracking (T by Detection)
t=0
samples
labels+1 +1 +1 minus1 minus1 minus1
Classifier
tgt0
hellipClassify subwindows to find target
75
Connection to Correlation
The Convolution Theorem
ldquoCross-correlation is equivalent to an
element-wise product in Fourier domainrdquo
bull where
ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858
(likewise for ො119857 and ෝ119856)
ndashtimes is element-wise product
ndash lowast is complex-conjugate (ie negate imaginary part)
119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺
bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)
76
Kernelized Correlation Filters
bull Circulant matrices are a very general tool which allows to replace
standard operations with fast Fourier operations
bull The same idea can by applied eg to the Kernel Ridge Regression
with K kernel matrix Kij = (xi xj) and dual space representation
bull For many kernels circulant data circulant K matrix
bull Diagonalizing with the DFT for learning the classifier yields
120630 = 119870 + 120582119868 minus1119858
119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)
ෝ120630 =ො119858
መ119844119857119857+ 120582
Fast solution in 119978 119899 log 119899 Typical kernel algorithms are
119978 1198992 or higher
rArr
77
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
bull very few
hyperparameters
bull code fits on one slide
of the presentation
bull Use HoG features
(32 channels)
bull ~300 FPS
bull Open-Source
(MatlabPythonJavaC)
Training and detection (Matlab). The sum over the channel dimension happens in the kernel computation.
    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);   % ridge regression, solved per frequency
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));       % response map over all cyclic shifts
    end

    function k = kernel_correlation(x1, x2, sigma)
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));   % sum over channels (3rd dimension)
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;         % squared distances for all shifts
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
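For readers without Matlab, the training and detection routines above can be ported to NumPy almost line by line. This is a sketch rather than a reference implementation: `gaussian_labels` is a hypothetical helper added only to build a regression target peaked at the origin, and patches are assumed to be arrays of shape (H, W, C).

```python
import numpy as np

def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of two (H, W, C) patches; returns an (H, W) map."""
    x1f = np.fft.fft2(x1, axes=(0, 1))
    x2f = np.fft.fft2(x2, axes=(0, 1))
    # Per-channel cross-correlation, summed over the channel axis.
    c = np.fft.ifft2(np.sum(np.conj(x1f) * x2f, axis=2))
    d = np.sum(x1 ** 2) + np.sum(x2 ** 2) - 2.0 * c
    return np.exp(-np.abs(d) / (sigma ** 2 * d.size))

def train(x, y, sigma, lam):
    """Return the classifier in the Fourier domain (alphaf)."""
    k = kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alphaf, x, z, sigma):
    """Response map of the classifier over all cyclic shifts of z."""
    k = kernel_correlation(z, x, sigma)
    return np.real(np.fft.ifft2(alphaf * np.fft.fft2(k)))

def gaussian_labels(h, w, s):
    """Hypothetical helper: Gaussian target peaked at (0, 0), wrapping cyclically."""
    ys = np.minimum(np.arange(h), h - np.arange(h))[:, None]
    xs = np.minimum(np.arange(w), w - np.arange(w))[None, :]
    return np.exp(-0.5 * (ys ** 2 + xs ** 2) / s ** 2)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((32, 32, 3))
    alphaf = train(x, gaussian_labels(32, 32, 2.0), sigma=0.5, lam=1e-4)
    z = np.roll(x, (5, 3), axis=(0, 1))            # cyclically translated target
    response = detect(alphaf, x, z, sigma=0.5)
    print(np.unravel_index(np.argmax(response), response.shape))
```

The location of the response peak encodes the translation modulo the window size. Its sign depends on which argument is conjugated in `kernel_correlation`, so a tracker built on this sketch has to map the peak back to a displacement accordingly.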
From KCF to Discriminative CF trackers
Basic
• Henriques et al. – CSK
– raw grayscale pixel values as features
• Henriques et al. – KCF
– HoG multi-channel features
Further work
• Danelljan et al. – DSST
– PCA-HoG + grayscale pixel features
– filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
– HoG, color-naming and grayscale pixel features
– quantize scale space and normalize each scale to one size by bilinear interpolation → only one filter on normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. – SRDCF
– spatial regularization in the learning process
→ limits boundary effect
→ penalizes filter coefficients depending on their spatial location
– allows a much larger search region to be used
– more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
– features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
– for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. – MDNet, T-CNN
– CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? “The zero-order tracker”
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
[Figure: timeline of the VOT results papers with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked]
• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• + VOT-TIR papers (69 and 70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
Challenge | Perf. measures   | Dataset size   | Target box | Property  | Trackers tested
VOT2013   | ranks, A, R      | 16, s. manual  | manual     | per frame | 27
VOT2014   | ranks, A, R, EFO | 25, s. manual  | manual     | per frame | 38
VOT2015   | EAO, A, R, EFO   | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016   | EAO, A, R, EFO   | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
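The A and R columns can be illustrated with a small sketch. It assumes, following the VOT methodology papers rather than anything stated on this slide, that accuracy (A) is the average bounding-box overlap while the target is tracked and that robustness (R) counts failures, i.e. frames with zero overlap. Boxes are taken as (x, y, width, height) tuples, and the re-initialization and burn-in steps of the real protocol are deliberately left out.

```python
def overlap(a, b):
    """Intersection over union of two axis-aligned boxes (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Average overlap over tracked frames and the number of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures

if __name__ == "__main__":
    pred = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
    gt = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 5, 5)]
    print(accuracy_and_failures(pred, gt))
```

EFO and EAO build on further conventions of the VOT toolkit and are not reproduced here.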
Class of trackers tested
• Single-object, single-camera
• Short-term
– Trackers performing without re-detection
• Causality
– Tracker is not allowed to use any future frames
• No prior knowledge about the target
– Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
VOT Automatic Dataset Construction Protocol: cluster + sample
Candidate pool:
• ALOV (315 seq.) [Smeulders et al., 2013]
• + PTR (~50 seq.) [Vojir et al., 2013]
• + OTB (~50 seq.) [Wu et al., 2013]
• + >30 new sequences from the VOT2015 committee
• = 443 sequences
Filtered out:
• Grayscale sequences
• Targets of <400 pixels
• Poorly defined targets
• Artificially created sequences
Remaining: 356 sequences
[Figure: examples of a poorly defined target and of an artificially created sequence]
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Pipeline: feature encoding (11 dim.) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences
Construction (3/3): Sampling
• Requirements:
– Diverse visual attributes
– Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
1. Segment the target (semi-automatic)
2. Automatically fit a bounding box by optimizing a cost function
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• Bounding boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016: Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (accuracy axis) with C-COT, TCNN, SSAT and MLDF marked]
VOT 2016: Tracking speed
• Top performers are the slowest
– Plausible cause: CNN
• Real-time bound: Staple+
– Decent accuracy
– Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: tracking speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers (code/binaries)
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers
– target learning (modelling, “template update”)
– integration of detection and temporal smoothness assumptions
– representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague
established 1707
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: a significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Multi-resolution Lucas-Kanade
– Arbitrary displacement
• Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays
Monitoring quality
– Translation is usually sufficient for small fragments, but:
• Perspective transforms and occlusions cause drift and loss
– Two complementary options
• Kill tracklets when the minimum SSD is too large
• Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
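The two checks can be combined into a single acceptance test. The sketch below is a simplification: it assumes the current patch has already been warped back into the frame of the initial patch (the affine step itself is omitted), and both thresholds are hypothetical tuning parameters.

```python
import numpy as np

def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches."""
    diff = np.asarray(patch_a, dtype=np.float64) - np.asarray(patch_b, dtype=np.float64)
    return float(np.sum(diff ** 2))

def keep_tracklet(prev_patch, curr_patch, init_patch, max_ssd_frame, max_ssd_init):
    """Keep the tracklet only if it matches both the previous and the initial patch."""
    frame_ok = ssd(prev_patch, curr_patch) <= max_ssd_frame   # frame-to-frame residual
    init_ok = ssd(init_patch, curr_patch) <= max_ssd_init     # accumulated drift
    return frame_ok and init_ok
```

The frame-to-frame test catches sudden losses such as occlusion, while the comparison with the initial patch catches slow drift that never produces a large residual between consecutive frames.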
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties
– fast
– easily extended to image-to-image transformations with multiple parameters
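The three model-learning options can be written as one update rule with a single rate parameter. This is a minimal sketch; the name `update_template` and the `rate` parameter are illustrative and not part of any KLT implementation.

```python
import numpy as np

def update_template(template, observation, rate):
    """Convex combination of the stored template and the newest observation.

    rate = 0.0 keeps the initial template (no update),
    rate = 1.0 replaces it with the last frame,
    values in between blend the two.
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    template = np.asarray(template, dtype=np.float64)
    observation = np.asarray(observation, dtype=np.float64)
    return (1.0 - rate) * template + rate * observation
```

A small rate resists drift but adapts slowly to appearance change, and a large rate does the opposite.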
What about deep learning or CNN for tracking?
Recap: CaffeNet architecture
AlexNet (original): Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, 2012
CaffeNet: Jia et al., “Caffe: Convolutional Architecture for Fast Feature Embedding”, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016
Recent trackers development
https://github.com/foolwood/benchmark_results
[Figure: benchmark results chart with CNN and KCF trackers marked]
Discriminative Tracking (Tracking by Detection)
• t = 0: samples with labels (+1, +1, +1, −1, −1, −1) → classifier
• t > 0: classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
“Cross-correlation is equivalent to an element-wise product in Fourier domain”
$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^* \times \hat{\mathbf{w}}$
• where
– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
– $\times$ is the element-wise product
– $*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
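The statement is easy to verify numerically in one dimension. In this sketch, which assumes NumPy, cyclic cross-correlation is taken as y[m] = sum over i of x[i] * w[(i + m) mod n], the indexing that matches placing the conjugate on the transform of x. Both function names are illustrative.

```python
import numpy as np

def cross_correlation_direct(x, w):
    """Cyclic cross-correlation by explicit summation, quadratic in n."""
    n = len(x)
    return np.array([sum(x[i] * w[(i + m) % n] for i in range(n)) for m in range(n)])

def cross_correlation_fft(x, w):
    """Same result via conj(F(x)) * F(w) in the Fourier domain."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x, w = rng.standard_normal(16), rng.standard_normal(16)
    print(np.allclose(cross_correlation_direct(x, w), cross_correlation_fft(x, w)))
```

Because the indices wrap modulo n, the result is the cyclic correlation and not the zero-padded one, which is the source of the boundary effect that correlation filter trackers have to handle.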
Kernelized Correlation Filters
bull Circulant matrices are a very general tool which allows to replace
standard operations with fast Fourier operations
bull The same idea can by applied eg to the Kernel Ridge Regression
with K kernel matrix Kij = (xi xj) and dual space representation
bull For many kernels circulant data circulant K matrix
bull Diagonalizing with the DFT for learning the classifier yields
120630 = 119870 + 120582119868 minus1119858
119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)
ෝ120630 =ො119858
መ119844119857119857+ 120582
Fast solution in 119978 119899 log 119899 Typical kernel algorithms are
119978 1198992 or higher
rArr
77
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
bull very few
hyperparameters
bull code fits on one slide
of the presentation
bull Use HoG features
(32 channels)
bull ~300 FPS
bull Open-Source
(MatlabPythonJavaC)
function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)
end
function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))
end
function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))
end
Training and detection (Matlab)
Sum over channel dimensionin kernel computation
79
From KCF to Discriminative CF trackers
Basic
bull Henriques et al ndash CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
bull Danelljan et al ndashSRDCF
ndash spatial regularization in the learning process
rarr limits boundary effect
rarr penalize filter coefficients depending on their spatial location
ndash allows to use much larger search region
ndash more discriminative to background (more training data)
CNN-based Correlation Trackers
bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT
2016)
bull Ma et al
ndash features VGG-Net pretrained on ImageNet dataset extracted from
third fourth and fifth convolution layer
ndash for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
bull Nam et al ndash MDNet T-CNN
ndash CNN classification (3 convolution layers and 2 fully connected layers)
81
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT challenge evolution
bull Gradual increase of dataset size
bull Gradual refinement of dataset construction
bull Gradual refinement of performance measures
bull Gradual increase of tested trackers
Perf Measures Dataset size Target box Property Trackers tested
VOT2013 ranks A R 16 s manual manual per frame 27
VOT2014 ranks A R EFO 25 s manual manual per frame 38
VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR
VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR
8739
Class of trackers tested
bull Single-object single-camera
bull Short-term
ndashTrackers performing without re-detection
bull Causality
ndashTracker is not allowed to use any future frames
bull No prior knowledge about the target
ndashOnly a single training example ndash BBox in the first frame
bull Object state encoded by a bounding box
88150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (13) Sequence candidates
ALOV (315 seq)[Smeulders et al2013]
Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences
Example Poorly defined target Example Artificially created
356 sequences
PTR (~50 seq)[Vojir et al2013]
+OTB (~50 seq)
[Wu et al2013]
+
gt30 new sequencesfrom VOT2015
committee
+
443sequences
VOT Automatic Dataset Construction Protocol
cluster + sample
8939
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (23) Clustering
bull Approximately annotate targets
bull 11 global attributes estimated
automatically for 356 sequences (eg blur camera motion object motion)
bull Cluster into K = 28 groups (automatic selection of K)
Feature encoding
11 dim
Affinity Propagation [Frey Dueck 2007]
Cluster similar sequences
9039
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (33) Sampling
bull Requirement
bull Diverse visual attributes
bull Challenging subset
bull Global visual attributes computed
bull Tracking difficulty attribute Applied FoT ASMS KCF trackers
bull Developed a sampling strategy that sampled
challenging sequences while keeping the global
attributes diverse
9139
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT201516 dataset 60 sequences
9239
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Object annotation
bull Automatic bounding box placement
1 Segment the target (semi-automatic)
2 Automatically fit a bounding box by optimizing a cost function
9339
Kristan et al VOT2016 results
Sequence ranking
bull Among the most challenging sequences
bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)
9442
Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)
Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT Results Realtime
bull Flow-based Mean Shift-based Correlation filters
bull Engineering use of basic features
2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)
2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)
ASMSBDF
FoT
2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)
PLTFoT
CCMS
9639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Results
bull C-COT slightly ahead of TCNN
bull Most accurate SSAT
bull Most robust C-COT and MLDF
Overlap curves
9742
(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple
C-COT
TCNNSSAT
MLDF
AR-raw plot
Acc
ura
cy
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Tracking speed
bull Top-performers slowest bull Plausible cause CNN
bull Real-time bound Staple+
bull Decent accuracy
bull Decent robustness
Note the speed in some Matlab trackers has been significantly underestimated by the toolkit since it was measuring also the Matlabrestart time The EFOs of Matlab trackers are in fact higher than stated in this figure
9842
C-COT TCNNSSAT MLDF
Staple+
Staple+
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT public resources
bull Resources publicly available VOT page
bull Raw results of all tested trackers
bull Relevant methodology papers
bull 2016 Submitted trackers codebinaries
bull All fully annotated datasets (2013-2016)
bull Documentation tutorials forum
9939
Summary
bull ldquoVisual Trackingrdquo may refer to quite different problems
bull The area is just starting to be affected by CNNs
bull Robustness at all levels is the road to reliable performance
bull Key components of trackers
ndash target learning (modelling ldquotemplate updaterdquo)
ndash integration of detection and temporal smoothness assumptions
ndash representation of the image and target
bull Be careful when evaluating tracking results
100150
Computer vision courses online
bull httpscwfelcvutczwikicoursesucuws17start UCU Winter
school course
bull httpcsbrowneducoursescs143
bull httpswwwudacitycomcourseintroduction-to-computer-vision-
-ud810
bull httpcs231nstanfordedu
bull httpvisionstanfordeduteachingcs131_fall1415indexhtml
101150
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University Prague
established 1707
Area of Expertise Computer Vision Image Processing Pattern Recognition
People12 Academics 3 full 2 associate 7 assistant professors15 Researchers15 Phd students 15 MSc and BSc students
Impact amp Excellencesignificant of funding (gt75) from EU and private companies collaboration with high-tech companies (Microsoft Google Toyota VW Daimler Samsung Boeing Hitachi Honeywell) numerous science prizes hundreds of ISI citations to our publications per year competitive results in contests
Visual Recognition Group (headJiri Matas)
We are looking for 1-3
good students for PhD
THANK YOU
Questions please
105150
Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties
– fast
– easily extended to image-to-image transformations with multiple parameters
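The KLT characteristics listed above can be illustrated with a minimal translation-only sketch in Python/NumPy. It minimizes the sum of squared intensity differences between a fixed template (the "no update" variant) and a window of the current frame. Note that it takes Gauss-Newton steps, a common choice for this cost, rather than plain gradient descent. The function names `bilinear_sample`, `klt_translation` and `blob`, and all parameter values, are illustrative assumptions, not taken from the lecture.

```python
import numpy as np

def bilinear_sample(img, ys, xs):
    """Sample img at float coordinates (ys, xs) with bilinear interpolation."""
    h, w = img.shape
    xs = np.clip(xs, 0.0, w - 1.001)  # keep the 2x2 neighbourhood inside the image
    ys = np.clip(ys, 0.0, h - 1.001)
    x0 = np.floor(xs).astype(int)
    y0 = np.floor(ys).astype(int)
    ax, ay = xs - x0, ys - y0
    return ((1 - ay) * (1 - ax) * img[y0, x0] + (1 - ay) * ax * img[y0, x0 + 1]
            + ay * (1 - ax) * img[y0 + 1, x0] + ay * ax * img[y0 + 1, x0 + 1])

def klt_translation(template, frame, top_left_xy, n_iters=30, tol=1e-4):
    """Estimate p = (dx, dy) minimizing the SSD between the template and the
    window of `frame` whose top-left corner is top_left_xy + p."""
    th, tw = template.shape
    gy, gx = np.mgrid[0:th, 0:tw].astype(float)
    p = np.zeros(2)
    for _ in range(n_iters):
        window = bilinear_sample(frame,
                                 gy + top_left_xy[1] + p[1],
                                 gx + top_left_xy[0] + p[0])
        iy, ix = np.gradient(window)               # gradients of the current window
        jac = np.stack([ix.ravel(), iy.ravel()], axis=1)
        err = (template - window).ravel()          # intensity differences
        dp = np.linalg.lstsq(jac, err, rcond=None)[0]  # Gauss-Newton step
        p += dp
        if np.linalg.norm(dp) < tol:
            break
    return p

def blob(cx, cy, size=64, s=6.0):
    """Synthetic frame: a single Gaussian blob centred at (cx, cy)."""
    yy, xx = np.mgrid[0:size, 0:size].astype(float)
    return np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * s ** 2))

frame0 = blob(32.0, 32.0)
frame1 = blob(34.5, 30.8)          # target moved by (dx, dy) = (2.5, -1.2)
template = frame0[20:44, 20:44]    # window with top-left corner (x, y) = (20, 20)
shift = klt_translation(template, frame1, top_left_xy=(20, 20))
print(shift)
```

On this smooth synthetic pair the estimate should land close to the true displacement. On real images the basin of convergence is small, which is why KLT is usually run on an image pyramid.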
What about deep learning or CNN for tracking?
Recap: CaffeNet architecture
AlexNet (original): Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, 2012
CaffeNet: Jia et al., “Caffe: Convolutional Architecture for Fast Feature Embedding”, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016
Recent trackers development
https://github.com/foolwood/benchmark_results
[Figure; highlighted labels: CNN, KCF]
Discriminative Tracking (Tracking by Detection)
• t = 0: samples with labels +1 +1 +1 −1 −1 −1 → Classifier
• t > 0: … classify subwindows to find target
Connection to Correlation
The Convolution Theorem
“Cross-correlation is equivalent to an element-wise product in Fourier domain”
$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$
• where
– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
– $\times$ is element-wise product
– $^{*}$ is complex-conjugate (i.e. negate imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
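The statement above can be checked numerically. The sketch below assumes the indexing convention y[s] = sum over n of x[n] * w[(n + s) mod N], which is what conj(FFT(x)) * FFT(w) produces. The function names are illustrative.

```python
import numpy as np

def cross_correlation_fft(x, w):
    """Cyclic cross-correlation y[s] = sum_n x[n] * w[(n + s) mod N] via the DFT."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))

def cross_correlation_direct(x, w):
    """Same quantity computed by explicitly wrapping the window, O(N^2)."""
    return np.array([np.dot(x, np.roll(w, -s)) for s in range(len(x))])

x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([0.0, 1.0, 0.0, 0.0])
y_fft = cross_correlation_fft(x, w)
y_direct = cross_correlation_direct(x, w)
print(np.allclose(y_fft, y_direct))
```

The direct version makes the cyclic behaviour explicit: `np.roll` wraps the window around the signal edges, exactly as the DFT-based version does implicitly.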
Kernelized Correlation Filters
• Circulant matrices are a very general tool which allows replacing standard operations with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression:
$$\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y}$$
with $K$ the kernel matrix, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and dual space representation
• For many kernels, circulant data ⇒ circulant $K$ matrix:
$$K = C(\mathbf{k}^{\mathbf{xx}})$$
where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields:
$$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$$
⇒ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
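As a sanity check of the diagonalization above, the following 1-D sketch compares the direct dual solution with the Fourier-domain one. It assumes a Gaussian kernel exp(-||a - b||^2 / sigma^2) and training samples that are all cyclic shifts of one base vector. The names, sizes and parameter values are illustrative assumptions. For clarity the first row of K is computed naively here, so only the solve step (not the whole sketch) runs in O(n log n).

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

def krr_direct(x, y, sigma, lam):
    """Solve (K + lam*I) alpha = y with the explicit n x n kernel matrix."""
    shifts = [np.roll(x, i) for i in range(len(x))]   # all cyclic shifts of x
    K = np.array([[gaussian_kernel(a, b, sigma) for b in shifts] for a in shifts])
    return np.linalg.solve(K + lam * np.eye(len(x)), y)

def krr_fourier(x, y, sigma, lam):
    """Same solution from the first row of K only, using the DFT."""
    k_xx = np.array([gaussian_kernel(x, np.roll(x, i), sigma)
                     for i in range(len(x))])
    return np.real(np.fft.ifft(np.fft.fft(y) / (np.fft.fft(k_xx) + lam)))

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
idx = np.arange(16)
y = np.exp(-0.5 * np.minimum(idx, 16 - idx) ** 2)    # cyclic Gaussian-shaped target
alpha_direct = krr_direct(x, y, sigma=2.0, lam=0.1)
alpha_fourier = krr_fourier(x, y, sigma=2.0, lam=0.1)
print(np.allclose(alpha_direct, alpha_fourier))
```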
Kernelized Correlation Filters
• The $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
$$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$$
• For the Gaussian kernel it yields:
$$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$$
Multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\mathcal{F}^{-1}(\cdot)$ term.
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}}\right)$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features
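The Gaussian kernel correlation and the interpolation update described above can be sketched in 1-D as follows. The sketch assumes the kernel exp(-||a - b||^2 / sigma^2) and takes P to be a cyclic shift by one element (`np.roll(x, 1)`). The function names and the `rate` parameter are hypothetical, chosen for illustration only.

```python
import numpy as np

def kernel_correlation_fft(x, xp, sigma):
    """k[i] = kappa(xp, P^i x) for all cyclic shifts i at once (0-based i)."""
    cross = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(xp)))
    return np.exp(-(x @ x + xp @ xp - 2.0 * cross) / sigma ** 2)

def kernel_correlation_direct(x, xp, sigma):
    """Same vector, evaluating the Gaussian kernel once per shifted copy of x."""
    return np.array([np.exp(-np.sum((xp - np.roll(x, i)) ** 2) / sigma ** 2)
                     for i in range(len(x))])

def interpolate(old, new, rate):
    """Linear interpolation, usable for the model x or the classifier alpha."""
    return (1.0 - rate) * old + rate * new

rng = np.random.default_rng(1)
x, xp = rng.standard_normal(8), rng.standard_normal(8)
k_fft = kernel_correlation_fft(x, xp, sigma=3.0)
k_direct = kernel_correlation_direct(x, xp, sigma=3.0)
print(np.allclose(k_fft, k_direct))
model = interpolate(x, xp, rate=0.1)   # 90% old model, 10% new observation
```

A small `rate` keeps the model stable against occlusions at the price of slower adaptation. The value used here is arbitrary.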
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• Use HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)
Training and detection (Matlab)
    function alphaf = train(x, y, sigma, lambda)
        k = kernel_correlation(x, x, sigma);
        alphaf = fft2(y) ./ (fft2(k) + lambda);
    end
    function y = detect(alphaf, x, z, sigma)
        k = kernel_correlation(z, x, sigma);
        y = real(ifft2(alphaf .* fft2(k)));
    end
    function k = kernel_correlation(x1, x2, sigma)
        % sum over channel dimension (3) in kernel computation
        c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
        d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
        k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
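For readers without Matlab, the training and detection functions on this slide can be ported to NumPy roughly as below. This is a sketch under stated assumptions, not one of the released ports: random features stand in for HoG, and `gaussian_labels`, the patch size, `sigma` and `lam` are illustrative choices.

```python
import numpy as np

def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of two H x W x C feature patches."""
    # Element-wise product in the Fourier domain, summed over the channel axis.
    c = np.fft.ifft2(np.sum(np.conj(np.fft.fft2(x1, axes=(0, 1)))
                            * np.fft.fft2(x2, axes=(0, 1)), axis=2))
    d = np.sum(x1 * x1) + np.sum(x2 * x2) - 2 * c
    return np.exp(-1 / sigma ** 2 * np.abs(d) / d.size)

def train(x, y, sigma, lam):
    k = kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alphaf, x, z, sigma):
    k = kernel_correlation(z, x, sigma)
    return np.real(np.fft.ifft2(alphaf * np.fft.fft2(k)))

def gaussian_labels(h, w, s=2.0):
    """Regression target with its peak at index (0, 0), wrapping cyclically."""
    dy = np.minimum(np.arange(h), h - np.arange(h))[:, None]
    dx = np.minimum(np.arange(w), w - np.arange(w))[None, :]
    return np.exp(-(dy ** 2 + dx ** 2) / (2 * s ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))            # stand-in for a feature patch
alphaf = train(x, gaussian_labels(32, 32), sigma=0.5, lam=1e-4)
z = np.roll(x, shift=(5, -3), axis=(0, 1))      # same patch, cyclically shifted
response = detect(alphaf, x, z, sigma=0.5)
peak = np.unravel_index(np.argmax(response), response.shape)
print(peak)
```

With the argument order used on the slide (`kernel_correlation(z, x, sigma)`), the response peak appears at minus the applied shift modulo the patch size, here (27, 3) for a shift of (5, -3). A tracker built on this sketch has to account for that sign when converting the peak into a displacement.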
From KCF to Discriminative CF trackers
Basic
• Henriques et al. – CSK
– raw grayscale pixel values as features
• Henriques et al. – KCF
– HoG multi-channel features
Further work
• Danelljan et al. – DSST
– PCA-HoG + grayscale pixel features
– filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
– HoG, color-naming and grayscale pixel features
– quantize scale space and normalize each scale to one size by bilinear interpolation → only one filter on normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. – SRDCF
– spatial regularization in the learning process
→ limits boundary effect
→ penalizes filter coefficients depending on their spatial location
– allows the use of a much larger search region
– more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
– features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
– for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. – MDNet, T-CNN
– CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? “The zero-order tracker”
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• + VOT-TIR papers (69 and 70 coauthors)
[Figure: plot with axis ticks 1500 and 3000 and markers at the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines]
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
           Perf. measures     Dataset size      Target box   Property    Trackers tested
VOT2013    ranks, A, R        16, s. manual     manual       per frame   27
VOT2014    ranks, A, R, EFO   25, s. manual     manual       per frame   38
VOT2015    EAO, A, R, EFO     60, fully auto    manual       per frame   62 VOT, 24 VOT-TIR
VOT2016    EAO, A, R, EFO     60, fully auto    auto         per frame   70 VOT, 24 VOT-TIR
Class of trackers tested
• Single-object, single-camera
• Short-term
– Trackers performing without re-detection
• Causality
– Tracker is not allowed to use any future frames
• No prior knowledge about the target
– Only a single training example – BBox in the first frame
• Object state encoded by a bounding box
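Because the object state is a bounding box, a tracker's per-frame output is usually compared with the ground truth through region overlap (intersection over union). The sketch below assumes axis-aligned boxes given as (x, y, width, height). The function name `overlap` is illustrative, and the challenge's own toolkit may handle more general regions.

```python
def overlap(box_a, box_b):
    """Intersection over union of two axis-aligned (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    # Width and height of the intersection rectangle (zero if the boxes are disjoint).
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

ground_truth = (0.0, 0.0, 10.0, 10.0)
prediction = (5.0, 5.0, 10.0, 10.0)
print(overlap(ground_truth, prediction))
```

Here the boxes share a 5 x 5 region out of a union of 175 square pixels, so the overlap is 25/175. Identical boxes give 1 and disjoint boxes give 0.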
Construction (1/3): Sequence candidates
• Candidate pool: ALOV (315 seq.) [Smeulders et al., 2013] + PTR (~50 seq.) [Vojir et al., 2013] + OTB (~50 seq.) [Wu et al., 2013] + >30 new sequences from VOT2015 committee = 443 sequences
• Filtered out:
– Grayscale sequences
– <400 pixels targets
– Poorly-defined targets
– Artificially created sequences
[Figure examples: poorly defined target; artificially created sequence]
• After filtering: 356 sequences
• VOT Automatic Dataset Construction Protocol: cluster + sample
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences
Construction (3/3): Sampling
• Requirement:
– Diverse visual attributes
– Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
1. Segment the target (semi-automatic)
2. Automatically fit a bounding box by optimizing a cost function
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty – better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
• Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (axis: Accuracy)]
VOT 2016: Tracking speed
• Top-performers slowest
– Plausible cause: CNN
• Real-time bound: Staple+
– Decent accuracy
– Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: Submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
– target learning (modelling, “template update”)
– integration of detection and temporal smoothness assumptions
– representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 Academics (3 full, 2 associate, 7 assistant professors), 15 Researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please?
61150
AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo
Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016
Recap CaffeNet architecture
63
Recent trackers development
74150httpsgithubcomfoolwoodbenchmark_results
CNN KCF
Discriminative Tracking (T by Detection)
t=0
samples
labels+1 +1 +1 minus1 minus1 minus1
Classifier
tgt0
hellipClassify subwindows to find target
75
Connection to Correlation
The Convolution Theorem
ldquoCross-correlation is equivalent to an
element-wise product in Fourier domainrdquo
bull where
ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858
(likewise for ො119857 and ෝ119856)
ndashtimes is element-wise product
ndash lowast is complex-conjugate (ie negate imaginary part)
119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺
bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)
76
Kernelized Correlation Filters
bull Circulant matrices are a very general tool which allows to replace
standard operations with fast Fourier operations
bull The same idea can by applied eg to the Kernel Ridge Regression
with K kernel matrix Kij = (xi xj) and dual space representation
bull For many kernels circulant data circulant K matrix
bull Diagonalizing with the DFT for learning the classifier yields
120630 = 119870 + 120582119868 minus1119858
119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)
ෝ120630 =ො119858
መ119844119857119857+ 120582
Fast solution in 119978 119899 log 119899 Typical kernel algorithms are
119978 1198992 or higher
rArr
77
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
bull very few
hyperparameters
bull code fits on one slide
of the presentation
bull Use HoG features
(32 channels)
bull ~300 FPS
bull Open-Source
(MatlabPythonJavaC)
function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)
end
function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))
end
function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))
end
Training and detection (Matlab)
Sum over channel dimensionin kernel computation
79
From KCF to Discriminative CF trackers
Basic
bull Henriques et al ndash CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
bull Danelljan et al ndashSRDCF
ndash spatial regularization in the learning process
rarr limits boundary effect
rarr penalize filter coefficients depending on their spatial location
ndash allows to use much larger search region
ndash more discriminative to background (more training data)
CNN-based Correlation Trackers
bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT
2016)
bull Ma et al
ndash features VGG-Net pretrained on ImageNet dataset extracted from
third fourth and fifth convolution layer
ndash for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
bull Nam et al ndash MDNet T-CNN
ndash CNN classification (3 convolution layers and 2 fully connected layers)
81
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT challenge evolution
bull Gradual increase of dataset size
bull Gradual refinement of dataset construction
bull Gradual refinement of performance measures
bull Gradual increase of tested trackers
Perf Measures Dataset size Target box Property Trackers tested
VOT2013 ranks A R 16 s manual manual per frame 27
VOT2014 ranks A R EFO 25 s manual manual per frame 38
VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR
VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR
8739
Class of trackers tested
bull Single-object single-camera
bull Short-term
ndashTrackers performing without re-detection
bull Causality
ndashTracker is not allowed to use any future frames
bull No prior knowledge about the target
ndashOnly a single training example ndash BBox in the first frame
bull Object state encoded by a bounding box
88150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (13) Sequence candidates
ALOV (315 seq)[Smeulders et al2013]
Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences
Example Poorly defined target Example Artificially created
356 sequences
PTR (~50 seq)[Vojir et al2013]
+OTB (~50 seq)
[Wu et al2013]
+
gt30 new sequencesfrom VOT2015
committee
+
443sequences
VOT Automatic Dataset Construction Protocol
cluster + sample
8939
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (23) Clustering
bull Approximately annotate targets
bull 11 global attributes estimated
automatically for 356 sequences (eg blur camera motion object motion)
bull Cluster into K = 28 groups (automatic selection of K)
Feature encoding
11 dim
Affinity Propagation [Frey Dueck 2007]
Cluster similar sequences
9039
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (33) Sampling
bull Requirement
bull Diverse visual attributes
bull Challenging subset
bull Global visual attributes computed
bull Tracking difficulty attribute Applied FoT ASMS KCF trackers
bull Developed a sampling strategy that sampled
challenging sequences while keeping the global
attributes diverse
9139
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT201516 dataset 60 sequences
9239
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Object annotation
bull Automatic bounding box placement
1 Segment the target (semi-automatic)
2 Automatically fit a bounding box by optimizing a cost function
9339
Kristan et al VOT2016 results
Sequence ranking
bull Among the most challenging sequences
bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)
9442
Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)
Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT Results Realtime
bull Flow-based Mean Shift-based Correlation filters
bull Engineering use of basic features
2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)
2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)
ASMSBDF
FoT
2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)
PLTFoT
CCMS
9639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Results
bull C-COT slightly ahead of TCNN
bull Most accurate SSAT
bull Most robust C-COT and MLDF
Overlap curves
9742
(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple
C-COT
TCNNSSAT
MLDF
AR-raw plot
Acc
ura
cy
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Tracking speed
bull Top-performers slowest bull Plausible cause CNN
bull Real-time bound Staple+
bull Decent accuracy
bull Decent robustness
Note the speed in some Matlab trackers has been significantly underestimated by the toolkit since it was measuring also the Matlabrestart time The EFOs of Matlab trackers are in fact higher than stated in this figure
9842
C-COT TCNNSSAT MLDF
Staple+
Staple+
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers
  – target learning (modelling, "template update")
  – integration of detection and temporal smoothness assumptions
  – representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University Prague
established 1707
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Recap: CaffeNet architecture
AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012
CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016
Recent trackers development
https://github.com/foolwood/benchmark_results
[Figure labels: CNN, KCF]
Discriminative Tracking (Tracking by Detection)
• t = 0: samples with labels (+1 +1 +1 −1 −1 −1) are used to train a classifier
• t > 0: classify subwindows to find the target
Connection to Correlation
The Convolution Theorem
"Cross-correlation is equivalent to an element-wise product in Fourier domain"
  $\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$
• where
  – $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  – $\times$ is the element-wise product
  – $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
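A minimal sketch of the statement above, in 1-D and using only the Python standard library. The index convention y[k] = Σ_t x[t]·w[(t + k) mod n] and the helper names are assumptions of this sketch, and the naive O(n²) `dft` merely stands in for a real FFT. It compares the direct cyclic cross-correlation with the Fourier-domain product conj(x̂) × ŵ.

```python
import cmath

def dft(v, inverse=False):
    """Naive O(n^2) DFT; a stand-in for an FFT so the sketch needs no extra packages."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [c / n for c in out] if inverse else out

def xcorr_direct(x, w):
    """Cyclic cross-correlation: y[k] = sum_t x[t] * w[(t + k) mod n]."""
    n = len(x)
    return [sum(x[t] * w[(t + k) % n] for t in range(n)) for k in range(n)]

def xcorr_fourier(x, w):
    """Same quantity via y_hat = conj(x_hat) * w_hat (element-wise), then inverse DFT."""
    y_hat = [a.conjugate() * b for a, b in zip(dft(x), dft(w))]
    return [c.real for c in dft(y_hat, inverse=True)]

x = [1.0, 2.0, 3.0, 4.0]
w = [0.0, 1.0, 0.0, 0.0]
print(xcorr_direct(x, w))
print([round(v, 6) for v in xcorr_fourier(x, w)])
```

Both calls agree up to floating-point error. The wrap-around in `(t + k) % n` is the cyclic behaviour noted on the slide.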
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows replacing standard operations with fast Fourier operations
• The same idea can be applied, e.g., to Kernel Ridge Regression, with kernel matrix $K$, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and dual space representation
  $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$
• For many kernels, circulant data $\Rightarrow$ circulant $K$ matrix
  $K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
  $\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$
  $\Rightarrow$ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
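The closed form above can be checked numerically on a toy 1-D signal. The sketch below assumes a linear kernel (so the kernel auto-correlation is the plain cyclic auto-correlation), uses a naive O(n²) `dft` in place of an FFT, and builds the circulant matrix explicitly only to verify the result. All function names are hypothetical.

```python
import cmath

def dft(v, inverse=False):
    """Naive O(n^2) DFT standing in for an FFT (standard library only)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [c / n for c in out] if inverse else out

def train_linear(x, y, lam):
    """Solve (K + lam*I) alpha = y in the Fourier domain, with K = C(k_xx)."""
    n = len(x)
    # Linear-kernel auto-correlation: k[m] = sum_t x[t] * x[(t + m) mod n].
    k = [sum(x[t] * x[(t + m) % n] for t in range(n)) for m in range(n)]
    k_hat, y_hat = dft(k), dft(y)
    # Element-wise division replaces the matrix inverse.
    alpha_hat = [yh / (kh + lam) for yh, kh in zip(y_hat, k_hat)]
    alpha = [a.real for a in dft(alpha_hat, inverse=True)]
    return alpha, k

def residual(alpha, k, y, lam):
    """Largest |((K + lam*I) alpha - y)_i| with the circulant K built explicitly."""
    n = len(k)
    K = [[k[(j - i) % n] for j in range(n)] for i in range(n)]
    return max(abs(sum(K[i][j] * alpha[j] for j in range(n)) + lam * alpha[i] - y[i])
               for i in range(n))

x = [1.0, 2.0, 0.0, -1.0]
y = [1.0, 0.5, 0.0, 0.5]      # regression targets, peaked at zero shift
lam = 0.1
alpha, k = train_linear(x, y, lam)
print(residual(alpha, k, y, lam))
```

The printed residual is at floating-point noise level. With a real FFT the division step costs O(n log n), whereas forming and inverting K costs at least O(n²).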
Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
  $k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$
• For the Gaussian kernel it yields
  $\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features
  – multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term
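The Gaussian kernel correlation, the detection step and the interpolation update can be sketched on a 1-D toy signal as follows. This is an illustrative sketch, not the authors' implementation. The argument order of `gaussian_correlation`, the shift convention, the interpolation `rate`, and the toy data are all assumptions, and the naive `dft` stands in for an FFT. Swapping the two arguments mirrors the response.

```python
import cmath
import math

def dft(v, inverse=False):
    """Naive O(n^2) DFT standing in for an FFT (standard library only)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [c / n for c in out] if inverse else out

def gaussian_correlation(x, z, sigma):
    """k[m] = exp(-(|x|^2 + |z|^2 - 2*c[m]) / sigma^2), c[m] = sum_t x[t] * z[t + m]."""
    xz_hat = [a.conjugate() * b for a, b in zip(dft(x), dft(z))]
    c = [v.real for v in dft(xz_hat, inverse=True)]
    norms = sum(v * v for v in x) + sum(v * v for v in z)
    # max(..., 0) guards against tiny negative distances caused by rounding.
    return [math.exp(-max(norms - 2.0 * cm, 0.0) / sigma ** 2) for cm in c]

def train(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat + lam), element-wise."""
    k_hat = dft(gaussian_correlation(x, x, sigma))
    return [yh / (kh + lam) for yh, kh in zip(dft(y), k_hat)]

def detect(alpha_hat, x, z, sigma):
    """Response for all cyclic shifts at once: f(z) = F^-1(k_hat_xz * alpha_hat)."""
    kz_hat = dft(gaussian_correlation(x, z, sigma))
    return [v.real for v in dft([k * a for k, a in zip(kz_hat, alpha_hat)], inverse=True)]

def interpolate(old, new, rate):
    """Linear-interpolation update; `rate` is a hypothetical learning-rate parameter."""
    return [(1.0 - rate) * o + rate * v for o, v in zip(old, new)]

x = [0.0, 3.0, 1.0, 4.0, 2.0, 6.0, 5.0, 7.0]                  # toy model signal
n = len(x)
y = [math.exp(-0.5 * min(i, n - i) ** 2) for i in range(n)]   # labels peaked at shift 0
alpha_hat = train(x, y, sigma=1.0, lam=1e-3)

shift = 3
z = x[-shift:] + x[:-shift]                # new "frame": z[t] = x[(t - shift) mod n]
response = detect(alpha_hat, x, z, sigma=1.0)
peak = max(range(n), key=lambda i: response[i])
print(peak)
```

Under this convention the response peaks at the cyclic shift that was applied to the signal. A tracker would then re-train at the new location and blend the old and new model with `interpolate`.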
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• Use HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)
Training and detection (Matlab):
function alphaf = train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);   % element-wise division
end
function y = detect(alphaf, x, z, sigma)
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));       % response map over all shifts
end
function k = kernel_correlation(x1, x2, sigma)
  % sum over the channel dimension (3rd) in the kernel computation
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end
From KCF to Discriminative CF trackers
Basic
• Henriques et al. – CSK
  – raw grayscale pixel values as features
• Henriques et al. – KCF
  – HoG multi-channel features
Further work
• Danelljan et al. – DSST
  – PCA-HoG + grayscale pixel features
  – filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
  – HoG, color-naming and grayscale pixel features
  – quantize scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al. – SRDCF
  – spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  – allows the use of a much larger search region
  – more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  – features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  – for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. – MDNet, T-CNN
  – CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? "The zero-order tracker"
VOT community evolution
• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• Plus two VOT-TIR papers (69 and 70 coauthors)
[Figure: timeline plot (axis ticks 1500 and 3000) with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked]
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
Challenge | Perf. measures   | Dataset size   | Target box | Property  | Trackers tested
VOT2013   | ranks, A, R      | 16, s. manual  | manual     | per frame | 27
VOT2014   | ranks, A, R, EFO | 25, s. manual  | manual     | per frame | 38
VOT2015   | EAO, A, R, EFO   | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016   | EAO, A, R, EFO   | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
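As a rough illustration of the A (accuracy) and R (robustness) columns: accuracy is based on the overlap between the predicted and ground-truth boxes, and robustness on the number of tracking failures. The sketch below is a simplification and not the VOT toolkit. It assumes axis-aligned boxes given as (x, y, w, h), treats zero overlap as a failure, and ignores the re-initialization protocol. The function names are hypothetical.

```python
def overlap(a, b):
    """Intersection-over-union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # intersection width
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # intersection height
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over frames with non-zero overlap, and the count of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    failures = sum(1 for o in overlaps if o == 0)
    return accuracy, failures

predicted = [(0, 0, 2, 2), (0, 0, 2, 2), (10, 10, 1, 1)]
ground_truth = [(0, 0, 2, 2), (1, 1, 2, 2), (0, 0, 2, 2)]
print(accuracy_and_failures(predicted, ground_truth))
```

In the toy data the tracker is exact in the first frame, partially overlapping in the second, and lost in the third, so one failure is counted and the accuracy averages the first two overlaps.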
Class of trackers tested
• Single-object, single-camera
• Short-term
  – Trackers performing without re-detection
• Causality
  – Tracker is not allowed to use any future frames
• No prior knowledge about the target
  – Only a single training example – BBox in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
ALOV (315 seq.) [Smeulders et al., 2013] + PTR (~50 seq.) [Vojir et al., 2013] + OTB (~50 seq.) [Wu et al., 2013] + >30 new sequences from the VOT2015 committee = 443 sequences
Filtered out:
• Grayscale sequences
• <400 pixels targets
• Poorly-defined targets
• Artificially created sequences
[Figure: example of a poorly defined target; example of an artificially created sequence]
→ 356 sequences
VOT Automatic Dataset Construction Protocol: cluster + sample
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences
Construction (3/3): Sampling
• Requirement:
  – Diverse visual attributes
  – Challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
Impact amp Excellencesignificant of funding (gt75) from EU and private companies collaboration with high-tech companies (Microsoft Google Toyota VW Daimler Samsung Boeing Hitachi Honeywell) numerous science prizes hundreds of ISI citations to our publications per year competitive results in contests
Visual Recognition Group (headJiri Matas)
We are looking for 1-3
good students for PhD
THANK YOU
Questions please
105150
Connection to Correlation
The Convolution Theorem
"Cross-correlation is equivalent to an element-wise product in the Fourier domain"
$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$
• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
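The theorem can be checked numerically on a short 1-D signal. The sketch below is an illustration only: Python and the helper names (`dft`, `cyclic_cross_correlation`, `cross_correlation_fourier`) are assumptions, not part of the slides, and a naive O(n^2) DFT stands in for the FFT so that nothing beyond the standard library is needed. It assumes real-valued signals.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def cyclic_cross_correlation(x, w):
    """Direct definition: y[i] = sum_j x[j] * w[(j + i) mod n]."""
    n = len(x)
    return [sum(x[j] * w[(j + i) % n] for j in range(n)) for i in range(n)]


def cross_correlation_fourier(x, w):
    """Same result through the Fourier domain: y_hat = conj(x_hat) * w_hat."""
    x_hat, w_hat = dft(x), dft(w)
    y_hat = [a.conjugate() * b for a, b in zip(x_hat, w_hat)]
    return [v.real for v in dft(y_hat, inverse=True)]


x = [1.0, 2.0, 3.0, 4.0]
w = [0.0, 1.0, 0.0, 0.0]
direct = cyclic_cross_correlation(x, w)    # [2.0, 1.0, 4.0, 3.0]
fourier = cross_correlation_fourier(x, w)  # equal up to rounding error
```

Because `w` is a shifted impulse, the output is a cyclically shifted copy of `x`, which makes the wrap-around at the signal edges visible.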
Kernelized Correlation Filters
• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual-space representation $\boldsymbol{\alpha}$
• For many kernels, circulant data ⇒ circulant $K$ matrix
• Diagonalizing with the DFT for learning the classifier yields
$$\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y} \quad\Rightarrow\quad \hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$$
$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
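A small numerical check of the closed-form solution, as a sketch under stated assumptions: Python, the helper names, the toy signal and the parameter values are all illustrative, and a naive DFT replaces the FFT. The code solves for alpha in the Fourier domain and then verifies that it satisfies the original linear system (K + lambda I) alpha = y, with K built explicitly as the circulant matrix whose first row is the kernel auto-correlation.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def gaussian_autocorrelation(x, sigma):
    """k[i] = exp(-||x - shift_i(x)||^2 / sigma^2) for every cyclic shift i."""
    n = len(x)
    k = []
    for i in range(n):
        shifted = x[i:] + x[:i]
        dist2 = sum((a - b) ** 2 for a, b in zip(x, shifted))
        k.append(math.exp(-dist2 / sigma ** 2))
    return k


def train(x, y, sigma, lam):
    """Kernel ridge regression solved element-wise in the Fourier domain."""
    k = gaussian_autocorrelation(x, sigma)
    alpha_hat = [yf / (kf + lam) for kf, yf in zip(dft(k), dft(y))]
    alpha = [v.real for v in dft(alpha_hat, inverse=True)]
    return alpha, k


def apply_regularized_kernel(k, alpha, lam):
    """Multiply alpha by (K + lam * I), where K = C(k) has first row k."""
    n = len(k)
    return [sum(k[(j - i) % n] * alpha[j] for j in range(n)) + lam * alpha[i]
            for i in range(n)]


x = [0.0, 1.0, 3.0, 2.0, 0.5, -1.0]   # toy 1-D "image patch"
y = [1.0, 0.6, 0.1, 0.0, 0.1, 0.6]    # regression targets, peak at shift 0
alpha, k = train(x, y, sigma=2.0, lam=0.1)
reconstructed = apply_regularized_kernel(k, alpha, 0.1)
residual = max(abs(a - b) for a, b in zip(reconstructed, y))
```

The residual is at the level of floating-point rounding, so the element-wise division reproduces the matrix solution without ever inverting K.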
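Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$, with elements $k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$, where $P$ is the cyclic shift (permutation) matrix
• For the Gaussian kernel it yields
$$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^{2}}\left(\|\mathbf{x}\|^{2} + \|\mathbf{x}'\|^{2} - 2\,\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$$
Multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}}\right)$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features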
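The Gaussian kernel correlation can be checked against its definition. The sketch below is illustrative only (Python, the helper names and the test vectors are assumptions; a naive DFT replaces the FFT). It evaluates the kernel for every cyclic shift once through the Fourier-domain formula and once by explicitly shifting x and measuring the squared distance to x'. Indices are 0-based, so entry i corresponds to the shift P^i.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def kernel_correlation_fourier(x, x_prime, sigma):
    """exp(-(|x|^2 + |x'|^2 - 2 F^-1(conj(x_hat) * x'_hat)) / sigma^2)."""
    x_hat, xp_hat = dft(x), dft(x_prime)
    product = [a.conjugate() * b for a, b in zip(x_hat, xp_hat)]
    cross = [v.real for v in dft(product, inverse=True)]
    norms = sum(v * v for v in x) + sum(v * v for v in x_prime)
    # max(..., 0) guards against tiny negative distances caused by rounding
    return [math.exp(-max(norms - 2.0 * c, 0.0) / sigma ** 2) for c in cross]


def kernel_correlation_direct(x, x_prime, sigma):
    """Definition: entry i is kappa(x', P^i x) with P a cyclic shift."""
    n = len(x)
    result = []
    for i in range(n):
        shifted = [x[(m - i) % n] for m in range(n)]   # P^i x
        dist2 = sum((a - b) ** 2 for a, b in zip(x_prime, shifted))
        result.append(math.exp(-dist2 / sigma ** 2))
    return result


x = [0.2, 1.0, 0.4, -0.3, 0.0, 0.7]
x_prime = [1.0, 0.5, -0.2, 0.3, 0.9, 0.1]
fast = kernel_correlation_fourier(x, x_prime, sigma=1.5)
slow = kernel_correlation_direct(x, x_prime, sigma=1.5)
max_difference = max(abs(a - b) for a, b in zip(fast, slow))
```

The direct version costs O(n^2) kernel evaluations, while the Fourier version needs only three transforms, which is where the speed of the tracker comes from.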
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• Uses HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)
Training and detection (Matlab):
    function alphaf = train(x, y, sigma, lambda)
        k = kernel_correlation(x, x, sigma);
        alphaf = fft2(y) ./ (fft2(k) + lambda);
    end
    function y = detect(alphaf, x, z, sigma)
        k = kernel_correlation(z, x, sigma);
        y = real(ifft2(alphaf .* fft2(k)));
    end
    function k = kernel_correlation(x1, x2, sigma)
        % sum over the channel dimension (3) in the kernel computation
        c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
        d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
        k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
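To see training and detection work together, the listing can be mirrored on a 1-D toy signal. This is a sketch under stated assumptions, not the authors' code: it is written in Python with a naive DFT, uses a single channel, drops the `numel(d)` normalization, and the signal, target shape and parameter values are made up for illustration. The test patch `z` is the model `x` advanced cyclically by three samples. With the argument order used in `detect` above, the response peaks at index 3, while the opposite shift direction would put the peak at n - 3.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                for t in range(n))
        out.append(s / n if inverse else s)
    return out


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation, same argument order as the listing."""
    f1, f2 = dft(x1), dft(x2)
    c = [v.real for v in dft([a.conjugate() * b for a, b in zip(f1, f2)],
                             inverse=True)]
    norms = sum(v * v for v in x1) + sum(v * v for v in x2)
    return [math.exp(-abs(norms - 2.0 * ci) / sigma ** 2) for ci in c]


def train(x, y, sigma, lam):
    """Returns the classifier in the Fourier domain (alphaf)."""
    kf = dft(kernel_correlation(x, x, sigma))
    return [yf / (k + lam) for yf, k in zip(dft(y), kf)]


def detect(alphaf, x, z, sigma):
    """Response of the classifier for every cyclic shift of z."""
    kf = dft(kernel_correlation(z, x, sigma))
    return [v.real for v in dft([a * k for a, k in zip(alphaf, kf)],
                                inverse=True)]


n = 8
x = [0.2, 1.0, 0.4, -0.3, 0.0, 0.7, -0.8, 0.1]
# Regression target: Gaussian bump peaked at shift 0, wrapped cyclically.
y = [math.exp(-0.5 * min(i, n - i) ** 2) for i in range(n)]
alphaf = train(x, y, sigma=1.0, lam=1e-2)

shift = 3
z = x[shift:] + x[:shift]                # z[j] = x[(j + shift) % n]
response = detect(alphaf, x, z, sigma=1.0)
peak = max(range(n), key=lambda i: response[i])   # index of the maximum
```

In a tracker the peak index is converted to a displacement of the bounding box, and the model and classifier are then blended with those learned at the new position.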
From KCF to Discriminative CF trackers
Basic
• Henriques et al.: CSK
  - raw grayscale pixel values as features
• Henriques et al.: KCF
  - HoG multi-channel features
Further work
• Danelljan et al.: DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al.: SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size
Discriminative Correlation Filters Trackers
• Danelljan et al.: SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows the use of a much larger search region
  - more discriminative to background (more training data)
CNN-based Correlation Trackers
• Danelljan et al.: Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al.: MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works? "The zero-order tracker"
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
[Figure: timeline with the VOT2013 to VOT2016 submission deadlines marked; axis ticks at 1500 and 3000]
Results paper | Coauthors | Pages
ICCV 2013 | 51 | 14
ECCV 2014 | 57 | 27
ICCV 2015 | 128 | 24
ECCV 2016 | 141 | 44
+ two VOT-TIR papers (69 and 70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
Challenge | Perf. measures | Dataset size | Target box | Property | Trackers tested
VOT2013 | ranks, A, R | 16, s. manual | manual | per frame | 27
VOT2014 | ranks, A, R, EFO | 25, s. manual | manual | per frame | 38
VOT2015 | EAO, A, R, EFO | 60, fully auto | manual | per frame | 62 VOT, 24 VOT-TIR
VOT2016 | EAO, A, R, EFO | 60, fully auto | auto | per frame | 70 VOT, 24 VOT-TIR
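In the table, A and R stand for accuracy and robustness: in the VOT methodology, accuracy is the average overlap between the predicted and ground-truth boxes while the tracker is on target, and robustness counts tracking failures. The sketch below is a simplified illustration, not the VOT toolkit. It assumes axis-aligned boxes given as (x, y, width, height), treats a frame with zero overlap as a failure, and ignores the re-initialization the real protocol performs after each failure. The function names and example boxes are hypothetical.

```python
def overlap(a, b):
    """Intersection-over-union of two axis-aligned boxes (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over on-target frames and number of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


ground_truth = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
# Overlaps are 1, 1/3 and 0, so accuracy is 2/3 with one failure.
```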
Class of trackers tested
• Single-object, single-camera
• Short-term
  - Trackers performing without re-detection
• Causality
  - Tracker is not allowed to use any future frames
• No prior knowledge about the target
  - Only a single training example: BBox in the first frame
• Object state encoded by a bounding box
Construction (1/3): Sequence candidates
ALOV (315 seq.) [Smeulders et al. 2013] + PTR (~50 seq.) [Vojir et al. 2013] + OTB (~50 seq.) [Wu et al. 2013] + >30 new sequences from the VOT2015 committee → 443 sequences
Filtered out:
• Grayscale sequences
• Targets of <400 pixels
• Poorly-defined targets
• Artificially created sequences
[Figure: example of a poorly defined target; example of an artificially created sequence]
Remaining 356 sequences → VOT Automatic Dataset Construction Protocol (cluster + sample)
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Pipeline: feature encoding (11-dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences
Construction (3/3): Sampling
• Requirement:
  - Diverse visual attributes
  - Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
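The slides do not spell out the sampling strategy. One hypothetical way to favor hard sequences while keeping every cluster represented is a round-robin over clusters that always takes the hardest remaining sequence. The sketch below illustrates that idea only and is not the VOT protocol. The function name, the tuple layout and the difficulty scores are assumptions.

```python
def sample_sequences(sequences, quota):
    """Pick `quota` names from (name, cluster_id, difficulty) tuples.

    Hypothetical strategy: visit clusters in turn and take the hardest
    remaining sequence from each, so no cluster is skipped.
    """
    by_cluster = {}
    for name, cluster, difficulty in sequences:
        by_cluster.setdefault(cluster, []).append((difficulty, name))
    for members in by_cluster.values():
        members.sort(reverse=True)            # hardest first
    chosen = []
    while len(chosen) < quota and any(by_cluster.values()):
        for cluster in sorted(by_cluster):
            if by_cluster[cluster] and len(chosen) < quota:
                chosen.append(by_cluster[cluster].pop(0)[1])
    return chosen


candidates = [
    ("a", 0, 0.9), ("b", 0, 0.8),   # cluster 0: two hard sequences
    ("c", 1, 0.2), ("d", 1, 0.5),   # cluster 1
    ("e", 2, 0.1),                  # cluster 2: a single easy sequence
]
selected = sample_sequences(candidates, quota=4)
# First pass takes "a", "d", "e" (one per cluster); second pass adds "b".
```

A purely difficulty-ranked selection would have dropped cluster 2 entirely, which is the kind of attribute bias the requirement on diversity is meant to avoid.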
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (accuracy axis)]
VOT 2016 Tracking speed
• Top performers are the slowest
  - Plausible cause: CNN
• Real-time bound: Staple+
  - Decent accuracy
  - Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it also measured the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: tracking speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University Prague
established 1707
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Kernelized Correlation Filters
bull Circulant matrices are a very general tool which allows to replace
standard operations with fast Fourier operations
bull The same idea can by applied eg to the Kernel Ridge Regression
with K kernel matrix Kij = (xi xj) and dual space representation
bull For many kernels circulant data circulant K matrix
bull Diagonalizing with the DFT for learning the classifier yields
120630 = 119870 + 120582119868 minus1119858
119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)
ෝ120630 =ො119858
መ119844119857119857+ 120582
Fast solution in 119978 119899 log 119899 Typical kernel algorithms are
119978 1198992 or higher
rArr
77
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
bull very few
hyperparameters
bull code fits on one slide
of the presentation
bull Use HoG features
(32 channels)
bull ~300 FPS
bull Open-Source
(MatlabPythonJavaC)
function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)
end
function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))
end
function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))
end
Training and detection (Matlab)
Sum over channel dimensionin kernel computation
79
From KCF to Discriminative CF trackers
Basic
bull Henriques et al ndash CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
bull Danelljan et al ndashSRDCF
ndash spatial regularization in the learning process
rarr limits boundary effect
rarr penalize filter coefficients depending on their spatial location
ndash allows to use much larger search region
ndash more discriminative to background (more training data)
CNN-based Correlation Trackers
bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT
2016)
bull Ma et al
ndash features VGG-Net pretrained on ImageNet dataset extracted from
third fourth and fifth convolution layer
ndash for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
bull Nam et al ndash MDNet T-CNN
ndash CNN classification (3 convolution layers and 2 fully connected layers)
81
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT challenge evolution
bull Gradual increase of dataset size
bull Gradual refinement of dataset construction
bull Gradual refinement of performance measures
bull Gradual increase of tested trackers
Perf Measures Dataset size Target box Property Trackers tested
VOT2013 ranks A R 16 s manual manual per frame 27
VOT2014 ranks A R EFO 25 s manual manual per frame 38
VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR
VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR
8739
Class of trackers tested
bull Single-object single-camera
bull Short-term
ndashTrackers performing without re-detection
bull Causality
ndashTracker is not allowed to use any future frames
bull No prior knowledge about the target
ndashOnly a single training example ndash BBox in the first frame
bull Object state encoded by a bounding box
88150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (13) Sequence candidates
ALOV (315 seq)[Smeulders et al2013]
Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences
Example Poorly defined target Example Artificially created
356 sequences
PTR (~50 seq)[Vojir et al2013]
+OTB (~50 seq)
[Wu et al2013]
+
gt30 new sequencesfrom VOT2015
committee
+
443sequences
VOT Automatic Dataset Construction Protocol
cluster + sample
8939
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (23) Clustering
bull Approximately annotate targets
bull 11 global attributes estimated
automatically for 356 sequences (eg blur camera motion object motion)
bull Cluster into K = 28 groups (automatic selection of K)
Feature encoding
11 dim
Affinity Propagation [Frey Dueck 2007]
Cluster similar sequences
9039
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (33) Sampling
bull Requirement
bull Diverse visual attributes
bull Challenging subset
bull Global visual attributes computed
bull Tracking difficulty attribute Applied FoT ASMS KCF trackers
bull Developed a sampling strategy that sampled
challenging sequences while keeping the global
attributes diverse
9139
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT201516 dataset 60 sequences
9239
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Object annotation
bull Automatic bounding box placement
1 Segment the target (semi-automatic)
2 Automatically fit a bounding box by optimizing a cost function
9339
Kristan et al VOT2016 results
Sequence ranking
bull Among the most challenging sequences
bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)
9442
Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)
Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT Results Realtime
bull Flow-based Mean Shift-based Correlation filters
bull Engineering use of basic features
2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)
2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)
ASMSBDF
FoT
2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)
PLTFoT
CCMS
9639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Results
bull C-COT slightly ahead of TCNN
bull Most accurate SSAT
bull Most robust C-COT and MLDF
Overlap curves
9742
(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple
C-COT
TCNNSSAT
MLDF
AR-raw plot
Acc
ura
cy
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Tracking speed
bull Top-performers slowest bull Plausible cause CNN
bull Real-time bound Staple+
bull Decent accuracy
bull Decent robustness
Note the speed in some Matlab trackers has been significantly underestimated by the toolkit since it was measuring also the Matlabrestart time The EFOs of Matlab trackers are in fact higher than stated in this figure
9842
C-COT TCNNSSAT MLDF
Staple+
Staple+
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT public resources
bull Resources publicly available VOT page
bull Raw results of all tested trackers
bull Relevant methodology papers
bull 2016 Submitted trackers codebinaries
bull All fully annotated datasets (2013-2016)
bull Documentation tutorials forum
9939
Summary
bull ldquoVisual Trackingrdquo may refer to quite different problems
bull The area is just starting to be affected by CNNs
bull Robustness at all levels is the road to reliable performance
bull Key components of trackers
ndash target learning (modelling ldquotemplate updaterdquo)
ndash integration of detection and temporal smoothness assumptions
ndash representation of the image and target
bull Be careful when evaluating tracking results
100150
Computer vision courses online
bull httpscwfelcvutczwikicoursesucuws17start UCU Winter
school course
bull httpcsbrowneducoursescs143
bull httpswwwudacitycomcourseintroduction-to-computer-vision-
-ud810
bull httpcs231nstanfordedu
bull httpvisionstanfordeduteachingcs131_fall1415indexhtml
101150
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University Prague
established 1707
Area of Expertise Computer Vision Image Processing Pattern Recognition
People12 Academics 3 full 2 associate 7 assistant professors15 Researchers15 Phd students 15 MSc and BSc students
Impact amp Excellencesignificant of funding (gt75) from EU and private companies collaboration with high-tech companies (Microsoft Google Toyota VW Daimler Samsung Boeing Hitachi Honeywell) numerous science prizes hundreds of ISI citations to our publications per year competitive results in contests
Visual Recognition Group (headJiri Matas)
We are looking for 1-3
good students for PhD
THANK YOU
Questions please
105150
Kernelized Correlation Filters
bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo
bull For Gaussian kernel it yields
119844119857119857prime = exp minus1
2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime
bull Evaluation on subwindows of image z with classifier 120630 and model x
1 119870119859 = 119862 119844119857119859
2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514
bull Update classifier 120630 and model x by linear interpolation from the
location of maximum response f(z)
bull Kernel allows integration of more complex and multi-channel
features
119896119894119857119857prime = (119857prime 119875119894minus1119857)
multiple channels can be concatenated to the vector x and then sum over in this term
78
Kernelized Correlation Filters
KCF Tracker
bull very few
hyperparameters
bull code fits on one slide
of the presentation
bull Use HoG features
(32 channels)
bull ~300 FPS
bull Open-Source
(MatlabPythonJavaC)
function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)
end
function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))
end
function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))
end
Training and detection (Matlab)
Sum over channel dimensionin kernel computation
79
From KCF to Discriminative CF trackers
Basic
bull Henriques et al ndash CSK
ndash raw grayscale pixel values as features
bull Henriques et al ndash KCF
ndash HoG multi-channel features
Further work
bull Danelljan et al ndash DSST
ndash PCA-HoG + grayscale pixels features
ndash filters for translation and for scale (in the scale-space pyramid)
bull Li et al ndash SAMF
ndash HoG color-naming and grayscale pixels features
ndash quantize scale space and normalize each scale to one size by bilinear
inter rarr only one filter on normalized size
80
Discriminative Correlation Filters Trackers
bull Danelljan et al ndashSRDCF
ndash spatial regularization in the learning process
rarr limits boundary effect
rarr penalize filter coefficients depending on their spatial location
ndash allows to use much larger search region
ndash more discriminative to background (more training data)
CNN-based Correlation Trackers
bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT
2016)
bull Ma et al
ndash features VGG-Net pretrained on ImageNet dataset extracted from
third fourth and fifth convolution layer
ndash for each feature learn a linear correlation filter
CNN-based Trackers (not correlation based)
bull Nam et al ndash MDNet T-CNN
ndash CNN classification (3 convolution layers and 2 fully connected layers)
81
Evaluation of Trackers
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT challenge evolution
bull Gradual increase of dataset size
bull Gradual refinement of dataset construction
bull Gradual refinement of performance measures
bull Gradual increase of tested trackers
Perf Measures Dataset size Target box Property Trackers tested
VOT2013 ranks A R 16 s manual manual per frame 27
VOT2014 ranks A R EFO 25 s manual manual per frame 38
VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR
VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR
8739
Class of trackers tested
bull Single-object single-camera
bull Short-term
ndashTrackers performing without re-detection
bull Causality
ndashTracker is not allowed to use any future frames
bull No prior knowledge about the target
ndashOnly a single training example ndash BBox in the first frame
bull Object state encoded by a bounding box
88150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (13) Sequence candidates
ALOV (315 seq)[Smeulders et al2013]
Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences
Example Poorly defined target Example Artificially created
356 sequences
PTR (~50 seq)[Vojir et al2013]
+OTB (~50 seq)
[Wu et al2013]
+
gt30 new sequencesfrom VOT2015
committee
+
443sequences
VOT Automatic Dataset Construction Protocol
cluster + sample
8939
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (23) Clustering
bull Approximately annotate targets
bull 11 global attributes estimated
automatically for 356 sequences (eg blur camera motion object motion)
bull Cluster into K = 28 groups (automatic selection of K)
Feature encoding
11 dim
Affinity Propagation [Frey Dueck 2007]
Cluster similar sequences
9039
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (33) Sampling
bull Requirement
bull Diverse visual attributes
bull Challenging subset
bull Global visual attributes computed
bull Tracking difficulty attribute Applied FoT ASMS KCF trackers
bull Developed a sampling strategy that sampled
challenging sequences while keeping the global
attributes diverse
9139
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT201516 dataset 60 sequences
9239
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Object annotation
bull Automatic bounding box placement
1 Segment the target (semi-automatic)
2 Automatically fit a bounding box by optimizing a cost function
9339
Kristan et al VOT2016 results
Sequence ranking
bull Among the most challenging sequences
bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)
9442
Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)
Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT Results Realtime
bull Flow-based Mean Shift-based Correlation filters
bull Engineering use of basic features
2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)
2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)
ASMSBDF
FoT
2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)
PLTFoT
CCMS
9639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Results
bull C-COT slightly ahead of TCNN
bull Most accurate SSAT
bull Most robust C-COT and MLDF
Overlap curves
9742
(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple
C-COT
TCNNSSAT
MLDF
AR-raw plot
Acc
ura
cy
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT 2016 Tracking speed
bull Top-performers slowest bull Plausible cause CNN
bull Real-time bound Staple+
bull Decent accuracy
Kernelized Correlation Filters
KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)
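Training and detection (Matlab)
function alphaf = train(x, y, sigma, lambda)
  % x: features of the training patch, y: desired (Gaussian-shaped) response
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);   % ridge regression, solved per frequency
end
function y = detect(alphaf, x, z, sigma)
  % z: features of the new search patch, x: learned template
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));       % response map over all cyclic shifts
end
function k = kernel_correlation(x1, x2, sigma)
  % Gaussian kernel evaluated for all cyclic shifts at once
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));   % sum over the channel dimension
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end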
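The train/detect pair can be tried out on a toy signal. The following is a minimal Python sketch under strong simplifying assumptions that are not in the slides: a 1-D, single-channel signal, a linear kernel instead of the Gaussian one, and a naive O(N^2) DFT from the standard library instead of fft2. The helper names (dft, circular_correlation, gaussian_labels) are illustrative only.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(N^2) discrete Fourier transform; enough for a toy example."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(v * cmath.exp(sign * 2j * math.pi * k * t / n)
                  for t, v in enumerate(values))
        out.append(acc / n if inverse else acc)
    return out


def circular_correlation(a, b):
    """c[m] = sum_t a[t] * b[(t + m) mod N]: the linear kernel for every cyclic shift."""
    n = len(a)
    return [sum(a[t] * b[(t + m) % n] for t in range(n)) for m in range(n)]


def train(x, y, lam):
    """Dual ridge-regression coefficients, computed element-wise in the Fourier domain."""
    k_hat = dft(circular_correlation(x, x))
    return [y_k / (k_k + lam) for y_k, k_k in zip(dft(y), k_hat)]


def detect(alpha_hat, x, z):
    """Filter response for all cyclic shifts of the new signal z."""
    k_hat = dft(circular_correlation(x, z))
    response = dft([a * k for a, k in zip(alpha_hat, k_hat)], inverse=True)
    return [r.real for r in response]


def gaussian_labels(n, sigma=1.0):
    """Regression target peaked at shift 0, using wrap-around distance."""
    return [math.exp(-min(t, n - t) ** 2 / (2 * sigma ** 2)) for t in range(n)]


x = [0.0, 1.0, 3.0, 7.0, 2.0, 0.0, 0.0, 1.0]   # toy 1-D "template"
shift = 3
z = x[-shift:] + x[:-shift]                     # the template moved right by 3 samples
alpha_hat = train(x, gaussian_labels(len(x)), lam=1e-4)
response = detect(alpha_hat, x, z)
estimated_shift = max(range(len(response)), key=response.__getitem__)
```

The index of the response peak is the estimated displacement of the target. In a real tracker the same arithmetic runs on 2-D multi-channel feature maps with an FFT.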
From KCF to Discriminative CF trackers
Basic
• Henriques et al. - CSK
  - raw grayscale pixel values as features
• Henriques et al. - KCF
  - HoG multi-channel features
Further work
• Danelljan et al. - DSST
  - PCA-HoG + grayscale pixel features
  - separate filters for translation and for scale (in the scale-space pyramid)
• Li et al. - SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter, on the normalized size
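The SAMF-style scale search described above can be sketched as follows. This is a hedged illustration, not the authors' code: the function names, the default step of 1.02 and the pool of 7 scales are assumptions chosen for the example, and the actual cropping, bilinear resizing and filtering of each patch are left out.

```python
def scale_pool(step=1.02, count=7):
    """Scale factors step**k, centred on 1.0 (count is assumed to be odd)."""
    half = (count - 1) // 2
    return [step ** k for k in range(-half, half + 1)]


def candidate_sizes(target_w, target_h, scales):
    """Patch sizes to crop around the target; each crop is then resized
    to one fixed template size so that a single filter can be applied."""
    return [(round(target_w * s), round(target_h * s)) for s in scales]


def best_scale(scales, peak_responses):
    """Pick the scale whose resized patch produced the strongest filter response."""
    best = max(range(len(scales)), key=peak_responses.__getitem__)
    return scales[best]


# Usage with deliberately coarse, made-up numbers:
scales = scale_pool(step=2.0, count=3)
sizes = candidate_sizes(40, 20, scales)
chosen = best_scale(scales, [0.2, 0.5, 0.9])
```

DSST differs in that it learns a second, dedicated filter along the scale dimension instead of re-running the translation filter on every resized patch.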
Discriminative Correlation Filters Trackers
• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows the use of a much larger search region
  - more discriminative against the background (more training data)
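To make the idea of a location-dependent penalty concrete, here is a small Python sketch. It assumes a quadratic penalty map that is small at the patch centre and grows towards the border; the quadratic shape, the parameter names base and growth and their default values are assumptions for illustration, not values taken from the slides.

```python
def penalty_map(height, width, target_h, target_w, base=0.1, growth=3.0):
    """Weight per filter coefficient: small near the centre, large near the border."""
    cy, cx = (height - 1) / 2, (width - 1) / 2
    return [[base + growth * (((r - cy) / target_h) ** 2 + ((c - cx) / target_w) ** 2)
             for c in range(width)]
            for r in range(height)]


def regularization_cost(filter_coeffs, weights):
    """Spatially weighted L2 penalty: sum of (w * f)^2 over all coefficients."""
    return sum((w * f) ** 2
               for w_row, f_row in zip(weights, filter_coeffs)
               for w, f in zip(w_row, f_row))


weights = penalty_map(5, 5, target_h=2, target_w=2)

# The same unit coefficient is cheap at the centre and expensive in a corner.
centre_filter = [[1.0 if (r, c) == (2, 2) else 0.0 for c in range(5)] for r in range(5)]
corner_filter = [[1.0 if (r, c) == (0, 0) else 0.0 for c in range(5)] for r in range(5)]
centre_cost = regularization_cost(centre_filter, weights)
corner_cost = regularization_cost(corner_filter, weights)
```

Because coefficients far from the target are expensive, the learned filter concentrates on the target region even when the training patch covers a much larger search area.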
CNN-based Correlation Trackers
• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter
CNN-based Trackers (not correlation based)
• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)
Evaluation of Trackers
Tracking: Which methods work?
What works: “The zero-order tracker”
Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016
VOT community evolution
[Figure: timeline plot with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked]
• ICCV2013: 51 coauthors, 14 pages
• ECCV2014: 57 coauthors, 27 pages
• ICCV2015: 128 coauthors, 24 pages
• ECCV2016: 141 coauthors, 44 pages
• plus two VOT-TIR papers (69 and 70 coauthors)
VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers
          Perf. measures     Dataset size       Target box   Property    Trackers tested
VOT2013   ranks, A, R        16 (s. manual)     manual       per frame   27
VOT2014   ranks, A, R, EFO   25 (s. manual)     manual       per frame   38
VOT2015   EAO, A, R, EFO     60 (fully auto)    manual       per frame   62 VOT, 24 VOT-TIR
VOT2016   EAO, A, R, EFO     60 (fully auto)    auto         per frame   70 VOT, 24 VOT-TIR
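In the VOT methodology, A stands for accuracy (average overlap with the ground truth) and R for robustness (number of failures). The sketch below shows how these two quantities can be computed from per-frame boxes. It is a simplification: the real protocol re-initialises the tracker after each failure and skips some frames around the reset, which is omitted here, and the (x, y, w, h) box format and function names are assumptions.

```python
def overlap(a, b):
    """Intersection over union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)


def accuracy_and_failures(predicted, ground_truth):
    """Return (mean overlap over frames with non-zero overlap,
    number of frames where the overlap dropped to zero)."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
```

Reporting the two numbers separately keeps a tracker that is precise but fragile apart from one that is sloppy but rarely loses the target.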
Class of trackers tested
• Single-object, single-camera
• Short-term
  - trackers performing without re-detection
• Causality
  - the tracker is not allowed to use any future frames
• No prior knowledge about the target
  - only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
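These constraints amount to a very small tracker interface. The sketch below is a hypothetical illustration in Python, not the VOT toolkit API: the class name, the init/update method names and the (x, y, w, h) box format are all assumptions. The trivial tracker simply repeats its initial box, which is enough to show the calling convention.

```python
class StaticBoxTracker:
    """Trivial baseline: always reports the box it was initialised with."""

    def init(self, first_frame, bbox):
        # The only supervision: one bounding box (x, y, w, h) in the first frame.
        self.bbox = tuple(bbox)

    def update(self, frame):
        # Causal: only the current frame and the internal state are available.
        return self.bbox


def run_short_term(tracker, frames, first_bbox):
    """Feed frames strictly in order, so the tracker never sees future frames."""
    tracker.init(frames[0], first_bbox)
    return [tuple(first_bbox)] + [tracker.update(frame) for frame in frames[1:]]


frames = ["frame0", "frame1", "frame2"]   # placeholders for image data
boxes = run_short_term(StaticBoxTracker(), frames, (1, 2, 3, 4))
```

Any tracker in the evaluated class fits this shape: one box in, one box out per frame, with no access to later frames.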
Construction (1/3): Sequence candidates
VOT Automatic Dataset Construction Protocol: cluster + sample
• Candidate pool: 443 sequences
  - ALOV (315 seq.) [Smeulders et al. 2013]
  - PTR (~50 seq.) [Vojir et al. 2013]
  - OTB (~50 seq.) [Wu et al. 2013]
  - >30 new sequences from the VOT2015 committee
• Filtered out
  - grayscale sequences
  - targets smaller than 400 pixels
  - poorly defined targets
  - artificially created sequences
• After filtering: 356 sequences
[Figure: examples of a poorly defined target and of an artificially created sequence]
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Pipeline: feature encoding (11-dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences
Construction (3/3): Sampling
• Requirements
  - diverse visual attributes
  - challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
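The slides do not spell out the sampling strategy, so the following Python sketch shows only one plausible reading of it: take the hardest sequences from every cluster, so that each attribute cluster stays represented. The function name, the per_cluster parameter and the toy difficulty scores are assumptions.

```python
def sample_sequences(clusters, difficulty, per_cluster=1):
    """clusters: {cluster_id: [sequence names]};
    difficulty: {sequence name: score}, where a higher score means harder to track."""
    selected = []
    for cluster_id in sorted(clusters):
        # Hardest sequences first within the cluster.
        ranked = sorted(clusters[cluster_id], key=lambda name: -difficulty[name])
        selected.extend(ranked[:per_cluster])
    return selected


clusters = {0: ["a", "b"], 1: ["c", "d", "e"]}
difficulty = {"a": 0.1, "b": 0.9, "c": 0.5, "d": 0.2, "e": 0.7}
chosen = sample_sequences(clusters, difficulty)
```

Picking within clusters rather than globally is what keeps the selection diverse: a global top-N by difficulty could draw every sequence from a single cluster.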
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
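Step 2 can be illustrated with a toy cost function. The slides do not state the cost that was actually optimized, so the one below is an assumption: it penalises foreground pixels left outside the box (weighted by a hypothetical alpha) plus background pixels enclosed by it, and it searches axis-aligned boxes by brute force on a small binary mask.

```python
from itertools import product


def fit_box(mask, alpha=1.0):
    """Return the box (top, left, bottom, right), inclusive, that minimises
    alpha * (foreground pixels outside the box) + (background pixels inside it)."""
    rows, cols = len(mask), len(mask[0])
    total_fg = sum(map(sum, mask))
    best_cost, best_box = None, None
    for top, left in product(range(rows), range(cols)):
        for bottom, right in product(range(top, rows), range(left, cols)):
            fg_inside = sum(mask[r][c]
                            for r in range(top, bottom + 1)
                            for c in range(left, right + 1))
            area = (bottom - top + 1) * (right - left + 1)
            cost = alpha * (total_fg - fg_inside) + (area - fg_inside)
            if best_cost is None or cost < best_cost:
                best_cost, best_box = cost, (top, left, bottom, right)
    return best_box


# A 2x3 blob plus one stray foreground pixel (e.g. a segmentation error).
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
box = fit_box(mask)
```

With this cost the stray pixel is ignored, because enclosing it would pull five background pixels into the box. A plain min/max bounding box over the mask would have included it.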
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix (A_f = 0.33, M_f = 57), Rabbit (A_f = 0.31, M_f = 43), Butterfly (A_f = 0.22, M_f = 45)
• Among the easiest sequences: Singer1 (A_f = 0.02, M_f = 4), Octopus (A_f = 0.01, M_f = 5), Sheep (A_f = 0.02, M_f = 15)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
• Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (accuracy axis) with C-COT, TCNN, SSAT and MLDF labelled]
VOT 2016: Tracking speed
• Top performers are the slowest
  - plausible cause: CNN
• Real-time bound: Staple+
  - decent accuracy
  - decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ labelled]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers
  - target learning (modelling, “template update”)
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague
established 1707
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please
Tracking Which methods work
83150
Tracking Which methods work
84150
What works ldquoThe zero-order trackerrdquo
85150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT community evolution
3000
1500
51 Coauthors 14pgs
ICCV2013
57 Coauthors 27pgs
ECCV2014
128 Coauthors 24pgs
ICCV2015
141 Coauthors 44pgs
ECCV2016
+ VOT-TIRpaper
(69 coauth)
+ VOT-TIRpaper
(70 coauth)
VOT2014submission deadline
VOT2015submission deadline
VOT2016submission deadline
VOT2013submission deadline
8639
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT challenge evolution
bull Gradual increase of dataset size
bull Gradual refinement of dataset construction
bull Gradual refinement of performance measures
bull Gradual increase of tested trackers
Perf Measures Dataset size Target box Property Trackers tested
VOT2013 ranks A R 16 s manual manual per frame 27
VOT2014 ranks A R EFO 25 s manual manual per frame 38
VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR
VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR
8739
Class of trackers tested
bull Single-object single-camera
bull Short-term
ndashTrackers performing without re-detection
bull Causality
ndashTracker is not allowed to use any future frames
bull No prior knowledge about the target
ndashOnly a single training example ndash BBox in the first frame
bull Object state encoded by a bounding box
Construction (1/3): Sequence candidates
• Sources:
  - ALOV (315 seq) [Smeulders et al. 2013]
  - PTR (~50 seq) [Vojir et al. 2013]
  - OTB (~50 seq) [Wu et al. 2013]
  - >30 new sequences from the VOT2015 committee
• Total: 443 sequences
• Filtered out:
  - Grayscale sequences
  - <400 pixels targets
  - Poorly-defined targets
  - Artificially created sequences
• Remaining: 356 sequences
• VOT Automatic Dataset Construction Protocol: cluster + sample
[Figure: examples of a poorly defined target and of an artificially created sequence]
Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
• Pipeline: feature encoding (11 dim) -> Affinity Propagation [Frey, Dueck 2007] -> clusters of similar sequences
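The slide names Affinity Propagation but gives no details, so the following is only a minimal pure-Python sketch of the standard message-passing updates (responsibilities and availabilities with damping). It assumes negative squared Euclidean distance as the similarity and the median similarity as the shared preference; both are illustrative choices, not something the slides state. The number of clusters is not passed in: it emerges from the preference, which is what "automatic selection of K" refers to.

```python
from statistics import median


def affinity_propagation(points, damping=0.5, iterations=200):
    """Cluster `points` (tuples of numbers); returns (exemplar indices, labels)."""
    n = len(points)
    # Similarity: negative squared Euclidean distance (assumed here).
    s = [[-sum((a - b) ** 2 for a, b in zip(p, q)) for q in points] for p in points]
    # Shared preference on the diagonal: median of off-diagonal similarities.
    pref = median(s[i][k] for i in range(n) for k in range(n) if i != k)
    for k in range(n):
        s[k][k] = pref
    r = [[0.0] * n for _ in range(n)]  # responsibilities
    a = [[0.0] * n for _ in range(n)]  # availabilities
    for _ in range(iterations):
        # r(i,k) <- s(i,k) - max over k' != k of (a(i,k') + s(i,k'))
        for i in range(n):
            vals = [a[i][k] + s[i][k] for k in range(n)]
            best = max(range(n), key=vals.__getitem__)
            first = vals[best]
            second = max(v for k, v in enumerate(vals) if k != best)
            for k in range(n):
                new = s[i][k] - (second if k == best else first)
                r[i][k] = damping * r[i][k] + (1 - damping) * new
        # a(i,k) <- min(0, r(k,k) + sum of positive r(i',k) for i' not in {i,k})
        # a(k,k) <- sum of positive r(i',k) for i' != k
        for k in range(n):
            pos = [max(0.0, r[i][k]) for i in range(n)]
            total = sum(pos) - pos[k]
            for i in range(n):
                new = total if i == k else min(0.0, r[k][k] + total - pos[i])
                a[i][k] = damping * a[i][k] + (1 - damping) * new
    exemplars = [k for k in range(n) if r[k][k] + a[k][k] > 0]
    if not exemplars:  # fallback: take the single strongest candidate
        exemplars = [max(range(n), key=lambda k: r[k][k] + a[k][k])]
    labels = [i if i in exemplars else max(exemplars, key=lambda k: s[i][k])
              for i in range(n)]
    return exemplars, labels
```

In the VOT setting each point would be the 11-dimensional attribute vector of one sequence; the sketch needs at least two points.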
Construction (3/3): Sampling
• Requirement
  - Diverse visual attributes
  - Challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
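The slides do not spell out the sampling strategy itself. One simple way to meet both requirements, shown here purely as an assumption, is to visit the clusters in round-robin order and take the hardest remaining sequence from each, so every cluster contributes before any cluster contributes twice. The tuple layout and the `difficulty` score (for example, derived from how often the reference trackers fail) are hypothetical.

```python
from collections import defaultdict


def sample_challenging(sequences, total):
    """sequences: iterable of (name, cluster_id, difficulty); returns chosen names."""
    by_cluster = defaultdict(list)
    for name, cluster, difficulty in sequences:
        by_cluster[cluster].append((difficulty, name))
    for items in by_cluster.values():
        items.sort(reverse=True)  # hardest sequence first
    chosen = []
    # Round-robin over clusters keeps the selection diverse.
    while len(chosen) < total and any(by_cluster.values()):
        for cluster in sorted(by_cluster):
            if by_cluster[cluster] and len(chosen) < total:
                chosen.append(by_cluster[cluster].pop(0)[1])
    return chosen
```

With 28 clusters and a target of 60 sequences, such a scheme would take two or three sequences from most clusters, fewer where a cluster is small.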
VOT2015/16 dataset: 60 sequences
Object annotation
• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
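Step 2 can be illustrated with a small brute-force search. The cost used below is an assumption, since the slide does not give the exact cost function: the number of foreground pixels left outside the box plus `alpha` times the number of background pixels inside it. The function name and the inclusive (x0, y0, x1, y1) box format are likewise illustrative. An integral image makes each candidate box cheap to score.

```python
def fit_box(mask, alpha=0.5):
    """mask: list of rows of 0/1. Returns the best box as inclusive (x0, y0, x1, y1)."""
    h, w = len(mask), len(mask[0])
    # Integral image: integ[y][x] = number of foreground pixels in mask[:y][:x].
    integ = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            integ[y + 1][x + 1] = (mask[y][x] + integ[y][x + 1]
                                   + integ[y + 1][x] - integ[y][x])
    total_fg = integ[h][w]
    best, best_cost = None, float("inf")
    for y0 in range(h):
        for y1 in range(y0, h):
            for x0 in range(w):
                for x1 in range(x0, w):
                    fg_in = (integ[y1 + 1][x1 + 1] - integ[y0][x1 + 1]
                             - integ[y1 + 1][x0] + integ[y0][x0])
                    area = (y1 - y0 + 1) * (x1 - x0 + 1)
                    # Assumed cost: missed foreground + alpha * enclosed background.
                    cost = (total_fg - fg_in) + alpha * (area - fg_in)
                    if cost < best_cost:
                        best, best_cost = (x0, y0, x1, y1), cost
    return best
```

The weighting matters: with a stray foreground pixel far from the object, a small `alpha` would stretch the box to include it, while a larger `alpha` keeps the box tight around the main blob.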
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix (A_f = 0.33, M_f = 57), Rabbit (A_f = 0.31, M_f = 43), Butterfly (A_f = 0.22, M_f = 45)
• Among the easiest sequences: Singer1 (A_f = 0.02, M_f = 4), Octopus (A_f = 0.01, M_f = 5), Sheep (A_f = 0.02, M_f = 15)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
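The slide does not define the A_f and M_f scores used in the sequence ranking. The sketch below assumes the reading used in the VOT reports, as far as I recall it: A_f is the average number of trackers that fail per frame, and M_f is the largest number of trackers failing in any single frame. The function name and the input format are hypothetical.

```python
def sequence_difficulty(failures_per_frame):
    """failures_per_frame[t] = number of trackers that failed at frame t."""
    a_f = sum(failures_per_frame) / len(failures_per_frame)  # average per frame
    m_f = max(failures_per_frame)                            # worst single frame
    return a_f, m_f
```

Under this reading, a sequence can have a low average and still contain one frame where many trackers fail at once, which is why both numbers are shown.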
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering, use of basic features
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
• Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figure: overlap curves and AR-raw plot (accuracy axis), with C-COT, TCNN, SSAT and MLDF labelled]
VOT 2016: Tracking speed
• Top performers are the slowest
  - Plausible cause: CNN
• Real-time bound: Staple+
  - Decent accuracy
  - Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
[Figure: tracking speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ labelled]
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: Submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)
• Area of expertise: Computer Vision, Image Processing, Pattern Recognition
• People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
• Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions, please?
Class of trackers tested
bull Single-object single-camera
bull Short-term
ndashTrackers performing without re-detection
bull Causality
ndashTracker is not allowed to use any future frames
bull No prior knowledge about the target
ndashOnly a single training example ndash BBox in the first frame
bull Object state encoded by a bounding box
88150
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (13) Sequence candidates
ALOV (315 seq)[Smeulders et al2013]
Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences
Example Poorly defined target Example Artificially created
356 sequences
PTR (~50 seq)[Vojir et al2013]
+OTB (~50 seq)
[Wu et al2013]
+
gt30 new sequencesfrom VOT2015
committee
+
443sequences
VOT Automatic Dataset Construction Protocol
cluster + sample
8939
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (23) Clustering
bull Approximately annotate targets
bull 11 global attributes estimated
automatically for 356 sequences (eg blur camera motion object motion)
bull Cluster into K = 28 groups (automatic selection of K)
Feature encoding
11 dim
Affinity Propagation [Frey Dueck 2007]
Cluster similar sequences
9039
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Construction (33) Sampling
bull Requirement
bull Diverse visual attributes
bull Challenging subset
bull Global visual attributes computed
bull Tracking difficulty attribute Applied FoT ASMS KCF trackers
bull Developed a sampling strategy that sampled
challenging sequences while keeping the global
attributes diverse
9139
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT201516 dataset 60 sequences
9239
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
Object annotation
bull Automatic bounding box placement
1 Segment the target (semi-automatic)
2 Automatically fit a bounding box by optimizing a cost function
Kristan et al., VOT2016 results
Sequence ranking
• Among the most challenging sequences: Matrix (A_f = 0.33, M_f = 57), Rabbit (A_f = 0.31, M_f = 43), Butterfly (A_f = 0.22, M_f = 45)
• Among the easiest sequences: Singer1 (A_f = 0.02, M_f = 4), Octopus (A_f = 0.01, M_f = 5), Sheep (A_f = 0.02, M_f = 15)
Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation
VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
(Figures: overlap curves; AR-raw plot with an Accuracy axis)
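The two quantities behind these rankings, accuracy (overlap) and robustness (failures), can be sketched as follows. This is a simplified illustration under assumptions: boxes are axis-aligned `(x, y, width, height)` tuples, a failure is counted whenever the overlap is zero, and the re-initialization of the tracker after a failure that the VOT methodology uses is left out.

```python
def iou(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_robustness(predicted, ground_truth):
    """Return (mean overlap on tracked frames, number of zero-overlap frames)."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


# Hypothetical three-frame sequence: perfect, shifted, lost.
ground_truth = [(0, 0, 2, 2), (0, 0, 2, 2), (0, 0, 2, 2)]
predicted = [(0, 0, 2, 2), (1, 0, 2, 2), (5, 5, 2, 2)]
accuracy, failures = accuracy_robustness(predicted, ground_truth)
print(accuracy, failures)
```

Keeping the two numbers separate is what makes statements such as "most accurate: SSAT" and "most robust: C-COT and MLDF" possible for different trackers.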
VOT 2016 Tracking speed
• Top performers are the slowest
  – Plausible cause: CNN
• Real-time bound: Staple+
  – Decent accuracy
  – Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
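The effect described in the note is plain arithmetic: if a fixed startup time is counted as tracking time, the reported speed drops. The sketch below uses made-up timings and a hypothetical helper `frames_per_second`; it does not reproduce how the toolkit measures time or computes EFO.

```python
def frames_per_second(n_frames, total_seconds, startup_seconds=0.0):
    """Speed after removing a fixed startup overhead from the measured time."""
    tracking_seconds = total_seconds - startup_seconds
    if tracking_seconds <= 0:
        raise ValueError("startup time must be smaller than the total time")
    return n_frames / tracking_seconds


# Hypothetical run: 100 frames measured at 20 s, of which 10 s were interpreter startup.
measured = frames_per_second(100, 20.0)         # startup counted as tracking time
corrected = frames_per_second(100, 20.0, 10.0)  # startup excluded
print(measured, corrected)  # 5.0 10.0
```

With these assumed numbers the measured speed is half of the actual one, which is the kind of underestimate the note warns about.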
VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: Submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum
Summary
• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers
  – target learning (modelling, “template update”)
  – integration of detection and temporal smoothness assumptions
  – representation of the image and target
• Be careful when evaluating tracking results
Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html
A bit of self-PR
Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)
Area of Expertise: Computer Vision, Image Processing, Pattern Recognition
People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students
Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests
Visual Recognition Group (head: Jiri Matas)
We are looking for 1-3 good students for PhD
THANK YOU
Questions please
Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016
VOT Results Realtime
bull Flow-based Mean Shift-based Correlation filters
bull Engineering use of basic features
2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)
2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)
ASMSBDF
FoT
2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)
PLTFoT
CCMS
9639
VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
[Figures: overlap curves; AR-raw plot (accuracy axis)]
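The accuracy and robustness notions behind these plots can be illustrated with a minimal sketch. It assumes axis-aligned boxes given as (x, y, w, h), takes accuracy as the mean overlap on frames where the tracker still touches the target, and counts a frame with zero overlap as a failure. The function names `iou` and `accuracy_robustness` are hypothetical, and the real VOT toolkit differs in details (for example, it re-initializes the tracker after a failure), so treat this only as an illustration.

```python
def iou(a, b):
    """Overlap (intersection over union) of two axis-aligned boxes (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the intersection rectangle, clamped at zero.
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_robustness(predicted, ground_truth):
    """Return (mean overlap over tracked frames, number of failure frames).

    Simplifying assumption: a failure is any frame with zero overlap.
    """
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


if __name__ == "__main__":
    gt = [(0, 0, 10, 10)] * 3
    pred = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
    print(accuracy_robustness(pred, gt))
```

In this toy sequence the tracker is perfect on the first frame, half-shifted on the second and lost on the third, so a single mean overlap would hide the failure. Reporting the two numbers separately is one reason to be careful when comparing trackers by one score.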
VOT 2016 Tracking speed
• Top performers are the slowest
  - Plausible cause: CNN
• Real-time bound: Staple+
  - Decent accuracy
  - Decent robustness
Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
VOT public resources
• Resources publicly available: VOT page
  - Raw results of all tested trackers
  - Relevant methodology papers
  - 2016: submitted trackers code/binaries
  - All fully annotated datasets (2013-2016)
  - Documentation, tutorials, forum