Visual Object Tracking: review

Date post: 16-Mar-2018
Upload: dmytro-mishkin
Visual Object Tracking @ Belka

Dmytro Mishkin

Center for Machine Perception

Czech Technical University in Prague

ducha.aiki@gmail.com

Kyiv, Ukraine

2017

My background

• PhD student at the Czech Technical University in Prague
  Now working full-time in deep learning
  Recent paper "All you need is a good init" added to the Stanford CS231n course
• Ex-CTO of Clear Research (clear.sx), 2014-2017
• Co-founder of Szkocka Research Group, a Ukrainian open research community for computer science
  https://www.facebook.com/groups/839064726190162
  Currently supervising one project related to local feature learning
• Reviewer for the most impactful computer vision journals, TPAMI and IJCV

Lecture is heavily based on the tutorial "Visual Tracking" by Jiri Matas

"… Although tracking itself is by and large solved problem"
-- Jianbo Shi & Carlo Tomasi, CVPR 1994

Application domains of Visual Tracking

• monitoring, assistance, surveillance, control, defense
• robotics, autonomous car driving, rescue
• measurements: medicine, sport, biology, meteorology
• human-computer interaction
• augmented reality
• film production and postproduction: motion capture, editing
• management of video content: indexing, search
• action and activity recognition
• image stabilization
• mobile applications
• camera "tracking"

Slide credit: Jiri Matas

Applications, applications, applications, …

Slide credit: Jiri Matas

Tracking Applications …

- Team sports: game analysis, player statistics, video annotation, …

Slide credit: Jiri Matas

Sport examples

http://cvlab.epfl.ch/~lepetit
http://www.dartfish.com/en/media-gallery/videos/index.htm

Slide credit: Patrick Perez

Model-based Tracking: People and Faces

http://cvlab.epfl.ch/research/completed/realtime_tracking
http://www.cs.brown.edu/~black/3Dtracking.html

Slide credit: Patrick Perez

Tracking is commonly used in practice…

• Tracking has been a popular research topic for decades: see CVPR, ICCV, ECCV, …
• But there is no online course devoted to tracking…
• nor much coverage in computer vision courses
• nor is it well covered in textbooks

Is it clear what tracking is?

Video credit: Helmut Grabner
Slide credit: Jiri Matas

Tracking Formulation - Literature

Surprisingly little is said about tracking in standard textbooks.
Limited to optic flow, plus some basic trackers, e.g. Lucas-Kanade.

Definition (0):
[Forsyth and Ponce, Computer Vision: A Modern Approach, 2003]
"Tracking is the problem of generating an inference about the motion of an object given a sequence of images.
Good solutions of this problem have a variety of applications…"

Slide credit: Jiri Matas

Formulation (1): Tracking

Establishing point-to-point correspondences in consecutive frames of an image sequence

Notes:
• The concept of an "object" in the F&P definition disappeared
• If an algorithm correctly established such correspondences, would that be a perfect tracker?
• tracking = motion estimation

Slide credit: Jiri Matas

Tracking is Motion Estimation / Optic Flow

Slide credit: Jiri Matas

Motion "pattern", camera tracking
Dense motion field
http://www.cs.cmu.edu/~saada/Projects/CrowdSegmentation
http://www.youtube.com/watch?v=ckVQrwYIjAs
Sparse motion field estimate

Slide credit: Jiri Matas

Motion field
• The motion field is the projection of the 3D scene motion into the image

Slide credit: James Hays

Tracking is Motion Estimation / Optic Flow

Slide credit: Jason Corso

Optic Flow

Standard formulation:
• At every pixel, a 2D displacement is estimated between consecutive frames

Missing:
• occlusion - disocclusion handling: pixels visible in one image only
  - in the standard formulation, "don't know" is not an answer
• considering the 3D nature of the world
• large displacement handling - only recently addressed (EpicFlow 2015)

Practical issues hindering progress in optic flow:
• is the ground truth ever known?
  - learning and performance evaluation problematic (synthetic sequences?)
• requires generic regularization (smoothing)
• failure (assumption validity) not easy to detect

In certain applications, tracking is motion estimation on a part of the image with specific constraints: augmented reality, sports analysis

Formulation (1): Tracking

Establishing point-to-point correspondences in consecutive frames of an image sequence

Notes:
• The concept of an "object" in the F&P definition disappeared
• If an algorithm correctly established such correspondences, would that be a perfect tracker?
• tracking = motion estimation

Consider the Bolt sequence

Definition (2): Tracking

Given an initial estimate of its position, locate X in a sequence of images.

Where X may mean:
• A (rectangular) region
• An "interest point" and its neighbourhood
• An "object"

This definition is adopted e.g. in a recent book by Maggio and Cavallaro, Video Tracking, 2011

Smeulders, T-PAMI13:
Tracking is the analysis of video sequences for the purpose of establishing the location of the target over a sequence of frames (time), starting from the bounding box given in the first frame.

Formulation (3): Tracking as Segmentation

J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010

Tracking as model-based segmentation

Tracking as segmentation

http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
http://www2.imm.dtu.dk/~aam/tracking
• heart

A "standard" CV tracking method output

Approximate motion estimation, approximate segmentation.
Neither good optic flow nor precise segmentation required.

Formulation (4): Tracking

Given an initial estimate of the pose and state of X:
In all images in a sequence (in a causal manner)
1. estimate the pose and state of X
2. (optionally) update the model of X

• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
  - a priori information (appearance, shape, dynamics, …)
  - labeled data ("track this") + unlabeled data = the sequences
• Causal: for estimation at T, use information from time t ≤ T
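The two-step formulation above can be read as a simple loop over frames. The sketch below is only an illustration: the function names `run_causal_tracker`, `estimate` and `update`, and the toy 1-D "brightest pixel near the previous position" estimator, are assumptions introduced here and do not come from the lecture.

```python
def run_causal_tracker(frames, init_pose, estimate, update=None):
    """Process frames strictly in order, so frame T only sees data from t <= T."""
    pose, model = init_pose, None
    trajectory = [init_pose]
    for frame in frames[1:]:
        # Step 1: estimate the pose (here just a position) in the new frame.
        pose = estimate(frame, pose, model)
        # Step 2 (optional): update the target model with the new observation.
        if update is not None:
            model = update(model, frame, pose)
        trajectory.append(pose)
    return trajectory


def brightest_near(frame, prev_pos, model, radius=2):
    """Toy estimator: index of the brightest sample within `radius` of prev_pos."""
    lo = max(0, prev_pos - radius)
    hi = min(len(frame), prev_pos + radius + 1)
    return max(range(lo, hi), key=lambda i: frame[i])


# Four 1-D "frames" with a bright target moving to the right.
frames = [
    [0, 9, 0, 0, 0, 0],
    [0, 0, 9, 0, 0, 0],
    [0, 0, 0, 0, 9, 0],
    [0, 0, 0, 0, 0, 9],
]
trajectory = run_causal_tracker(frames, init_pose=1, estimate=brightest_near)
```

Passing `update=None` corresponds to a tracker that never adapts its model; a real tracker would plug a learning step in there.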

Tracking for Black Mirror Blocking

Tracking in 6D

Tracking-Learning-Detection (TLD)

A "miracle": Tracking a Transparent Object

Video credit: Helmut Grabner
H. Grabner, H. Bischof, On-line boosting and vision, CVPR 2006

Tracking the "Invisible"

H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010

video

Other Tracking Problems

… multiple object tracking …

Multi-object Tracking

Other Tracking Problems

Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg
splitting and merging events …

So I want to track

Just link to some computer vision lib

import cv2

or

#include <opencv2/core.hpp>

or

import org.opencv.core.Core;
import org.opencv.core.Mat;

What is here in libraries: OpenCV

• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib

What is here in libraries: BoofCV

• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)

https://github.com/lessthanoptimal/BoofCV

Bad news: OpenCV

http://www.votchallenge.net

Reference implementation

But authors often publish their own implementation on GitHub…

git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news

Good news? Not so

Good news: easy to contribute to open source
Just implement some modern tracking algorithm

So we need to understand how tracking works

Classic: KLT tracker

The KLT Tracker

Slide credit: Kris Kitani

KLT Tracker

Slide credit: Tomas Svoboda

Importance in Computer Vision

• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms

(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment

Slide credit: Tomas Svoboda

Goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.
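To make the alignment goal concrete, here is a minimal 1-D sketch that estimates a pure translation p by repeatedly linearizing the image around the current estimate and minimizing the sum of squared differences between I(x + p) and T(x). It assumes a translation-only warp, linear interpolation and a central-difference gradient; the names `sample` and `lucas_kanade_1d` and the Gaussian test signal are illustrative choices, not taken from the slides.

```python
import math


def sample(signal, x):
    """Linearly interpolate a 1-D signal at a real-valued position x."""
    i = int(math.floor(x))
    i = max(0, min(len(signal) - 2, i))
    t = x - i
    return (1 - t) * signal[i] + t * signal[i + 1]


def lucas_kanade_1d(image, template, start, p=0.0, iterations=20):
    """Estimate shift p such that image[start + x + p] matches template[x]."""
    for _ in range(iterations):
        num = den = 0.0
        for x, t_val in enumerate(template):
            pos = start + x + p
            warped = sample(image, pos)
            # Central-difference estimate of the image gradient at pos.
            grad = sample(image, pos + 0.5) - sample(image, pos - 0.5)
            num += grad * (t_val - warped)
            den += grad * grad
        if den < 1e-12:
            break  # no gradient in the window: the shift is not observable
        step = num / den
        p += step
        if abs(step) < 1e-6:
            break
    return p


# A smooth bump as the "image"; the template is a window cut out 1.5 samples
# to the right of where the search starts.
image = [math.exp(-((x - 30) ** 2) / 50.0) for x in range(60)]
start, true_shift = 22, 1.5
template = [sample(image, start + x + true_shift) for x in range(16)]
estimated_shift = lucas_kanade_1d(image, template, start)
```

The early exit on a near-zero denominator is the 1-D analogue of an untextured patch, for which no displacement can be recovered.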

Original Lucas-Kanade algorithm I
Original Lucas-Kanade algorithm II
Original Lucas-Kanade algorithm III
Original Lucas-Kanade algorithm IV
Original Lucas-Kanade summary

Slide credit: Tomas Svoboda

KLT Tracker

For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]

Slide credit: Patrick Perez
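The three cases above can be checked numerically with the 2x2 matrix of summed gradient products of a patch: both eigenvalues vanish for a uniform patch, one vanishes for a straight edge, and both are positive for a corner. The sketch below is a plain-Python illustration under simple assumptions (central differences, no smoothing window); the function name and the tiny synthetic patches are made up for this example.

```python
import math


def gradient_matrix_eigenvalues(patch):
    """Return (lambda_min, lambda_max) of the 2x2 gradient matrix of a patch."""
    h, w = len(patch), len(patch[0])
    gxx = gxy = gyy = 0.0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            ix = (patch[r][c + 1] - patch[r][c - 1]) / 2.0
            iy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
    # Closed-form eigenvalues of the symmetric matrix [[gxx, gxy], [gxy, gyy]].
    mean = (gxx + gyy) / 2.0
    root = math.sqrt(((gxx - gyy) / 2.0) ** 2 + gxy ** 2)
    return mean - root, mean + root


size = 6
uniform = [[5.0] * size for _ in range(size)]
edge = [[10.0 if c >= 3 else 0.0 for c in range(size)] for _ in range(size)]
corner = [[10.0 if (r >= 3 and c >= 3) else 0.0 for c in range(size)]
          for r in range(size)]

uniform_eigs = gradient_matrix_eigenvalues(uniform)  # no information
edge_eigs = gradient_matrix_eigenvalues(edge)        # aperture problem
corner_eigs = gradient_matrix_eigenvalues(corner)    # well conditioned
```

Thresholding the smaller eigenvalue is one common way to decide whether a patch is worth tracking.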

Aperture Problem Example

Image from Gary Bradski slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

- Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez
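A minimal sketch of the first option: compute the sum of squared differences (SSD) between the stored patch and the patch at the current position, and drop the tracklet when the error is too large. The per-pixel threshold and the function names are assumptions chosen for illustration, and the affine warp used in the second option is deliberately left out.

```python
def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(patch_a, patch_b)
               for a, b in zip(row_a, row_b))


def keep_tracklet(reference_patch, current_patch, max_mean_sq_error=100.0):
    """Keep the tracklet only while the mean squared error stays below a bound.

    max_mean_sq_error is a hypothetical tuning parameter, not a value from
    the lecture.
    """
    pixels = len(reference_patch) * len(reference_patch[0])
    return ssd(reference_patch, current_patch) / pixels <= max_mean_sq_error


reference = [[10, 20], [30, 40]]
similar = [[12, 19], [31, 38]]       # small appearance change
occluded = [[200, 200], [200, 200]]  # something else now covers the patch
```

Normalizing by the number of pixels keeps the threshold independent of the patch size.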

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters

What about deep learning or CNN for tracking?

AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016

Recap: CaffeNet architecture

Recent trackers development

https://github.com/foolwood/benchmark_results

CNN, KCF

Discriminative Tracking (T. by Detection)

t = 0: samples with labels +1 +1 +1 −1 −1 −1 → Classifier
t > 0: … Classify subwindows to find target

Connection to Correlation

The Convolution Theorem:
"Cross-correlation is equivalent to an element-wise product in Fourier domain"

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$

• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is element-wise product
  - $*$ is complex-conjugate (i.e. negate imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
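The identity can be verified numerically on a short 1-D signal. The sketch below compares a direct cyclic cross-correlation with the Fourier-domain product; it uses a naive O(n^2) DFT written out by hand only to stay self-contained (a real tracker would call an FFT routine), and the function names are illustrative.

```python
import cmath


def dft(values, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), enough for a small demo."""
    n = len(values)
    sign = 1 if inverse else -1
    result = []
    for f in range(n):
        total = sum(v * cmath.exp(sign * 2j * cmath.pi * f * t / n)
                    for t, v in enumerate(values))
        result.append(total / n if inverse else total)
    return result


def cross_correlation_direct(x, w):
    """Cyclic cross-correlation: y[i] = sum_j x[j] * w[(j + i) mod n]."""
    n = len(x)
    return [sum(x[j] * w[(j + i) % n] for j in range(n)) for i in range(n)]


def cross_correlation_dft(x, w):
    """Same result via y_hat = conj(x_hat) * w_hat (element-wise)."""
    x_hat, w_hat = dft(x), dft(w)
    y_hat = [xh.conjugate() * wh for xh, wh in zip(x_hat, w_hat)]
    return [value.real for value in dft(y_hat, inverse=True)]


x = [1.0, 2.0, 3.0, 0.0]
w = [0.0, 1.0, 0.5, 0.0]
direct = cross_correlation_direct(x, w)
via_dft = cross_correlation_dft(x, w)
```

The modulo in the direct version is exactly the cyclic wrap-around mentioned in the note above.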

Kernelized Correlation Filters

• Circulant matrices are a very general tool which allows replacing standard operations with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression:
  $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$
  with $K$ the kernel matrix, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and dual space representation
• For many kernels, circulant data ⇒ circulant $K$ matrix:
  $K = C(\mathbf{k}^{\mathbf{xx}})$
  where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
  $\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$
  Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
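As a sanity check of the closed-form solution, the sketch below trains on a 1-D signal with a linear kernel (so the kernel auto-correlation is just the cyclic auto-correlation of x) and confirms that the Fourier-domain answer satisfies the original system (K + λI)α = y. Everything here is an illustrative assumption: the linear kernel, the hand-written O(n^2) DFT used instead of an FFT, and the function names.

```python
import cmath


def dft(values, inverse=False):
    """Naive discrete Fourier transform (O(n^2)), enough for a small demo."""
    n = len(values)
    sign = 1 if inverse else -1
    result = []
    for f in range(n):
        total = sum(v * cmath.exp(sign * 2j * cmath.pi * f * t / n)
                    for t, v in enumerate(values))
        result.append(total / n if inverse else total)
    return result


def circular_autocorrelation(x):
    """Linear-kernel auto-correlation k[i] = sum_j x[j] * x[(j + i) mod n]."""
    n = len(x)
    return [sum(x[j] * x[(j + i) % n] for j in range(n)) for i in range(n)]


def train_linear(x, y, lam):
    """alpha_hat = y_hat / (k_hat + lambda), then back to the signal domain."""
    k_hat = dft(circular_autocorrelation(x))
    y_hat = dft(y)
    alpha_hat = [yh / (kh + lam) for yh, kh in zip(y_hat, k_hat)]
    return [a.real for a in dft(alpha_hat, inverse=True)]


def ridge_residual(x, y, lam, alpha):
    """Largest absolute entry of (K + lam*I) alpha - y, with K = C(k)."""
    n = len(x)
    k = circular_autocorrelation(x)
    worst = 0.0
    for i in range(n):
        row = sum(k[(j - i) % n] * alpha[j] for j in range(n)) + lam * alpha[i]
        worst = max(worst, abs(row - y[i]))
    return worst


x = [1.0, 2.0, 0.5, -1.0]
y = [1.0, 0.0, 0.0, 0.0]   # regression target peaked at zero shift
alpha = train_linear(x, y, lam=0.1)
residual = ridge_residual(x, y, 0.1, alpha)
```

With an FFT in place of `dft`, the training step costs O(n log n), whereas building and inverting K explicitly would not.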

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
  $k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$
• For the Gaussian kernel it yields
  $\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$
  (multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\mathcal{F}^{-1}(\cdot)$ term)
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features
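The update step can be sketched in a few lines: find the location of the maximum response, then blend the stored model with the one computed at that location. The names `response_peak`, `interpolate_model` and `interp_factor`, and the numeric values, are assumptions for illustration; the slides do not give a learning rate.

```python
def response_peak(response):
    """Index of the maximum of a 1-D response f(z)."""
    return max(range(len(response)), key=lambda i: response[i])


def interpolate_model(old, new, interp_factor=0.02):
    """Convex combination of the stored model and the newly computed one."""
    return [(1.0 - interp_factor) * o + interp_factor * n
            for o, n in zip(old, new)]


response = [0.1, 0.3, 0.9, 0.2]
peak = response_peak(response)  # the new patch would be cut out here

old_model = [1.0, 0.0]
new_model = [0.0, 1.0]
blended = interpolate_model(old_model, new_model, interp_factor=0.25)
```

A small interpolation factor makes the model adapt slowly, which limits drift at the cost of slower adaptation to appearance change; the same blend would be applied to the classifier coefficients.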

Kernelized Correlation Filters

KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)

Training and detection (Matlab):

    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      % sum over the channel dimension (3) in the kernel computation
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end

From KCF to Discriminative CF trackers

Basic
• Henriques et al. - CSK
  - raw grayscale pixel values as features
• Henriques et al. - KCF
  - HoG multi-channel features

Further work
• Danelljan et al. - DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al. - SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize scale space and normalize each scale to one size by bilinear interpolation → only one filter on normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows using a much larger search region
  - more discriminative to background (more training data)

CNN-based Correlation Trackers
• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on ImageNet dataset, extracted from third, fourth and fifth convolution layer
  - for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)
• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? "The zero-order tracker"

Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution

[Figure: growth of the VOT community over time, with the VOT2013-VOT2016 submission deadlines marked]
• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages, + VOT-TIR paper (69 coauthors)
• ECCV 2016: 141 coauthors, 44 pages, + VOT-TIR paper (70 coauthors)

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

          Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013   ranks, A, R        16, s. manual    manual       per frame   27
VOT2014   ranks, A, R, EFO   25, s. manual    manual       per frame   38
VOT2015   EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016   EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR
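For orientation, the A and R columns refer to accuracy (how well the predicted box overlaps the ground truth while tracking succeeds) and robustness (how often the tracker fails). The sketch below shows one simplified way to compute both from per-frame boxes. It assumes axis-aligned boxes given as (x, y, width, height) and treats zero overlap as a failure; the re-initialization after a failure that a full evaluation protocol performs is deliberately left out, and the function names are illustrative.

```python
def overlap(box_a, box_b):
    """Intersection over union of two axis-aligned (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over tracked frames, and the number of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (100, 100, 10, 10)]
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
```

Reporting the two numbers separately keeps a tracker that is precise but fragile distinguishable from one that is sloppy but never loses the target.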

Class of trackers tested

• Single-object, single-camera
• Short-term
  - Trackers performing without re-detection
• Causality
  - Tracker is not allowed to use any future frames
• No prior knowledge about the target
  - Only a single training example: BBox in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

VOT Automatic Dataset Construction Protocol

Candidate pool:
  ALOV (315 seq) [Smeulders et al. 2013]
+ PTR (~50 seq) [Vojir et al. 2013]
+ OTB (~50 seq) [Wu et al. 2013]
+ >30 new sequences from the VOT2015 committee
= 443 sequences

Filtered out:
• Grayscale sequences
• <400 pixels targets
• Poorly-defined targets
• Artificially created sequences
[Figure: example of a poorly defined target; example of an artificially created sequence]

→ 356 sequences → cluster + sample

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)

Feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences

Construction (3/3): Sampling

• Requirement:
  • Diverse visual attributes
  • Challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences:
  Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences:
  Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty - better ground truth:
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF

Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (vertical axis: accuracy) with C-COT, TCNN, SSAT and MLDF marked]

VOT 2016 Tracking speed

• Top-performers slowest
  • Plausible cause: CNN
• Real-time bound: Staple+
  • Decent accuracy
  • Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure: tracking speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked]

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: Submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)

Area of Expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

My background

bull PhD student of Czech Technical university in Prague

Now fully working in Deep Learning

recent paper ldquoAll you need is a good initrdquo added to

Stanford CS231n course

bull exCTO of Clear Research (clearsx) 2014-2017

bull Co-founder of Szkocka Research Group Ukrainian open

research community for computer science

httpswwwfacebookcomgroups839064726190162

Currently supervising one project related to local

features learning

bull Reviewer for the most impactful computer vision

journals TPAMI and IJCV

2

Lecture is heavily based on

tutorial

Visual Tracking

by Jiri Matas

bdquohellip Although tracking itself is by and

large solved problemldquo

-- Jianbo Shi amp Carlo Tomasi

CVPR1994 --

Application domains of Visual Tracking

bull monitoring assistance surveillance

control defense

bull robotics autonomous car driving

rescue

bull measurements medicine sport

biology meteorology

bull human computer interaction

bull augmented reality

bull film production and postproduction

motion capture editing

bull management of video content indexing

search

bull action and activity recognition

bull image stabilization

bull mobile applications

bull camera ldquotrackingrdquo4150Slide credit Jiri Matas

Applications applications applications hellip

5150Slide credit Jiri Matas

Tracking Applications hellip

ndash Team sports game analysis player statistics video annotation hellip

6150Slide credit Jiri Matas

Sport examples

httpcvlabepflch~lepetithttpwwwdartfishcomenmedia-galleryvideosindexhtm

Slide Credit Patrick Perez 7150

Model-based Tracking People and Faces

httpcvlabepflchresearchcompletedrealtime_tracking httpwwwcsbrownedu~black3Dtrackinghtml

Slide Credit Patrick Perez 8150

Tracking is commonly used in practicehellip

9150

bull Tracking is popular research topic for decades

see CVPR ICCV ECCV hellip

bull But there is no online course devoted to trackinghellip

bull nor big coverage in computer vision courses

bull nor it is well covered in textbooks

Is it clear what tracking is

video credit

Helmut

Grabner

10150Slide credit Jiri Matas

Tracking Formulation - Literature

Surprisingly little is said about tracking in standard textbooks

Limited to optic flow plus some basic trackers eg Lucas-Kanade

Definition (0)

[Forsyth and Ponce Computer Vision A modern approach 2003]

ldquoTracking is the problem of generating an inference about the

motion of an object given a sequence of images

Good solutions of this problem have a variety of applicationshelliprdquo

11150Slide credit Jiri Matas

Formulation (1) Tracking

Establishing point-to-point correspondences

in consecutive frames of an image sequence

Notes

bull The concept of an ldquoobjectrdquo in FampP definition disappeared

bull If an algorithm correctly established such correspondences

would that be a perfect tracker

bull tracking = motion estimation

12150Slide credit Jiri Matas

Tracking is Motion Estimation Optic Flow

13150Slide credit Jiri Matas

Tracking is Motion Estimation Optic Flow

14150

Tracking is Motion Estimation Optic Flow

Motion ldquopatternrdquo Camera tracking

Dense motion field

httpwwwcscmuedu~saadaProjectsCrowdSegmentation

httpwwwyoutubecomwatchv=ckVQrwYIjAs

Sparse motion field estimate

15150Slide credit Jiri Matas

Tracking is Motion Estimation Optic Flow

16150

Motion field

bull The motion field is the projection of the 3D scene

motion into the image

Slide credit James Hays

Tracking is Motion Estimation Optic Flow

17150Slide credit Jason Corso

Optic Flow

Standard formulation

bull At every pixel 2D displacement is estimated between consecutive frames

Missing

bull occlusion ndash disocclusion handling pixels visible in one image only

- in the standard formulation ldquodonrsquot knowrdquo is not an answer

bull considering the 3D nature of the world

bull large displacement handling - only recently addressed (EpicFlow 2015)

Practical issues hindering progress in optic flow

bull is the ground truth ever known

- learning and performance evaluation problematic (synthetic sequences )

bull requires generic regularization (smoothing)

bull failure (assumption validity) not easy to detect

In certain applications tracking is motion estimation on a part of the image

with specific constraints augmented reality sports analysis 18150

Formulation (1) Tracking

Establishing point-to-point correspondences

in consecutive frames of an image sequence

Notes

bull The concept of an ldquoobjectrdquo in FampP definition disappeared

bull If an algorithm correctly established such correspondences

would that be a perfect tracker

bull tracking = motion estimation

Consider the Bolt sequence

19150

Definition (2) Tracking

Given an initial estimate of its position

locate X in a sequence of images

Where X may mean

bull A (rectangular) region

bull An ldquointerest pointrdquo and its neighbourhood

bull An ldquoobjectrdquo

This definition is adopted eg in a recent book by

Maggio and Cavallaro Video Tracking 2011

Smeulders T-PAMI13

Tracking is the analysis of video sequences for the

purpose of establishing the location of the target

over a sequence of frames (time) starting from

the bounding box given in the first frame

20150

Formulation (3) Tracking as Segmentation

J Fan et al Closed-Loop Adaptation for Robust Tracking ECCV 2010

21150

Tracking as model-based segmentation

22150

Tracking as segmentation

httpvisionucsdedu~kbransonresearchcvpr2005html

httpwww2immdtudk~aamtracking

bull heart

23150

A ldquostandardrdquo CV tracking method output

24150

Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required

Formulation (4) Tracking

Given an initial estimate of the pose and state of X

In all images in a sequence (in a causal manner)

1 estimate the pose and state of X

2 (optionally) update the model of X

bull Pose any geometric parameter (position scale hellip)

bull State appearance shapesegmentation visibility articulations

bull Model update essentially a semi-supervised learning problem

ndash a priori information (appearance shape dynamics hellip)

ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences

bull Causal for estimation at T use information from time t middot T

25150

Tracking for Black Mirror Blocking

26150

Tracking in 6D

27150

Tracking-Learning-Detection (TLD)

28150

A ldquomiraclerdquo Tracking a Transparent Object

video credit

Helmut

Grabner

H Grabner H Bischof On-line boosting and vision CVPR 2006

29150

Tracking the ldquoInvisiblerdquo

H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010

30150

video

Other Tracking Problems

helliphellip multiple object tracking hellip

32150

Multi-object Tracking

Other Tracking Problems

Cell division

httpwwwyoutubecomwatchv=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster

httpwwwyoutubecomwatchv=YFKA647w4Jg

splitting and merging events hellip

34150

So I want to track

35150

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

What is here in libraries OpenCV

bull KLT tracker (1981)

bull CAMshift and Meanshift (1998)

bull TLD (2011)

bull MedianFLow (2010)

bull Boosting (2006)

bull MIL (2009)

and KCF (2012) in opencv_contrib38150

What is here in libraries BoofCV

bull SparseFlow (KLT tracker) (1991)

bull MeanShift (1998)

bull TLD (2011)

bull KCF (2012)

39150

httpsgithubcomlessthanoptimalBoofCV

Computer vision librar

Bad news OpenCV

40150httpwwwvotchallengenet

Reference implementation

41150

But authors often publish their own implementation on githubhellip

git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit

Good news

42150

Good news Not so

43150

Good news Not so

44150

Good newsEasy to contribute to open source

Just implement some modern tracking algorithm

45150

So we need to understand how tracking works

Classic

KLT tracker

46150

The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I to IV

Slide credit: Tomas Svoboda

Original Lucas-Kanade summary

Slide credit: Tomas Svoboda

KLT Tracker

For good conditioning, the patch must be textured/structured enough:

• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]

Slide credit: Patrick Perez
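The three cases in the list above can be told apart by the smaller eigenvalue of the 2x2 gradient (structure) matrix summed over the patch. The sketch below is a minimal illustration under that assumption; `min_eigenvalue` and the three toy patches are hypothetical names, not part of any library.

```python
import numpy as np

def min_eigenvalue(patch):
    """Smaller eigenvalue of the 2x2 structure matrix summed over the patch."""
    gy, gx = np.gradient(patch.astype(float))
    a, b, c = np.sum(gx * gx), np.sum(gx * gy), np.sum(gy * gy)
    # Closed-form smaller eigenvalue of [[a, b], [b, c]].
    return 0.5 * (a + c - np.sqrt((a - c) ** 2 + 4 * b * b))

uniform = np.ones((9, 9))                                   # no gradient at all
edge = np.tile((np.arange(9) >= 4).astype(float), (9, 1))   # vertical step edge
corner = np.zeros((9, 9))
corner[4:, 4:] = 1.0                                        # one bright quadrant

score_uniform = min_eigenvalue(uniform)
score_edge = min_eigenvalue(edge)
score_corner = min_eigenvalue(corner)
```

Only the corner patch gets a clearly positive score; the edge patch has gradient energy in one direction only, which is the aperture problem in matrix form.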

Aperture Problem Example

Image from Gary Bradski slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays
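A rough sketch of the coarse-to-fine idea follows. To keep it short, the per-level refinement is a tiny brute-force search over integer shifts rather than a Gauss-Newton Lucas-Kanade step, and the pyramid uses 2x2 block averaging instead of Gaussian smoothing; both are simplifying assumptions, and every function name here is hypothetical.

```python
import numpy as np

def downsample2(img):
    """Halve the resolution by averaging 2x2 blocks."""
    h, w = img.shape[0] // 2 * 2, img.shape[1] // 2 * 2
    img = img[:h, :w]
    return 0.25 * (img[0::2, 0::2] + img[1::2, 0::2] + img[0::2, 1::2] + img[1::2, 1::2])

def build_pyramid(img, levels):
    """pyramid[0] is full resolution, pyramid[-1] the coarsest level."""
    pyramid = [img.astype(float)]
    for _ in range(levels - 1):
        pyramid.append(downsample2(pyramid[-1]))
    return pyramid

def refine_shift(a, b, guess, radius=1):
    """Integer shift s near `guess` minimising mean((b(x + s) - a(x))^2), cyclic borders."""
    best, best_cost = guess, np.inf
    for dy in range(guess[0] - radius, guess[0] + radius + 1):
        for dx in range(guess[1] - radius, guess[1] + radius + 1):
            shifted = np.roll(b, (-dy, -dx), axis=(0, 1))
            cost = np.mean((shifted - a) ** 2)
            if cost < best_cost:
                best, best_cost = (dy, dx), cost
    return best

def coarse_to_fine_shift(a, b, levels=4):
    pa, pb = build_pyramid(a, levels), build_pyramid(b, levels)
    shift = (0, 0)
    for level in range(levels - 1, -1, -1):
        shift = refine_shift(pa[level], pb[level], shift)   # small search at this level
        if level > 0:
            shift = (2 * shift[0], 2 * shift[1])            # carry the estimate to the finer level
    return shift

# Smooth periodic test image and a copy displaced by (6, -10) pixels.
yy, xx = np.mgrid[0:64, 0:64]
a = np.sin(2 * np.pi * xx / 64.0) + np.sin(2 * np.pi * yy / 64.0)
b = np.roll(a, (6, -10), axis=(0, 1))
shift = coarse_to_fine_shift(a, b)
```

Each level only ever searches within one pixel of its starting guess, yet the full-resolution displacement of ten pixels is recovered because the guess is doubled on the way down the pyramid.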

Monitoring quality

- Translation is usually sufficient for small fragments, but
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez
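The first option can be sketched in a few lines. This is an illustration only: the threshold value is an arbitrary assumption, the comparison with the initial patch is done without the affine warp mentioned above, and `keep_tracklet` is a hypothetical name.

```python
import numpy as np

def ssd(a, b):
    """Sum of squared intensity differences between two equally sized patches."""
    return float(np.sum((a.astype(float) - b.astype(float)) ** 2))

def keep_tracklet(initial_patch, previous_patch, current_patch, max_ssd_per_pixel=0.01):
    """Keep the tracklet only if the current patch still resembles both references."""
    n = current_patch.size
    frame_to_frame_ok = ssd(previous_patch, current_patch) / n <= max_ssd_per_pixel
    # Simplification: no affine warp is estimated before comparing with the first frame.
    versus_initial_ok = ssd(initial_patch, current_patch) / n <= max_ssd_per_pixel
    return frame_to_frame_ok and versus_initial_ok

patch0 = np.linspace(0.0, 1.0, 25).reshape(5, 5)
slightly_changed = patch0 + 0.01          # small illumination change
occluded = np.zeros((5, 5))               # target covered by something dark

still_tracked = keep_tracklet(patch0, patch0, slightly_changed)
lost = not keep_tracklet(patch0, slightly_changed, occluded)
```

Checking against the initial patch as well as the previous one is what catches slow drift, since frame-to-frame differences can stay small while the patch gradually slides off the target.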

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters

What about deep learning or CNNs for tracking?

AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012. CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014. Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk".

Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016

Recap: CaffeNet architecture

Recent trackers development

https://github.com/foolwood/benchmark_results

[Figure labels: CNN, KCF]

Discriminative Tracking (Tracking by Detection)

t = 0: samples with labels +1 +1 +1 -1 -1 -1 → Classifier

t > 0: … classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

"Cross-correlation is equivalent to an element-wise product in the Fourier domain"

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$

• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
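The statement can be checked numerically in one dimension. The sketch below compares a direct cyclic cross-correlation with the Fourier-domain product; the vectors are arbitrary example data and `circular_cross_correlation` is a hypothetical helper.

```python
import numpy as np

def circular_cross_correlation(x, w):
    """y[i] = sum_j x[j] * w[(i + j) mod n], computed directly in O(n^2)."""
    n = len(x)
    return np.array([sum(x[j] * w[(i + j) % n] for j in range(n)) for i in range(n)])

x = np.array([1.0, 2.0, 0.0, -1.0, 3.0, 0.5])
w = np.array([0.5, -1.0, 2.0, 0.0, 1.0, 1.5])

y_direct = circular_cross_correlation(x, w)
# Same result via the DFT: conjugate one spectrum, multiply element-wise, invert.
y_fft = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))
```

The index arithmetic `(i + j) % n` is the "window wraps at the image edges" remark written out: there is no zero padding, so samples leaving one side re-enter on the other.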

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual-space representation

  $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$

• For many kernels, circulant data ⇒ circulant K matrix:

  $K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields

  $\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$

⇒ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
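The equivalence between the dense solution and the Fourier-domain division can be verified on a tiny 1-D example. The sketch assumes a Gaussian kernel and builds the full kernel matrix from all cyclic shifts of a random signal purely for comparison; the sizes and parameter values are arbitrary.

```python
import numpy as np

n, sigma, lam = 8, 2.0, 0.1
rng = np.random.default_rng(0)
x = rng.standard_normal(n)
# Regression target peaked at shift 0, wrapping around cyclically.
y = np.exp(-0.5 * np.minimum(np.arange(n), n - np.arange(n)) ** 2)

# Dense route: kernel matrix over all cyclic shifts of x, then an O(n^3) solve.
shifts = np.stack([np.roll(x, i) for i in range(n)])
sq_dists = np.sum((shifts[:, None, :] - shifts[None, :, :]) ** 2, axis=2)
K = np.exp(-sq_dists / sigma ** 2)
alpha_direct = np.linalg.solve(K + lam * np.eye(n), y)

# Fourier route: only the first row of K is needed, followed by an element-wise division.
k = K[0]
alpha_fft = np.real(np.fft.ifft(np.fft.fft(y) / (np.fft.fft(k) + lam)))
```

Because the rows of `K` are cyclic shifts of `k`, the matrix never has to be formed in a real tracker; the dense version is here only to show that both routes give the same coefficients.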

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx'}}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x'}$:

  $k_i^{\mathbf{xx'}} = \kappa(\mathbf{x'}, P^{i-1}\mathbf{x})$

• For the Gaussian kernel it yields

  $\mathbf{k}^{\mathbf{xx'}} = \exp\left(-\dfrac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x'}\|^2 - 2\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$

  (multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term)

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)

Training and detection (Matlab)

function alphaf = train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);
end

function y = detect(alphaf, x, z, sigma)
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));
end

function k = kernel_correlation(x1, x2, sigma)
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));  % sum over channel dimension in kernel computation
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end
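For readers without Matlab, the three functions above translate almost line by line to NumPy. This is a hedged port of the slide code, not the authors' released tracker: feature extraction, windowing and the model update are left out, and the random patch and parameter values in the demo are arbitrary assumptions.

```python
import numpy as np

def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of two H x W x C patches over all cyclic shifts."""
    x1f = np.fft.fft2(x1, axes=(0, 1))
    x2f = np.fft.fft2(x2, axes=(0, 1))
    c = np.fft.ifft2(np.sum(np.conj(x1f) * x2f, axis=2))   # sum over the channel dimension
    d = np.vdot(x1, x1) + np.vdot(x2, x2) - 2 * c
    return np.exp(-1.0 / sigma ** 2 * np.abs(d) / d.size)

def train(x, y, sigma, lam):
    k = kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alphaf, x, z, sigma):
    k = kernel_correlation(z, x, sigma)
    return np.real(np.fft.ifft2(alphaf * np.fft.fft2(k)))

# Demo on random 2-channel "features" with a Gaussian target peaked at shift (0, 0).
rng = np.random.default_rng(0)
H, W, C = 16, 16, 2
x = rng.standard_normal((H, W, C))
dy = np.minimum(np.arange(H), H - np.arange(H))[:, None]
dx = np.minimum(np.arange(W), W - np.arange(W))[None, :]
y = np.exp(-0.5 * (dy ** 2 + dx ** 2) / 2.0 ** 2)

alphaf = train(x, y, sigma=0.5, lam=1e-4)
z = np.roll(x, (3, 5), axis=(0, 1))              # new patch: the model shifted cyclically
response = detect(alphaf, x, z, sigma=0.5)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(response), response.shape))
```

With the argument order used on the slide, a patch that is the model shifted cyclically by s produces its response peak at -s modulo the patch size, so the peak index has to be unwrapped and its sign interpreted accordingly before the target position is updated.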

From KCF to Discriminative CF trackers

Basic

• Henriques et al. - CSK
  - raw grayscale pixel values as features
• Henriques et al. - KCF
  - HoG multi-channel features

Further work

• Danelljan et al. - DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al. - SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows the use of a much larger search region
  - more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? "The zero-order tracker"

Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution

[Figure: timeline plot with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked]

• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages, + VOT-TIR paper (69 coauthors)
• ECCV 2016: 141 coauthors, 44 pages, + VOT-TIR paper (70 coauthors)

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

           Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013    ranks, A, R        16, s. manual    manual       per frame   27
VOT2014    ranks, A, R, EFO   25, s. manual    manual       per frame   38
VOT2015    EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016    EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR
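As a rough illustration of the A and R columns, assuming they stand for accuracy (average bounding-box overlap) and robustness (number of failures) as in the VOT methodology, the two quantities can be sketched as below. The real protocol re-initializes the tracker after each failure and skips some frames afterwards; this simplified version does not, and both function names are hypothetical.

```python
def overlap(box_a, box_b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over frames with non-zero overlap, and the count of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    failures = sum(1 for o in overlaps if o == 0)
    return accuracy, failures

ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (100, 100, 10, 10)]
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
```

Reporting the two numbers separately matters: a tracker can score high overlap on the frames it survives while failing often, which a single averaged figure would hide.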

Class of trackers tested

• Single-object, single-camera
• Short-term
  - Trackers performing without re-detection
• Causality
  - Tracker is not allowed to use any future frames
• No prior knowledge about the target
  - Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

ALOV (315 seq) [Smeulders et al., 2013] + PTR (~50 seq) [Vojir et al., 2013] + OTB (~50 seq) [Wu et al., 2013] + >30 new sequences from the VOT2015 committee = 443 sequences

Filtered out:
• Grayscale sequences
• Targets smaller than 400 pixels
• Poorly-defined targets
• Artificially created sequences

Example: poorly defined target. Example: artificially created.

→ 356 sequences

VOT Automatic Dataset Construction Protocol: cluster + sample

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)

Feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences

Construction (3/3): Sampling

• Requirement:
  • Diverse visual attributes
  • Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF

Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves and AR-raw plot (accuracy axis), with C-COT, TCNN, SSAT and MLDF labelled]

VOT 2016 Tracking speed

• Top performers are the slowest
  • Plausible cause: CNN
• Real-time bound: Staple+
  • Decent accuracy
  • Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure labels: C-COT, TCNN, SSAT, MLDF, Staple+]

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & excellence: a significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please


People12 Academics 3 full 2 associate 7 assistant professors15 Researchers15 Phd students 15 MSc and BSc students

Impact amp Excellencesignificant of funding (gt75) from EU and private companies collaboration with high-tech companies (Microsoft Google Toyota VW Daimler Samsung Boeing Hitachi Honeywell) numerous science prizes hundreds of ISI citations to our publications per year competitive results in contests

Visual Recognition Group (headJiri Matas)

We are looking for 1-3

good students for PhD

THANK YOU

Questions please

105150

Application domains of Visual Tracking

• monitoring, assistance, surveillance, control, defense
• robotics, autonomous car driving, rescue
• measurements: medicine, sport, biology, meteorology
• human-computer interaction
• augmented reality
• film production and postproduction: motion capture, editing
• management of video content: indexing, search
• action and activity recognition
• image stabilization
• mobile applications
• camera “tracking”

Slide credit: Jiri Matas

Applications, applications, applications …

Slide credit: Jiri Matas

Tracking Applications …

- Team sports: game analysis, player statistics, video annotation, …

Slide credit: Jiri Matas

Sport examples

http://cvlab.epfl.ch/~lepetit
http://www.dartfish.com/en/media-gallery/videos/index.htm

Slide credit: Patrick Perez

Model-based Tracking: People and Faces

http://cvlab.epfl.ch/research/completed/realtime_tracking
http://www.cs.brown.edu/~black/3Dtracking.html

Slide credit: Patrick Perez

Tracking is commonly used in practice…

• Tracking has been a popular research topic for decades: see CVPR, ICCV, ECCV, …
• But there is no online course devoted to tracking…
• nor much coverage in computer vision courses
• nor is it well covered in textbooks

Is it clear what tracking is?

Video credit: Helmut Grabner
Slide credit: Jiri Matas

Tracking Formulation - Literature

Surprisingly little is said about tracking in standard textbooks.
Coverage is limited to optic flow plus some basic trackers, e.g. Lucas-Kanade.

Definition (0)
[Forsyth and Ponce, Computer Vision: A Modern Approach, 2003]

“Tracking is the problem of generating an inference about the motion of an object given a sequence of images. Good solutions of this problem have a variety of applications…”

Slide credit: Jiri Matas

Formulation (1): Tracking

Establishing point-to-point correspondences in consecutive frames of an image sequence.

Notes:
• The concept of an “object” in the F&P definition disappeared.
• If an algorithm correctly established such correspondences, would that be a perfect tracker?
• tracking = motion estimation

Slide credit: Jiri Matas

Tracking is Motion Estimation / Optic Flow

Slide credit: Jiri Matas

Motion “pattern”, camera tracking

Dense motion field:
http://www.cs.cmu.edu/~saada/Projects/CrowdSegmentation
http://www.youtube.com/watch?v=ckVQrwYIjAs

Sparse motion field estimate

Slide credit: Jiri Matas

Motion field
• The motion field is the projection of the 3D scene motion into the image.

Slide credit: James Hays
Slide credit: Jason Corso

Optic Flow

Standard formulation:
• At every pixel, a 2D displacement is estimated between consecutive frames.

Missing:
• occlusion / disocclusion handling: pixels visible in one image only
  - in the standard formulation, “don’t know” is not an answer
• considering the 3D nature of the world
• large displacement handling - only recently addressed (EpicFlow, 2015)

Practical issues hindering progress in optic flow:
• is the ground truth ever known?
  - learning and performance evaluation are problematic (synthetic sequences?)
• requires generic regularization (smoothing)
• failure (assumption validity) is not easy to detect

In certain applications, tracking is motion estimation on a part of the image with specific constraints: augmented reality, sports analysis.

Formulation (1): Tracking

Establishing point-to-point correspondences in consecutive frames of an image sequence.

Notes:
• The concept of an “object” in the F&P definition disappeared.
• If an algorithm correctly established such correspondences, would that be a perfect tracker?
• tracking = motion estimation

Consider the Bolt sequence.

Definition (2): Tracking

Given an initial estimate of its position, locate X in a sequence of images,

where X may mean:
• A (rectangular) region
• An “interest point” and its neighbourhood
• An “object”

This definition is adopted e.g. in a recent book by Maggio and Cavallaro, Video Tracking, 2011.

Smeulders, T-PAMI’13:
Tracking is the analysis of video sequences for the purpose of establishing the location of the target over a sequence of frames (time), starting from the bounding box given in the first frame.

Formulation (3): Tracking as Segmentation

J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010

Tracking as model-based segmentation

Tracking as segmentation

http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
http://www2.imm.dtu.dk/~aam/tracking

• heart

A “standard” CV tracking method output

Approximate motion estimation, approximate segmentation. Neither good optic flow nor precise segmentation is required.

Formulation (4): Tracking

Given an initial estimate of the pose and state of X,
in all images in a sequence (in a causal manner):
1. estimate the pose and state of X
2. (optionally) update the model of X

• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
  - a priori information (appearance, shape, dynamics, …)
  - labeled data (“track this”) + unlabeled data = the sequences
• Causal: for estimation at time T, use only information from times t ≤ T
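The two-step loop of Formulation (4) can be sketched on a toy problem. The code below is only an illustration under strong assumptions: frames are 1-D lists of intensities, the pose is a single integer position, and the names `track`, `ssd`, `make_frame` and `update_rate` are hypothetical, not taken from any library or from the lecture.

```python
from typing import List, Sequence


def ssd(a: Sequence[float], b: Sequence[float]) -> float:
    """Sum of squared differences between two equally long patches."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def track(frames: List[List[float]], init_pos: int, size: int,
          update_rate: float = 0.0) -> List[int]:
    """Causal tracking loop on 1-D 'images' (toy setting, hypothetical API)."""
    # The model of X is initialised from the labeled first frame ("track this").
    model = list(frames[0][init_pos:init_pos + size])
    positions = [init_pos]
    for frame in frames[1:]:  # causal: frame T only sees data from t <= T
        # 1. estimate the pose (here: position) by exhaustive SSD search
        candidates = range(len(frame) - size + 1)
        pos = min(candidates, key=lambda p: ssd(frame[p:p + size], model))
        positions.append(pos)
        # 2. optionally update the model (convex combination with the new patch)
        patch = frame[pos:pos + size]
        model = [(1 - update_rate) * m + update_rate * v
                 for m, v in zip(model, patch)]
    return positions


def make_frame(pos: int, length: int = 12) -> List[float]:
    """Synthetic frame: a fixed 3-sample pattern on a zero background."""
    frame = [0.0] * length
    frame[pos:pos + 3] = [1.0, 3.0, 2.0]
    return frame


frames = [make_frame(p) for p in (2, 3, 5, 6)]
positions = track(frames, init_pos=2, size=3, update_rate=0.1)
print(positions)
```

Setting `update_rate=0.0` gives the "no update" variant, while values close to 1 approach a "last frame" model; both choices trade drift against adaptability.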

Tracking for Black Mirror Blocking

Tracking in 6D

Tracking-Learning-Detection (TLD)

A “miracle”: Tracking a Transparent Object

Video credit: Helmut Grabner
H. Grabner, H. Bischof, On-line Boosting and Vision, CVPR 2006

Tracking the “Invisible”

H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the Invisible: Learning Where the Object Might Be, CVPR 2010

Other Tracking Problems

… multiple object tracking …

Multi-object Tracking

Other Tracking Problems

Cell division:
http://www.youtube.com/watch?v=rgLJrvoX_qo

Three rounds of cell division in Drosophila melanogaster:
http://www.youtube.com/watch?v=YFKA647w4Jg

splitting and merging events …

So I want to track

Just link to some computer vision lib

from cv2 import tracker

or

#include <opencv2/core.hpp>

or

import org.opencv.core.Core;
import org.opencv.core.Mat;

What is here in libraries? OpenCV

• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib

What is here in libraries? BoofCV

• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)

https://github.com/lessthanoptimal/BoofCV

Computer vision librar

Bad news: OpenCV

http://www.votchallenge.net

Reference implementation

But authors often publish their own implementation on GitHub…

git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news

Good news? Not so

Good news: easy to contribute to open source.
Just implement some modern tracking algorithm.

So we need to understand how tracking works.

Classic: KLT tracker

The KLT Tracker

Slide credit: Kris Kitani

KLT Tracker

Slide credit: Tomas Svoboda

Importance in Computer Vision

• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU was dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms

(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment

The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.

Original Lucas-Kanade algorithm I-IV

Original Lucas-Kanade summary

Slide credit: Tomas Svoboda

KLT Tracker

For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]

Slide credit: Patrick Perez
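The three cases above can be checked numerically. A common way to measure conditioning, assumed here, is the smaller eigenvalue of the 2x2 matrix of summed gradient products over the patch (the criterion associated with the Tomasi and Shi reference). The helper names `structure_tensor` and `min_eigenvalue` and the synthetic patches are hypothetical.

```python
import math


def structure_tensor(patch):
    """Sum of [Ix*Ix, Ix*Iy, Iy*Iy] over the patch interior (central differences)."""
    h, w = len(patch), len(patch[0])
    sxx = sxy = syy = 0.0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            ix = (patch[r][c + 1] - patch[r][c - 1]) / 2.0
            iy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0
            sxx += ix * ix
            sxy += ix * iy
            syy += iy * iy
    return sxx, sxy, syy


def min_eigenvalue(patch):
    """Smaller eigenvalue of the symmetric matrix [[sxx, sxy], [sxy, syy]]."""
    sxx, sxy, syy = structure_tensor(patch)
    half_trace = (sxx + syy) / 2.0
    root = math.sqrt(((sxx - syy) / 2.0) ** 2 + sxy ** 2)
    return half_trace - root


n = 7
uniform = [[5.0] * n for _ in range(n)]
# vertical contour: intensity changes along x only
edge = [[0.0 if c < n // 2 else 1.0 for c in range(n)] for _ in range(n)]
# corner: intensity changes along both x and y
corner = [[1.0 if (r >= n // 2 and c >= n // 2) else 0.0 for c in range(n)]
          for r in range(n)]

for name, patch in (("uniform", uniform), ("edge", edge), ("corner", corner)):
    print(name, min_eigenvalue(patch))
```

Only the corner patch yields a clearly positive smaller eigenvalue; for the uniform patch and the contour the matrix is singular, which is the aperture problem in matrix form.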

Aperture Problem Example

Image from Gary Bradski's slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

- Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters
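To make the cost function and its iterative minimisation concrete, here is a minimal sketch for the simplest possible case: a 1-D signal and a pure translation. It uses a Gauss-Newton-style update (the linearised SSD solved in closed form at each step) rather than plain gradient descent, and linear interpolation for sub-sample positions; all function names and the synthetic Gaussian bump are assumptions made for illustration.

```python
import math


def sample(signal, x):
    """Linearly interpolate a 1-D signal at a real-valued position (clamped)."""
    x = min(max(x, 0.0), len(signal) - 1.0)
    i = min(int(x), len(signal) - 2)
    t = x - i
    return (1 - t) * signal[i] + t * signal[i + 1]


def lucas_kanade_1d(template, image, start, p=0.0, iters=20):
    """Estimate shift p so that image[start + x + p] matches template[x]."""
    for _ in range(iters):
        num = den = 0.0
        for x in range(len(template)):
            pos = start + x + p
            warped = sample(image, pos)
            # central-difference estimate of dI/dx at the warped position
            grad = sample(image, pos + 0.5) - sample(image, pos - 0.5)
            num += grad * (template[x] - warped)
            den += grad * grad
        if den < 1e-12:  # untextured window: the update is ill-conditioned
            break
        step = num / den  # minimiser of the linearised SSD
        p += step
        if abs(step) < 1e-6:
            break
    return p


def bump(center, length=60, width=4.0):
    """Synthetic smooth pattern used as the tracked structure."""
    return [math.exp(-((x - center) / width) ** 2) for x in range(length)]


frame0 = bump(30.0)
frame1 = bump(32.5)        # the pattern moved by +2.5 samples
template = frame0[20:41]   # window around the pattern in frame 0
shift = lucas_kanade_1d(template, frame1, start=20)
print(round(shift, 2))
```

The `den < 1e-12` guard is the 1-D counterpart of the conditioning requirement: without gradient inside the window there is nothing to solve for.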

What about deep learning or CNN for tracking?

AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”

Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016

Recap: CaffeNet architecture

Recent trackers development

https://github.com/foolwood/benchmark_results

CNN, KCF

Discriminative Tracking (Tracking by Detection)

t = 0: samples with labels +1, +1, +1, −1, −1, −1 → Classifier
t > 0: … classify subwindows to find the target

Connection to Correlation

The Convolution Theorem:
“Cross-correlation is equivalent to an element-wise product in the Fourier domain”

$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$

• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges).
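The identity can be verified numerically on a short 1-D signal. The sketch below uses a naive O(n²) DFT written with the standard library only, so that it stays self-contained; a real tracker would use an FFT routine. The function names and test vectors are assumptions for illustration, and the direct sum uses the convention y[i] = Σ_j x[j]·w[(j + i) mod n].

```python
import cmath


def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out


def cyclic_cross_correlation(x, w):
    """Direct evaluation: y[i] = sum_j x[j] * w[(j + i) mod n]."""
    n = len(x)
    return [sum(x[j] * w[(j + i) % n] for j in range(n)) for i in range(n)]


def cross_correlation_dft(x, w):
    """Same result via conj(DFT(x)) * DFT(w), element-wise, then inverse DFT."""
    xf, wf = dft(x), dft(w)
    yf = [a.conjugate() * b for a, b in zip(xf, wf)]
    return [c.real for c in dft(yf, inverse=True)]


x = [1.0, 2.0, 3.0, 4.0]
w = [0.0, 1.0, 0.5, 0.0]
direct = cyclic_cross_correlation(x, w)
via_dft = cross_correlation_dft(x, w)
print(direct)
print([round(v, 6) for v in via_dft])
```

The wrap-around in `(j + i) % n` is exactly the cyclic behaviour noted on the slide: shifting the window past the border brings samples in from the opposite side.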

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows replacing standard operations with fast Fourier operations.
• The same idea can be applied e.g. to Kernel Ridge Regression:

  $$\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y}$$

  with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual space representation $\boldsymbol{\alpha}$.
• For many kernels, circulant data ⇒ circulant K matrix:

  $$K = C(\mathbf{k}^{\mathbf{xx}})$$

  where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute).
• Diagonalizing with the DFT for learning the classifier yields

  $$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$$

  ⇒ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:

  $$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$$

• For the Gaussian kernel it yields

  $$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$$

  Multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')$ term.
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation, using the location of the maximum response $\mathbf{f}(\mathbf{z})$.
• The kernel allows integration of more complex and multi-channel features.

Kernelized Correlation Filters

KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)

Training and detection (Matlab):

    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));  % sum over channel dimension in kernel computation
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
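For readers without Matlab, the same three functions can be mirrored in plain Python for a single-channel 1-D signal. This is a sketch under assumptions, not the authors' implementation: it uses a naive O(n²) DFT instead of `fft2`, the test signal and regression target are synthetic, and `detect` computes the kernel correlation with the model first (k^{xz}, conjugate on the model), so that the index of the response peak equals the cyclic shift of the target; with the two arguments swapped the response map would be mirrored.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x1 and every cyclic shift of x2 (1-D, one channel)."""
    n = len(x1)
    f1, f2 = dft(x1), dft(x2)
    c = [v.real for v in dft([a.conjugate() * b for a, b in zip(f1, f2)],
                             inverse=True)]
    energy = sum(a * a for a in x1) + sum(b * b for b in x2)
    return [math.exp(-abs(energy - 2.0 * ci) / (sigma ** 2 * n)) for ci in c]


def train(x, y, sigma, lam):
    """Kernel ridge regression solved element-wise in the Fourier domain."""
    kf = dft(kernel_correlation(x, x, sigma))
    return [a / (b + lam) for a, b in zip(dft(y), kf)]


def detect(alphaf, x, z, sigma):
    """Response of the learned filter for all cyclic shifts of the new patch z."""
    kf = dft(kernel_correlation(x, z, sigma))
    resp = dft([a * b for a, b in zip(alphaf, kf)], inverse=True)
    return [v.real for v in resp]


n = 16
x = [math.exp(-((i - 8) / 2.0) ** 2) for i in range(n)]                  # model patch
y = [math.exp(-min(i, n - i) ** 2 / (2 * 1.5 ** 2)) for i in range(n)]   # target peaked at shift 0
alphaf = train(x, y, sigma=0.5, lam=1e-4)

shift = 3
z = [x[(i - shift) % n] for i in range(n)]   # the pattern moved by 3 samples (cyclic)
response = detect(alphaf, x, z, sigma=0.5)
peak = max(range(n), key=lambda i: response[i])
print(peak)
```

In a full tracker the peak location would move the search window, after which `alphaf` and the model `x` would be blended with their previous values by linear interpolation.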

From KCF to Discriminative CF trackers

Basic
• Henriques et al. - CSK
  - raw grayscale pixel values as features
• Henriques et al. - KCF
  - HoG multi-channel features

Further work
• Danelljan et al. - DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al. - SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows the use of a much larger search region
  - more discriminative to background (more training data)

CNN-based Correlation Trackers
• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)
• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? “The zero-order tracker”

Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution

[Figure: timeline with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked; axis ticks at 1500 and 3000]

• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• + VOT-TIR papers (69 and 70 coauthors)

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

           Perf. measures     Dataset size      Target box   Property    Trackers tested
VOT2013    ranks, A, R        16, s. manual     manual       per frame   27
VOT2014    ranks, A, R, EFO   25, s. manual     manual       per frame   38
VOT2015    EAO, A, R, EFO     60, fully auto    manual       per frame   62 VOT, 24 VOT-TIR
VOT2016    EAO, A, R, EFO     60, fully auto    auto         per frame   70 VOT, 24 VOT-TIR
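The A (accuracy) and R (robustness) columns can be illustrated with a simplified computation: accuracy as the mean bounding-box overlap on frames where the target is still tracked, and robustness as the number of frames where the overlap drops to zero. This is only a sketch; the actual VOT protocol additionally re-initialises the tracker after each failure and ignores some frames after re-initialisation, which is omitted here. The `(x, y, width, height)` box format and the function names are assumptions.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_robustness(predicted, ground_truth):
    """Mean overlap over tracked frames, and the count of zero-overlap frames."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10),    # perfect overlap
             (5, 0, 10, 10),    # half of the box shifted away
             (20, 20, 10, 10)]  # target lost
accuracy, failures = accuracy_robustness(predicted, ground_truth)
print(round(accuracy, 3), failures)
```

Reporting the two numbers separately matters: a tracker can be very accurate while it follows the target and still fail often, or the other way round.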

Class of trackers tested

• Single-object, single-camera
• Short-term:
  - Trackers performing without re-detection
• Causality:
  - Tracker is not allowed to use any future frames
• No prior knowledge about the target:
  - Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

VOT Automatic Dataset Construction Protocol: cluster + sample

Sources (443 sequences):
• ALOV (315 seq.) [Smeulders et al. 2013]
• PTR (~50 seq.) [Vojir et al. 2013]
• OTB (~50 seq.) [Wu et al. 2013]
• >30 new sequences from the VOT2015 committee

Filtered out:
• Grayscale sequences
• Targets smaller than 400 pixels
• Poorly defined targets
• Artificially created sequences

[Examples shown: a poorly defined target; an artificially created sequence]

Result: 356 sequences

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)

Feature encoding (11 dim.) → Affinity Propagation [Frey, Dueck 2007] → Cluster similar sequences

Construction (3/3): Sampling

• Requirement:
  - Diverse visual attributes
  - Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied FoT, ASMS and KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement:
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
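Step 2 can be illustrated with a toy cost function. The slide does not state which cost VOT optimises, so the one below is purely hypothetical: a weighted count of target pixels left outside the box and background pixels enclosed by it, minimised by brute force over axis-aligned boxes on a small binary mask. The names `box_cost`, `fit_box` and `alpha` are assumptions.

```python
def box_cost(mask, box, alpha=0.5):
    """Hypothetical cost of a box (top, left, bottom, right), bounds inclusive."""
    top, left, bottom, right = box
    fg_outside = bg_inside = 0
    for r, row in enumerate(mask):
        for c, v in enumerate(row):
            inside = top <= r <= bottom and left <= c <= right
            if v and not inside:
                fg_outside += 1      # target pixel that the box misses
            elif not v and inside:
                bg_inside += 1       # background pixel that the box encloses
    return alpha * fg_outside + (1 - alpha) * bg_inside


def fit_box(mask, alpha=0.5):
    """Exhaustive search over all axis-aligned boxes (fine for tiny masks only)."""
    h, w = len(mask), len(mask[0])
    boxes = ((t, l, b, r)
             for t in range(h) for b in range(t, h)
             for l in range(w) for r in range(l, w))
    return min(boxes, key=lambda box: box_cost(mask, box, alpha))


mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],   # one-pixel protrusion of the target
    [0, 0, 0, 0, 0, 0],
]
balanced = fit_box(mask, alpha=0.5)
inclusive = fit_box(mask, alpha=0.9)
print(balanced, inclusive)
```

With equal weights the protrusion is cut off because enclosing it would add two background pixels; raising `alpha` makes missed target pixels more expensive and the fit grows to the tight box around the whole segmentation.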

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF

Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (vertical axis: accuracy) with C-COT, TCNN, SSAT and MLDF marked]

VOT 2016: Tracking speed

• Top performers are the slowest
  - Plausible cause: CNN
• Real-time bound: Staple+
  - Decent accuracy
  - Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure: speed plots with C-COT, TCNN, SSAT, MLDF and Staple+ marked]

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, “template update”)
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & excellence: a significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

Applications applications applications hellip

5150Slide credit Jiri Matas

Tracking Applications hellip

ndash Team sports game analysis player statistics video annotation hellip

6150Slide credit Jiri Matas

Sport examples

httpcvlabepflch~lepetithttpwwwdartfishcomenmedia-galleryvideosindexhtm

Slide Credit Patrick Perez 7150

Model-based Tracking People and Faces

httpcvlabepflchresearchcompletedrealtime_tracking httpwwwcsbrownedu~black3Dtrackinghtml

Slide Credit Patrick Perez 8150

Tracking is commonly used in practicehellip

9150

bull Tracking is popular research topic for decades

see CVPR ICCV ECCV hellip

bull But there is no online course devoted to trackinghellip

bull nor big coverage in computer vision courses

bull nor it is well covered in textbooks

Is it clear what tracking is

video credit

Helmut

Grabner

10150Slide credit Jiri Matas

Tracking Formulation - Literature

Surprisingly little is said about tracking in standard textbooks

Limited to optic flow plus some basic trackers eg Lucas-Kanade

Definition (0)

[Forsyth and Ponce Computer Vision A modern approach 2003]

ldquoTracking is the problem of generating an inference about the

motion of an object given a sequence of images

Good solutions of this problem have a variety of applicationshelliprdquo

11150Slide credit Jiri Matas

Formulation (1) Tracking

Establishing point-to-point correspondences

in consecutive frames of an image sequence

Notes

bull The concept of an ldquoobjectrdquo in FampP definition disappeared

bull If an algorithm correctly established such correspondences

would that be a perfect tracker

bull tracking = motion estimation

12150Slide credit Jiri Matas

Tracking is Motion Estimation Optic Flow

13150Slide credit Jiri Matas

Tracking is Motion Estimation Optic Flow

14150

Tracking is Motion Estimation Optic Flow

Motion ldquopatternrdquo Camera tracking

Dense motion field

httpwwwcscmuedu~saadaProjectsCrowdSegmentation

httpwwwyoutubecomwatchv=ckVQrwYIjAs

Sparse motion field estimate

15150Slide credit Jiri Matas

Tracking is Motion Estimation Optic Flow

16150

Motion field

bull The motion field is the projection of the 3D scene

motion into the image

Slide credit James Hays

Tracking is Motion Estimation Optic Flow

17150Slide credit Jason Corso

Optic Flow

Standard formulation

bull At every pixel 2D displacement is estimated between consecutive frames

Missing

bull occlusion ndash disocclusion handling pixels visible in one image only

- in the standard formulation ldquodonrsquot knowrdquo is not an answer

bull considering the 3D nature of the world

bull large displacement handling - only recently addressed (EpicFlow 2015)

Practical issues hindering progress in optic flow

bull is the ground truth ever known

- learning and performance evaluation problematic (synthetic sequences )

bull requires generic regularization (smoothing)

bull failure (assumption validity) not easy to detect

In certain applications tracking is motion estimation on a part of the image

with specific constraints augmented reality sports analysis 18150

Formulation (1) Tracking

Establishing point-to-point correspondences

in consecutive frames of an image sequence

Notes

bull The concept of an ldquoobjectrdquo in FampP definition disappeared

bull If an algorithm correctly established such correspondences

would that be a perfect tracker

bull tracking = motion estimation

Consider the Bolt sequence

19150

Definition (2) Tracking

Given an initial estimate of its position

locate X in a sequence of images

Where X may mean

bull A (rectangular) region

bull An ldquointerest pointrdquo and its neighbourhood

bull An ldquoobjectrdquo

This definition is adopted eg in a recent book by

Maggio and Cavallaro Video Tracking 2011

Smeulders T-PAMI13

Tracking is the analysis of video sequences for the

purpose of establishing the location of the target

over a sequence of frames (time) starting from

the bounding box given in the first frame

20150

Formulation (3) Tracking as Segmentation

J Fan et al Closed-Loop Adaptation for Robust Tracking ECCV 2010

21150

Tracking as model-based segmentation

22150

Tracking as segmentation

httpvisionucsdedu~kbransonresearchcvpr2005html

httpwww2immdtudk~aamtracking

bull heart

23150

A ldquostandardrdquo CV tracking method output

24150

Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required

Formulation (4) Tracking

Given an initial estimate of the pose and state of X

In all images in a sequence (in a causal manner)

1 estimate the pose and state of X

2 (optionally) update the model of X

bull Pose any geometric parameter (position scale hellip)

bull State appearance shapesegmentation visibility articulations

bull Model update essentially a semi-supervised learning problem

ndash a priori information (appearance shape dynamics hellip)

ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences

bull Causal for estimation at T use information from time t middot T

25150

Tracking for Black Mirror Blocking

26150

Tracking in 6D

27150

Tracking-Learning-Detection (TLD)

28150

A ldquomiraclerdquo Tracking a Transparent Object

video credit

Helmut

Grabner

H Grabner H Bischof On-line boosting and vision CVPR 2006

29150

Tracking the ldquoInvisiblerdquo

H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010

30150

video

Other Tracking Problems

helliphellip multiple object tracking hellip

32150

Multi-object Tracking

Other Tracking Problems

Cell division

httpwwwyoutubecomwatchv=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster

httpwwwyoutubecomwatchv=YFKA647w4Jg

splitting and merging events hellip

34150

So I want to track

35150

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

What is here in libraries OpenCV

bull KLT tracker (1981)

bull CAMshift and Meanshift (1998)

bull TLD (2011)

bull MedianFLow (2010)

bull Boosting (2006)

bull MIL (2009)

and KCF (2012) in opencv_contrib38150

What is here in libraries BoofCV

bull SparseFlow (KLT tracker) (1991)

bull MeanShift (1998)

bull TLD (2011)

bull KCF (2012)

39150

httpsgithubcomlessthanoptimalBoofCV

Computer vision librar

Bad news OpenCV

40150httpwwwvotchallengenet

Reference implementation

41150

But authors often publish their own implementation on githubhellip

git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit

Good news

42150

Good news Not so

43150

Good news Not so

44150

Good newsEasy to contribute to open source

Just implement some modern tracking algorithm

45150

So we need to understand how tracking works

Classic

KLT tracker

46150

The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm II

Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm III

Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm IV

Slide credit: Tomas Svoboda

Original Lucas-Kanade summary

Slide credit: Tomas Svoboda

KLT Tracker

For good conditioning, the patch must be textured/structured enough:

• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]

Slide credit: Patrick Perez
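The three cases above can be checked numerically. The sketch below assumes the usual conditioning measure, the smaller eigenvalue of the 2x2 matrix of summed gradient products over the patch (the criterion associated with the cited Tomasi and Shi paper). The patch contents and the function name are made up for illustration.

```python
import math

def min_eigenvalue(patch):
    """Smaller eigenvalue of the 2x2 gradient matrix summed over the patch interior."""
    gxx = gxy = gyy = 0.0
    for r in range(1, len(patch) - 1):
        for c in range(1, len(patch[0]) - 1):
            ix = 0.5 * (patch[r][c + 1] - patch[r][c - 1])  # horizontal gradient
            iy = 0.5 * (patch[r + 1][c] - patch[r - 1][c])  # vertical gradient
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
    trace = gxx + gyy
    det = gxx * gyy - gxy * gxy
    return 0.5 * (trace - math.sqrt(max(trace * trace - 4.0 * det, 0.0)))

n = 8
uniform = [[1.0] * n for _ in range(n)]
edge = [[1.0 if c >= n // 2 else 0.0 for c in range(n)] for _ in range(n)]
corner = [[1.0 if (r >= n // 2 and c >= n // 2) else 0.0 for c in range(n)] for r in range(n)]

scores = {
    "uniform": min_eigenvalue(uniform),  # no gradient at all
    "edge": min_eigenvalue(edge),        # gradient in one direction only
    "corner": min_eigenvalue(corner),    # gradients in both directions
}
```

Only the corner patch gets a clearly positive score; the uniform patch and the straight edge both score zero, which is the aperture problem expressed as a rank-deficient matrix.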

Aperture Problem Example

Image from Gary Bradski's slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

– Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

– Translation is usually sufficient for small fragments, but
  • Perspective transforms and occlusions cause drift and loss
– Two complementary options
  • Kill tracklets when the minimum SSD is too large
  • Also compare with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez
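A minimal sketch of the two checks, under simplifying assumptions: patches are flat lists of intensities that have already been warped into a common frame (the affine warp itself is omitted), and `max_ssd` is a hypothetical threshold that would need tuning in practice.

```python
def ssd(a, b):
    """Sum of squared intensity differences between two equally sized patches."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def keep_tracklet(initial_patch, previous_patch, current_patch, max_ssd):
    """Keep a tracklet only if the current patch matches both the previous and the initial patch."""
    frame_to_frame_ok = ssd(previous_patch, current_patch) <= max_ssd
    initial_ok = ssd(initial_patch, current_patch) <= max_ssd
    return frame_to_frame_ok and initial_ok

initial = [0.1, 0.5, 0.9, 0.5]
slightly_changed = [0.12, 0.48, 0.9, 0.52]
almost_drifted = [0.85, 0.15, 0.15, 0.85]
drifted = [0.9, 0.1, 0.1, 0.9]

healthy = keep_tracklet(initial, initial, slightly_changed, max_ssd=0.05)
slow_drift = keep_tracklet(initial, almost_drifted, drifted, max_ssd=0.05)
```

In the second call the frame-to-frame difference is small, so only the comparison with the initial patch reveals the drift. This is why the two options are complementary.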

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  – fast
  – easily extended to image-to-image transformations with multiple parameters

What about deep learning or CNNs for tracking?

Recap: CaffeNet architecture

AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012. CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014. Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk".

Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016

Recent trackers development

https://github.com/foolwood/benchmark_results

[Figure labels: CNN, KCF]

Discriminative Tracking (Tracking by Detection)

t = 0

samples

labels: +1 +1 +1 −1 −1 −1

Classifier

t > 0

… Classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

"Cross-correlation is equivalent to an element-wise product in the Fourier domain"

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$

• where
  – $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  – $\times$ is the element-wise product
  – $^{*}$ is the complex conjugate (i.e. negate the imaginary part)

• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
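The identity can be checked numerically on a short signal. The sketch below uses a naive O(n^2) DFT written out by hand so that it needs nothing beyond the standard library; a real tracker would use an FFT routine instead. The signals and function names are illustrative.

```python
import cmath

def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a short sequence."""
    n = len(v)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        s = sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def cyclic_cross_correlation(x, w):
    """Direct definition: y[s] = sum over t of x[t] * w[(t + s) mod n]."""
    n = len(x)
    return [sum(x[t] * w[(t + s) % n] for t in range(n)) for s in range(n)]

def cross_correlation_fourier(x, w):
    """Same result via the Fourier domain: y_hat = conj(x_hat) * w_hat, element-wise."""
    x_hat, w_hat = dft(x), dft(w)
    y_hat = [a.conjugate() * b for a, b in zip(x_hat, w_hat)]
    return [value.real for value in dft(y_hat, inverse=True)]

x = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
w = [x[(u - 3) % len(x)] for u in range(len(x))]  # x cyclically shifted by 3 samples

direct = cyclic_cross_correlation(x, w)
fourier = cross_correlation_fourier(x, w)
peak = max(range(len(x)), key=lambda s: direct[s])
```

Both routes give the same values, and the correlation peak sits at the shift of 3 samples. Because the correlation is cyclic, the shifted copy wraps around the end of the signal rather than being cut off.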

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows replacing standard operations with fast Fourier operations
• The same idea can be applied, e.g., to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual-space representation $\boldsymbol{\alpha}$
• For many kernels, circulant data ⇒ circulant $K$ matrix
• Diagonalizing with the DFT for learning the classifier yields

$\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y} \;\Rightarrow\; \hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{x}\mathbf{x}} + \lambda}$

$K = C(\mathbf{k}^{\mathbf{x}\mathbf{x}})$, where $\mathbf{k}^{\mathbf{x}\mathbf{x}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)

Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
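As a sanity check of the closed form, the sketch below solves a small 1-D kernel ridge regression in the Fourier domain and then verifies the result against the dense system $(K + \lambda I)\boldsymbol{\alpha} = \mathbf{y}$. The Gaussian kernel, the signal `x`, the target `y` and the values of `sigma` and `lam` are illustrative assumptions. The naive DFT is O(n^2) and only stands in for an FFT.

```python
import cmath
import math

def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a short sequence."""
    n = len(v)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        s = sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def shift(x, m):
    """Cyclic shift of x by m samples."""
    n = len(x)
    return [x[(i - m) % n] for i in range(n)]

def gaussian(a, b, sigma):
    """Gaussian kernel exp(-||a - b||^2 / sigma^2)."""
    return math.exp(-sum((p - q) ** 2 for p, q in zip(a, b)) / sigma ** 2)

x = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0, 0.0]
n = len(x)
sigma, lam = 2.0, 0.1
y = [math.exp(-(min(i, n - i) ** 2) / 2.0) for i in range(n)]  # target peaked at zero shift

# Kernel auto-correlation: kernel between x and each of its cyclic shifts (first row of K).
k = [gaussian(x, shift(x, m), sigma) for m in range(n)]

# Fourier-domain solution: alpha_hat = y_hat / (k_hat + lambda), element-wise.
alpha_hat = [y_f / (k_f + lam) for y_f, k_f in zip(dft(y), dft(k))]
alpha = [value.real for value in dft(alpha_hat, inverse=True)]

# Dense check: build the full circulant kernel matrix and test (K + lambda I) alpha = y.
K = [[gaussian(shift(x, i), shift(x, j), sigma) for j in range(n)] for i in range(n)]
residual = max(
    abs(sum(K[i][j] * alpha[j] for j in range(n)) + lam * alpha[i] - y[i])
    for i in range(n)
)
```

The residual is at rounding-error level. The Fourier route only ever touched the n values of `k`, never the n x n matrix `K`, which was built here purely for verification.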

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{x}\mathbf{x}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$
• For the Gaussian kernel it yields

$\mathbf{k}^{\mathbf{x}\mathbf{x}'} = \exp\left(-\frac{1}{\sigma^{2}}\left(\|\mathbf{x}\|^{2} + \|\mathbf{x}'\|^{2} - 2\,\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{x}\mathbf{z}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{x}\mathbf{z}} \odot \hat{\boldsymbol{\alpha}}\right)$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features:

$k_i^{\mathbf{x}\mathbf{x}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$

Multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in this term.

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)

Training and detection (Matlab):

    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));  % sum over the channel dimension
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
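For readers without Matlab, here is a rough single-channel, 1-D Python port of the three functions, using a naive DFT in place of `fft2`. The test signal, the regression target and the values of `sigma` and `lam` are assumptions for illustration. The `interpolate` helper and its `rate` parameter are a hypothetical rendering of the "linear interpolation" model update mentioned in the slides.

```python
import cmath
import math

def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a short sequence."""
    n = len(v)
    sign = 1.0 if inverse else -1.0
    out = []
    for k in range(n):
        s = sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out

def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of two 1-D signals over all cyclic shifts."""
    n = len(x1)
    prod = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    c = [value.real for value in dft(prod, inverse=True)]
    energy = sum(a * a for a in x1) + sum(b * b for b in x2)
    return [math.exp(-abs(energy - 2.0 * cv) / (sigma ** 2 * n)) for cv in c]

def train(x, y, sigma, lam):
    """Classifier in the Fourier domain: alphaf = fft(y) / (fft(k) + lambda)."""
    kf = dft(kernel_correlation(x, x, sigma))
    return [y_f / (k_f + lam) for y_f, k_f in zip(dft(y), kf)]

def detect(alphaf, x, z, sigma):
    """Response of the classifier on all cyclic shifts of the new signal z."""
    kf = dft(kernel_correlation(z, x, sigma))
    return [value.real for value in dft([a * k for a, k in zip(alphaf, kf)], inverse=True)]

def interpolate(old, new, rate):
    """Running update of the model or classifier by linear interpolation."""
    return [(1.0 - rate) * o + rate * v for o, v in zip(old, new)]

n, delta = 16, 3
x = [math.exp(-((t - 5) ** 2) / 4.0) for t in range(n)]        # training signal
y = [math.exp(-(min(s, n - s) ** 2) / 2.0) for s in range(n)]  # target peaked at zero shift
z = [x[(t - delta) % n] for t in range(n)]                     # x moved right by delta samples

alphaf = train(x, y, sigma=0.5, lam=1e-4)
self_response = detect(alphaf, x, x, sigma=0.5)
response = detect(alphaf, x, z, sigma=0.5)
self_peak = max(range(n), key=lambda s: self_response[s])
peak = max(range(n), key=lambda s: response[s])
```

With this argument order the response peaks at index `(n - delta) % n`, i.e. at the shift s for which z[t] equals x[t + s]. Which sign the peak index carries depends on the order of the arguments to the correlation, so an implementation has to fix one convention and convert the peak index to a displacement accordingly.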

From KCF to Discriminative CF trackers

Basic

• Henriques et al. – CSK
  – raw grayscale pixel values as features
• Henriques et al. – KCF
  – HoG multi-channel features

Further work

• Danelljan et al. – DSST
  – PCA-HoG + grayscale pixel features
  – filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
  – HoG, color-naming and grayscale pixel features
  – quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. – SRDCF
  – spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  – allows the use of a much larger search region
  – more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  – features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  – for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al. – MDNet, T-CNN
  – CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? "The zero-order tracker"

Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution

[Figure: timeline of the VOT2013 to VOT2016 submission deadlines and the resulting papers]

• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• plus two VOT-TIR papers (69 and 70 coauthors)

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

Challenge | Perf. measures     | Dataset size   | Target box | Property  | Trackers tested
VOT2013   | ranks, A, R        | 16, s. manual  | manual     | per frame | 27
VOT2014   | ranks, A, R, EFO   | 25, s. manual  | manual     | per frame | 38
VOT2015   | EAO, A, R, EFO     | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016   | EAO, A, R, EFO     | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
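The A and R columns stand for accuracy and robustness. As a rough illustration, the sketch below computes a simplified version of both: accuracy as the mean bounding-box overlap on frames where the tracker still overlaps the target, and robustness as the number of frames with zero overlap. It assumes axis-aligned boxes given as (x, y, width, height) and leaves out the re-initialization and burn-in handling of the actual VOT protocol. The function names are made up.

```python
def overlap(a, b):
    """Intersection-over-union of two axis-aligned boxes given as (x, y, width, height)."""
    ix = max(0.0, min(a[0] + a[2], b[0] + b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[1] + a[3], b[1] + b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over tracked frames, and the count of frames with zero overlap."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    failures = sum(1 for o in overlaps if o == 0.0)
    return accuracy, failures

ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
```

Here the second prediction overlaps the target by one third and the third misses it entirely, giving an accuracy of 2/3 and one failure.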

Class of trackers tested

• Single-object, single-camera
• Short-term
  – Trackers performing without re-detection
• Causality
  – Tracker is not allowed to use any future frames
• No prior knowledge about the target
  – Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

• Sources: ALOV (315 seq.) [Smeulders et al. 2013] + PTR (~50 seq.) [Vojir et al. 2013] + OTB (~50 seq.) [Wu et al. 2013] + >30 new sequences from the VOT2015 committee = 443 sequences
• Filtered out:
  – grayscale sequences
  – targets of <400 pixels
  – poorly defined targets
  – artificially created sequences
• 356 sequences remain

[Figures: example of a poorly defined target; example of an artificially created sequence]

VOT Automatic Dataset Construction Protocol: cluster + sample

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)

Feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences

Construction (3/3): Sampling

• Requirement:
  – Diverse visual attributes
  – Challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty – better ground truth

• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, correlation filters
• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF

(1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (accuracy axis) with C-COT, TCNN, SSAT and MLDF marked]

VOT 2016 Tracking speed

• Top performers are the slowest
  – Plausible cause: CNN
• Real-time bound: Staple+
  – Decent accuracy
  – Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked]

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  – target learning (modelling, "template update")
  – integration of detection and temporal smoothness assumptions
  – representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception

Department of Cybernetics

Faculty of Electrical Engineering

Czech Technical University, Prague (established 1707)

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: a significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please


Sport examples

http://cvlab.epfl.ch/~lepetit

http://www.dartfish.com/en/media-gallery/videos/index.htm

Slide credit: Patrick Perez

Model-based Tracking: People and Faces

http://cvlab.epfl.ch/research/completed/realtime_tracking

http://www.cs.brown.edu/~black/3Dtracking.html

Slide credit: Patrick Perez

Tracking is commonly used in practice…

• Tracking has been a popular research topic for decades: see CVPR, ICCV, ECCV, …
• But there is no online course devoted to tracking…
• nor much coverage in computer vision courses
• nor is it well covered in textbooks

Is it clear what tracking is?

Video credit: Helmut Grabner

Slide credit: Jiri Matas

Tracking Formulation - Literature

Surprisingly little is said about tracking in standard textbooks. Coverage is limited to optic flow plus some basic trackers, e.g. Lucas-Kanade.

Definition (0):

[Forsyth and Ponce, Computer Vision: A Modern Approach, 2003]

"Tracking is the problem of generating an inference about the motion of an object given a sequence of images. Good solutions of this problem have a variety of applications…"

Slide credit: Jiri Matas

Formulation (1): Tracking

Establishing point-to-point correspondences in consecutive frames of an image sequence

Notes:

• The concept of an "object" in the F&P definition disappeared
• If an algorithm correctly established such correspondences, would that be a perfect tracker?
• tracking = motion estimation

Slide credit: Jiri Matas

Tracking is Motion Estimation / Optic Flow

Motion "pattern", camera tracking

Dense motion field

http://www.cs.cmu.edu/~saada/Projects/CrowdSegmentation

http://www.youtube.com/watch?v=ckVQrwYIjAs

Sparse motion field estimate

Motion field

• The motion field is the projection of the 3D scene motion into the image

Slide credits: Jiri Matas, James Hays, Jason Corso

Optic Flow

Standard formulation:

• At every pixel, a 2D displacement is estimated between consecutive frames

Missing:

• occlusion/disocclusion handling: pixels visible in one image only
  – in the standard formulation, "don't know" is not an answer
• considering the 3D nature of the world
• large displacement handling: only recently addressed (EpicFlow 2015)

Practical issues hindering progress in optic flow:

• is the ground truth ever known?
  – learning and performance evaluation are problematic (synthetic sequences?)
• requires generic regularization (smoothing)
• failure (assumption validity) is not easy to detect

In certain applications, tracking is motion estimation on a part of the image with specific constraints: augmented reality, sports analysis.

Formulation (1): Tracking

Establishing point-to-point correspondences in consecutive frames of an image sequence

Notes:

• The concept of an "object" in the F&P definition disappeared
• If an algorithm correctly established such correspondences, would that be a perfect tracker?
• tracking = motion estimation

Consider the Bolt sequence

Definition (2): Tracking

Given an initial estimate of its position, locate X in a sequence of images

Where X may mean:

• A (rectangular) region
• An "interest point" and its neighbourhood
• An "object"

This definition is adopted, e.g., in a recent book by Maggio and Cavallaro, Video Tracking, 2011

Smeulders et al., T-PAMI 2013:

"Tracking is the analysis of video sequences for the purpose of establishing the location of the target over a sequence of frames (time), starting from the bounding box given in the first frame"

Formulation (3): Tracking as Segmentation

J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010

Tracking as model-based segmentation

Tracking as segmentation

http://vision.ucsd.edu/~kbranson/research/cvpr2005.html

http://www2.imm.dtu.dk/~aam/tracking

• heart

A "standard" CV tracking method output

Approximate motion estimation, approximate segmentation. Neither good optic flow nor precise segmentation is required.

Formulation (4): Tracking

Given an initial estimate of the pose and state of X, in all images in a sequence (in a causal manner):

1. estimate the pose and state of X
2. (optionally) update the model of X

• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
  – a priori information (appearance, shape, dynamics, …)
  – labeled data ("track this") + unlabeled data = the sequences
• Causal: for estimation at time T, use information from times t ≤ T only
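The formulation maps onto a simple loop. The sketch below is a hypothetical skeleton only: the `estimate` and `update_model` callbacks, the `model` argument and the toy 1-D "frames" are assumptions made for illustration. The one property it is meant to show is causality, since the estimate for a frame never sees later frames.

```python
def track(frames, initial_pose, model, estimate, update_model=None):
    """Causal tracking loop: estimate the pose per frame, optionally update the model."""
    poses = [initial_pose]
    for frame in frames[1:]:
        # Step 1: estimate the pose from the current frame and past information only.
        pose = estimate(model, frame, poses[-1])
        poses.append(pose)
        # Step 2 (optional): update the model with the newly labeled frame.
        if update_model is not None:
            model = update_model(model, frame, pose)
    return poses

def argmax_near(model, frame, previous_pose, radius=2):
    """Toy estimator: brightest sample within a small window around the previous pose."""
    lo = max(0, previous_pose - radius)
    hi = min(len(frame), previous_pose + radius + 1)
    return max(range(lo, hi), key=lambda i: frame[i])

# Toy sequence: a single bright sample moving through a 1-D "image".
frames = [[1.0 if i == p else 0.0 for i in range(10)] for p in (2, 3, 5, 6)]
poses = track(frames, initial_pose=2, model=None, estimate=argmax_near)
```

The toy estimator ignores the model entirely. A real tracker would replace it with, for example, a template matcher or a correlation filter, and would pass an `update_model` callback to adapt the appearance over time.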

Tracking for Black Mirror Blocking

Tracking in 6D

Tracking-Learning-Detection (TLD)

A "miracle": Tracking a Transparent Object

Video credit: Helmut Grabner

H. Grabner, H. Bischof, On-line boosting and vision, CVPR 2006

Tracking the "Invisible"

H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010

Other Tracking Problems

… multiple object tracking …

Multi-object Tracking

Other Tracking Problems

Cell division

http://www.youtube.com/watch?v=rgLJrvoX_qo

Three rounds of cell division in Drosophila melanogaster

http://www.youtube.com/watch?v=YFKA647w4Jg

splitting and merging events …

So I want to track

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

What is here in libraries OpenCV

bull KLT tracker (1981)

bull CAMshift and Meanshift (1998)

bull TLD (2011)

bull MedianFLow (2010)

bull Boosting (2006)

bull MIL (2009)

and KCF (2012) in opencv_contrib38150

What is here in libraries BoofCV

bull SparseFlow (KLT tracker) (1991)

bull MeanShift (1998)

bull TLD (2011)

bull KCF (2012)

39150

httpsgithubcomlessthanoptimalBoofCV

Computer vision librar

Bad news OpenCV

40150httpwwwvotchallengenet

Reference implementation

41150

But authors often publish their own implementation on githubhellip

git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit

Good news

42150

Good news Not so

43150

Good news Not so

44150

Good newsEasy to contribute to open source

Just implement some modern tracking algorithm

45150

So we need to understand how tracking works

Classic

KLT tracker

46150

The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

slide creditTomas Svoboda 50150

Original Lucas-Kanade algorithm II

slide creditTomas Svoboda 51150

Original Lucas-Kanade algorithm III

slide creditTomas Svoboda 52150

Original Lucas-Kanade algorithm IV

slide creditTomas Svoboda 53150

Original Lucas-Kanade summary

slide credit Tomas Svoboda 54150

KLT Tracker

For good conditioning patch must be

texturedstructured enough

bull Uniform patch no information

bull Contour element aperture problem (one dimensional

information)

bull Corners blobs and texture best estimate

[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]

slide credit

Patrick Perez

55150

Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

ndash Translation is usually sufficient for small fragments but

bull Perspective transforms and occlusions cause drift and loss

ndash Two complementary options

bull Kill tracklets when minimum SSD too large

bull Compare as well with initial patch under affine transform (warp) assumption

slide credit

Patrick Perez

58150

Characteristics of KLT

bull cost function sum of squared intensity differences

between template and window

bull optimization technique gradient descent

bull model learning no update last frame convex

combination

bull attractive properties

ndashfast

ndasheasily extended to image-to-image transformations with

multiple parameters

59150

What about deep

learning or

CNN for tracking

60150

61150

AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo

Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016

Recap CaffeNet architecture

63

Recent trackers development

74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic

bull Henriques et al ndash CSK

ndash raw grayscale pixel values as features

bull Henriques et al ndash KCF

ndash HoG multi-channel features

Further work

bull Danelljan et al ndash DSST

ndash PCA-HoG + grayscale pixels features

ndash filters for translation and for scale (in the scale-space pyramid)

bull Li et al ndash SAMF

ndash HoG color-naming and grayscale pixels features

ndash quantize scale space and normalize each scale to one size by bilinear

inter rarr only one filter on normalized size

80

Discriminative Correlation Filters Trackers

bull Danelljan et al ndashSRDCF

ndash spatial regularization in the learning process

rarr limits boundary effect

rarr penalize filter coefficients depending on their spatial location

ndash allows to use much larger search region

ndash more discriminative to background (more training data)

CNN-based Correlation Trackers

bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT

2016)

bull Ma et al

ndash features VGG-Net pretrained on ImageNet dataset extracted from

third fourth and fifth convolution layer

ndash for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)

bull Nam et al ndash MDNet T-CNN

ndash CNN classification (3 convolution layers and 2 fully connected layers)

81

Evaluation of Trackers

Tracking Which methods work

83150

Tracking Which methods work

84150

What works ldquoThe zero-order trackerrdquo

85150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

3000

1500

51 Coauthors 14pgs

ICCV2013

57 Coauthors 27pgs

ECCV2014

128 Coauthors 24pgs

ICCV2015

141 Coauthors 44pgs

ECCV2016

+ VOT-TIRpaper

(69 coauth)

+ VOT-TIRpaper

(70 coauth)

VOT2014submission deadline

VOT2015submission deadline

VOT2016submission deadline

VOT2013submission deadline

8639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT challenge evolution

bull Gradual increase of dataset size

bull Gradual refinement of dataset construction

bull Gradual refinement of performance measures

bull Gradual increase of tested trackers

Perf Measures Dataset size Target box Property Trackers tested

VOT2013 ranks A R 16 s manual manual per frame 27

VOT2014 ranks A R EFO 25 s manual manual per frame 38

VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR

VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR

8739

Class of trackers tested

bull Single-object single-camera

bull Short-term

ndashTrackers performing without re-detection

bull Causality

ndashTracker is not allowed to use any future frames

bull No prior knowledge about the target

ndashOnly a single training example ndash BBox in the first frame

bull Object state encoded by a bounding box

88150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (13) Sequence candidates

ALOV (315 seq)[Smeulders et al2013]

Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences

Example Poorly defined target Example Artificially created

356 sequences

PTR (~50 seq)[Vojir et al2013]

+OTB (~50 seq)

[Wu et al2013]

+

gt30 new sequencesfrom VOT2015

committee

+

443sequences

VOT Automatic Dataset Construction Protocol

cluster + sample

8939

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (23) Clustering

bull Approximately annotate targets

bull 11 global attributes estimated

automatically for 356 sequences (eg blur camera motion object motion)

bull Cluster into K = 28 groups (automatic selection of K)

Feature encoding

11 dim

Affinity Propagation [Frey Dueck 2007]

Cluster similar sequences

9039

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (33) Sampling

bull Requirement

bull Diverse visual attributes

bull Challenging subset

bull Global visual attributes computed

bull Tracking difficulty attribute Applied FoT ASMS KCF trackers

bull Developed a sampling strategy that sampled

challenging sequences while keeping the global

attributes diverse

9139

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT201516 dataset 60 sequences

9239

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Object annotation

bull Automatic bounding box placement

1 Segment the target (semi-automatic)

2 Automatically fit a bounding box by optimizing a cost function

9339

Kristan et al VOT2016 results

Sequence ranking

bull Among the most challenging sequences

bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)

9442

Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)

Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT Results Realtime

bull Flow-based Mean Shift-based Correlation filters

bull Engineering use of basic features

2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)

2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)

ASMSBDF

FoT

2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)

PLTFoT

CCMS

9639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Results

bull C-COT slightly ahead of TCNN

bull Most accurate SSAT

bull Most robust C-COT and MLDF

Overlap curves

9742

(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple

C-COT

TCNNSSAT

MLDF

AR-raw plot

Acc

ura

cy

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Tracking speed

bull Top-performers slowest bull Plausible cause CNN

bull Real-time bound Staple+

bull Decent accuracy

bull Decent robustness

Note the speed in some Matlab trackers has been significantly underestimated by the toolkit since it was measuring also the Matlabrestart time The EFOs of Matlab trackers are in fact higher than stated in this figure

9842

C-COT TCNNSSAT MLDF

Staple+

Staple+

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics, Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

Sport examples

http://cvlab.epfl.ch/~lepetit/
http://www.dartfish.com/en/media-gallery/videos/index.htm

Slide credit: Patrick Perez

Model-based Tracking: People and Faces

http://cvlab.epfl.ch/research/completed/realtime_tracking/
http://www.cs.brown.edu/~black/3Dtracking.html

Slide credit: Patrick Perez

Tracking is commonly used in practice...

• Tracking has been a popular research topic for decades: see CVPR, ICCV, ECCV, ...
• But there is no online course devoted to tracking...
• nor much coverage in computer vision courses
• nor is it well covered in textbooks

Is it clear what tracking is?

Video credit: Helmut Grabner
Slide credit: Jiri Matas

Tracking Formulation - Literature

Surprisingly little is said about tracking in standard textbooks.
Coverage is limited to optic flow plus some basic trackers, e.g. Lucas-Kanade.

Definition (0) [Forsyth and Ponce, Computer Vision: A Modern Approach, 2003]:
"Tracking is the problem of generating an inference about the motion of an object given a sequence of images. Good solutions of this problem have a variety of applications..."

Slide credit: Jiri Matas

Formulation (1): Tracking

Establishing point-to-point correspondences in consecutive frames of an image sequence.

Notes:
• The concept of an "object" in the F&P definition disappeared
• If an algorithm correctly established such correspondences, would that be a perfect tracker?
• tracking = motion estimation

Slide credit: Jiri Matas

Tracking is Motion Estimation / Optic Flow

Slide credit: Jiri Matas

Motion "pattern", camera tracking
• Dense motion field
• http://www.cs.cmu.edu/~saada/Projects/CrowdSegmentation/
• http://www.youtube.com/watch?v=ckVQrwYIjAs
• Sparse motion field estimate

Slide credit: Jiri Matas

Motion field
• The motion field is the projection of the 3D scene motion into the image

Slide credit: James Hays, Jason Corso

Optic Flow

Standard formulation:
• At every pixel, a 2D displacement is estimated between consecutive frames

Missing:
• occlusion / disocclusion handling: pixels visible in one image only
  - in the standard formulation, "don't know" is not an answer
• considering the 3D nature of the world
• large displacement handling - only recently addressed (EpicFlow, 2015)

Practical issues hindering progress in optic flow:
• is the ground truth ever known?
  - learning and performance evaluation are problematic (synthetic sequences?)
• requires generic regularization (smoothing)
• failure (assumption validity) is not easy to detect

In certain applications, tracking is motion estimation on a part of the image with specific constraints: augmented reality, sports analysis.

Formulation (1): Tracking (revisited)

Establishing point-to-point correspondences in consecutive frames of an image sequence. If an algorithm correctly established such correspondences, would that be a perfect tracker?

Consider the Bolt sequence.

Definition (2): Tracking

Given an initial estimate of its position, locate X in a sequence of images, where X may mean:
• A (rectangular) region
• An "interest point" and its neighbourhood
• An "object"

This definition is adopted e.g. in a recent book by Maggio and Cavallaro, Video Tracking, 2011.

Smeulders, T-PAMI'13:
Tracking is the analysis of video sequences for the purpose of establishing the location of the target over a sequence of frames (time), starting from the bounding box given in the first frame.

Formulation (3): Tracking as Segmentation

J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010

Tracking as model-based segmentation

Tracking as segmentation
• http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
• http://www2.imm.dtu.dk/~aam/tracking/
• heart

A "standard" CV tracking method output

Approximate motion estimation, approximate segmentation. Neither good optic flow nor precise segmentation is required.

Formulation (4): Tracking

Given an initial estimate of the pose and state of X, in all images in a sequence (in a causal manner):
1. estimate the pose and state of X
2. (optionally) update the model of X

• Pose: any geometric parameter (position, scale, ...)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
  - a priori information (appearance, shape, dynamics, ...)
  - labeled data ("track this") + unlabeled data = the sequences
• Causal: for estimation at time T, use information from times t ≤ T
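The two-step loop of Formulation (4) can be sketched as a small driver. Everything here is illustrative: the `Box` type, the `init`/`update` method names and the `StaticTracker` baseline (which keeps reporting the initial box and never updates its model) are assumptions for the example, not an interface taken from the lecture or from any library.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Pose of the target: position and scale of an axis-aligned rectangle."""
    x: float
    y: float
    w: float
    h: float


class StaticTracker:
    """Hypothetical trivial baseline: the target is assumed not to move."""

    def init(self, frame, box):
        # Labeled data: the single "track this" example in the first frame.
        self.box = box

    def update(self, frame):
        # Step 1: estimate pose/state (here: unchanged).
        # Step 2: model update is optional and skipped by this baseline.
        return self.box


def run_tracker(tracker, frames, init_box):
    """Run a tracker causally: the estimate at time T only sees frames t <= T."""
    frames = iter(frames)
    tracker.init(next(frames), init_box)
    trajectory = [init_box]
    for frame in frames:
        trajectory.append(tracker.update(frame))
    return trajectory
```

Because `run_tracker` consumes the frames one at a time from an iterator, a tracker plugged into it cannot look at future frames, which is the causality constraint stated in the last bullet.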

Tracking for Black Mirror Blocking

Tracking in 6D

Tracking-Learning-Detection (TLD)

A "miracle": Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof, On-line boosting and vision, CVPR 2006

Tracking the "Invisible"
H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010

Other Tracking Problems

... multiple object tracking ...

Multi-object Tracking

Other Tracking Problems: cell division
• http://www.youtube.com/watch?v=rgLJrvoX_qo
• Three rounds of cell division in Drosophila melanogaster
• http://www.youtube.com/watch?v=YFKA647w4Jg
• splitting and merging events ...

So I want to track

Just link to some computer vision lib

Python:
    import cv2
or C++:
    #include <opencv2/core.hpp>
or Java:
    import org.opencv.core.Core;
    import org.opencv.core.Mat;

What is here in libraries: OpenCV

• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
• and KCF (2012) in opencv_contrib

What is here in libraries: BoofCV

• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)

https://github.com/lessthanoptimal/BoofCV

Computer vision librar

Bad news: OpenCV

http://www.votchallenge.net

Reference implementation

But authors often publish their own implementation on GitHub...

    git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news

Good news: not so

Good news: easy to contribute to open source
Just implement some modern tracking algorithm

So we need to understand how tracking works

Classic: KLT tracker

The KLT Tracker
Slide credit: Kris Kitani

KLT Tracker
Slide credit: Tomas Svoboda

Importance in Computer Vision

• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [54]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU was dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms

(1) http://www.ces.clemson.edu/~stb/klt/
(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment

The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.

[Slides: Original Lucas-Kanade algorithm I-IV; Original Lucas-Kanade summary]

Slide credit: Tomas Svoboda

KLT Tracker

For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]

Slide credit: Patrick Perez

Aperture Problem Example

Image from Gary Bradski's slides

If we look through a small hole...
Video by Olha Mishkina

Multi-resolution Lucas-Kanade
- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

- Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez
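The two monitoring options can be sketched as a pair of SSD checks. This is a simplified illustration: the function names and both thresholds are hypothetical, and the affine warp of the second option is assumed to have been applied already, so `current_patch` is expected to be resampled into the frame of the initial patch before the comparison.

```python
def ssd(patch_a, patch_b):
    """Sum of squared differences between two equally sized patches (lists of rows)."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(patch_a, patch_b)
               for a, b in zip(row_a, row_b))


def keep_tracklet(initial_patch, previous_patch, current_patch,
                  max_ssd_previous, max_ssd_initial):
    """Return False when the tracklet should be killed.

    Option 1: the residual against the previous frame is too large (sudden loss).
    Option 2: the residual against the initial patch is too large (slow drift).
    """
    if ssd(previous_patch, current_patch) > max_ssd_previous:
        return False
    return ssd(initial_patch, current_patch) <= max_ssd_initial
```

The second check is what catches drift: frame-to-frame residuals can each stay small while the patch gradually slides off the original feature.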

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters
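To make the cost function concrete, here is a minimal 1-D sketch of translation-only Lucas-Kanade alignment. It minimizes the sum over x of (I(start + x + p) - T(x))^2 over a single shift p. The update used is the Gauss-Newton step commonly associated with Lucas-Kanade; the 1-D setting, the linear interpolation and the function names are simplifying assumptions for illustration.

```python
import math


def sample(signal, x):
    """Linearly interpolate `signal` at real position x (clamped to the ends)."""
    x = min(max(x, 0.0), len(signal) - 1.0)
    i = min(int(x), len(signal) - 2)
    a = x - i
    return (1 - a) * signal[i] + a * signal[i + 1]


def lucas_kanade_1d(template, image, start, p=0.0, iters=50, tol=1e-6):
    """Estimate the shift p such that image[start + x + p] ~= template[x]."""
    for _ in range(iters):
        num = den = 0.0
        for x, t in enumerate(template):
            pos = start + x + p
            # Image gradient at the warped position (central difference).
            grad = 0.5 * (sample(image, pos + 1) - sample(image, pos - 1))
            err = t - sample(image, pos)
            num += grad * err
            den += grad * grad
        if den < 1e-12:
            # Uniform patch: no gradient, hence no information about the shift.
            break
        dp = num / den
        p += dp
        if abs(dp) < tol:
            break
    return p
```

The `den` term is the 1-D counterpart of the conditioning requirement above: on a uniform patch it vanishes and the shift cannot be estimated. Like the real algorithm, the sketch only converges when the initial displacement is small relative to the structure size, which is what the multi-resolution pyramid addresses.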

What about deep learning or CNN for tracking?

Recap: CaffeNet architecture

AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016

Recent trackers development

https://github.com/foolwood/benchmark_results

[Figure labels: CNN, KCF]

Discriminative Tracking (Tracking by Detection)

• t = 0: samples with labels +1, +1, +1, -1, -1, -1 are used to train a classifier
• t > 0: ... classify subwindows to find the target

Connection to Correlation

The Convolution Theorem: "Cross-correlation is equivalent to an element-wise product in the Fourier domain"

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$

• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
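The identity can be checked numerically on a short 1-D signal. The sketch below uses a naive O(n^2) DFT written with the standard library so that it has no dependencies; in practice an FFT routine would replace `dft`. The function names are made up for the example, and the signals are assumed to be real-valued.

```python
import cmath


def dft(v, inverse=False):
    """Naive O(n^2) discrete Fourier transform; the inverse includes the 1/n factor."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out


def circular_cross_correlation(x, w):
    """Direct definition: y[s] = sum_t x[t] * w[(t + s) mod n] (indices wrap around)."""
    n = len(x)
    return [sum(x[t] * w[(t + s) % n] for t in range(n)) for s in range(n)]


def correlation_via_dft(x, w):
    """Same result through the Fourier domain: y_hat = conj(x_hat) * w_hat."""
    y_hat = [a.conjugate() * b for a, b in zip(dft(x), dft(w))]
    return [c.real for c in dft(y_hat, inverse=True)]
```

The modulo in `circular_cross_correlation` is the cyclic wrap-around mentioned in the last bullet: both routes compute the correlation of periodically extended signals.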

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows replacing standard operations with fast Fourier operations
• The same idea can be applied, e.g., to Kernel Ridge Regression with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual-space representation $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$
• For many kernels, circulant data $\Rightarrow$ circulant $K$ matrix: $K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
  $\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$
• Fast solution in $\mathcal{O}(n \log n)$; typical kernel algorithms are $\mathcal{O}(n^2)$ or higher

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$, with elements $k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$
• For the Gaussian kernel it yields
  $\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$
  (multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term)
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features

Kernelized Correlation Filters: KCF Tracker

• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)

Training and detection (Matlab); note the sum over the channel dimension in the kernel computation:

    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      % cross-correlation in the Fourier domain, summed over channels (dim 3)
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      % squared distance between x1 and every cyclic shift of x2
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
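For experimenting without Matlab, the three functions can be transcribed to a 1-D, single-channel Python sketch. This is an illustration under stated assumptions, not the reference KCF code: it uses a naive O(n^2) DFT from the standard library instead of an FFT (so it does not have the O(n log n) cost), it drops the channel sum, and the `interpolate` helper with its `rate` parameter is a hypothetical name for the linear-interpolation model update.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive O(n^2) DFT; a real implementation would call an FFT instead."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x1 and all cyclic shifts of x2 (single channel)."""
    n = len(x1)
    prod = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    c = [v.real for v in dft(prod, inverse=True)]
    energy = sum(a * a for a in x1) + sum(b * b for b in x2)
    return [math.exp(-abs(energy - 2 * ci) / (sigma ** 2 * n)) for ci in c]


def train(x, y, sigma, lam):
    """Kernel ridge regression in the Fourier domain: alpha_hat = y_hat / (k_hat + lambda)."""
    k_hat = dft(kernel_correlation(x, x, sigma))
    return [yf / (kf + lam) for yf, kf in zip(dft(y), k_hat)]


def detect(alphaf, x, z, sigma):
    """Response of the classifier for every cyclic shift of the search window z."""
    k_hat = dft(kernel_correlation(z, x, sigma))
    return [v.real for v in dft([a * k for a, k in zip(alphaf, k_hat)], inverse=True)]


def interpolate(old, new, rate):
    """Linear-interpolation update of the model or of alphaf (rate is an assumed parameter)."""
    return [(1 - rate) * o + rate * n for o, n in zip(old, new)]
```

In use, `y` is a label vector peaked at index 0; after training on `x`, calling `detect` on a cyclically shifted copy of `x` gives a response whose peak moves away from index 0 by the amount of the shift (modulo the signal length), which is how the tracker reads off the translation.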

From KCF to Discriminative CF trackers

Basic
• Henriques et al. - CSK
  - raw grayscale pixel values as features
• Henriques et al. - KCF
  - HoG multi-channel features

Further work
• Danelljan et al. - DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al. - SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers
• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows the use of a much larger search region
  - more discriminative to background (more training data)

CNN-based Correlation Trackers
• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)
• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: which methods work?

What works? "The zero-order tracker"

VOT community evolution

• VOT2013 (ICCV 2013): 51 coauthors, 14 pages
• VOT2014 (ECCV 2014): 57 coauthors, 27 pages
• VOT2015 (ICCV 2015): 128 coauthors, 24 pages, + VOT-TIR paper (69 coauthors)
• VOT2016 (ECCV 2016): 141 coauthors, 44 pages, + VOT-TIR paper (70 coauthors)

[Figure: timeline marking the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines; axis ticks at 1500 and 3000]

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

          | Perf. measures  | Dataset size   | Target box | Property  | Trackers tested
VOT2013   | ranks, A, R     | 16, s. manual  | manual     | per frame | 27
VOT2014   | ranks, A, R, EFO| 25, s. manual  | manual     | per frame | 38
VOT2015   | EAO, A, R, EFO  | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016   | EAO, A, R, EFO  | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
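As a rough illustration of the A and R measures in the table, assuming A is accuracy (average bounding-box overlap while the tracker is on target) and R is robustness (number of failures, i.e. frames with zero overlap), a simplified computation could look like the sketch below. The VOT toolkit additionally re-initializes the tracker after each failure and handles the frames around a re-initialization specially; that logic, and the EAO and EFO measures, are deliberately left out. The function names and the (x, y, w, h) box format are choices made for this example.

```python
def overlap(a, b):
    """Intersection-over-union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Simplified A/R: mean overlap over non-failed frames, and the failure count."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    valid = [o for o in overlaps if o > 0.0]
    accuracy = sum(valid) / len(valid) if valid else 0.0
    return accuracy, failures
```

Reporting the two numbers separately is what lets the results distinguish a tracker that is precise but fails often from one that is loose but rarely loses the target.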

Class of trackers tested

• Single-object, single-camera
• Short-term
  - trackers performing without re-detection
• Causality
  - the tracker is not allowed to use any future frames
• No prior knowledge about the target
  - only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

• Candidate pool: ALOV (315 seq.) [Smeulders et al., 2013] + PTR (~50 seq.) [Vojir et al., 2013] + OTB (~50 seq.) [Wu et al., 2013] + >30 new sequences from the VOT2015 committee = 443 sequences
• Filtered out:
  - grayscale sequences
  - targets of <400 pixels
  - poorly defined targets
  - artificially created sequences
• After filtering: 356 sequences
• VOT Automatic Dataset Construction Protocol: cluster + sample

[Figures: example of a poorly defined target; example of an artificially created sequence]

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)

Pipeline: feature encoding (11 dim.) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences

Construction (3/3): Sampling

• Requirements:
  - diverse visual attributes
  - challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that samples challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences


Model-based Tracking People and Faces

httpcvlabepflchresearchcompletedrealtime_tracking httpwwwcsbrownedu~black3Dtrackinghtml

Slide Credit Patrick Perez 8150

Tracking is commonly used in practicehellip

9150

bull Tracking is popular research topic for decades

see CVPR ICCV ECCV hellip

bull But there is no online course devoted to trackinghellip

bull nor big coverage in computer vision courses

bull nor it is well covered in textbooks

Is it clear what tracking is

video credit

Helmut

Grabner

10150Slide credit Jiri Matas

Tracking Formulation - Literature

Surprisingly little is said about tracking in standard textbooks

Limited to optic flow plus some basic trackers eg Lucas-Kanade

Definition (0)

[Forsyth and Ponce Computer Vision A modern approach 2003]

ldquoTracking is the problem of generating an inference about the

motion of an object given a sequence of images

Good solutions of this problem have a variety of applicationshelliprdquo

11150Slide credit Jiri Matas

Formulation (1) Tracking

Establishing point-to-point correspondences

in consecutive frames of an image sequence

Notes

bull The concept of an ldquoobjectrdquo in FampP definition disappeared

bull If an algorithm correctly established such correspondences

would that be a perfect tracker

bull tracking = motion estimation

12150Slide credit Jiri Matas

Tracking is Motion Estimation Optic Flow

13150Slide credit Jiri Matas

Tracking is Motion Estimation Optic Flow

14150

Tracking is Motion Estimation Optic Flow

Motion ldquopatternrdquo Camera tracking

Dense motion field

httpwwwcscmuedu~saadaProjectsCrowdSegmentation

httpwwwyoutubecomwatchv=ckVQrwYIjAs

Sparse motion field estimate

15150Slide credit Jiri Matas

Tracking is Motion Estimation Optic Flow

16150

Motion field

bull The motion field is the projection of the 3D scene

motion into the image

Slide credit James Hays

Tracking is Motion Estimation Optic Flow

17150Slide credit Jason Corso

Optic Flow

Standard formulation

bull At every pixel 2D displacement is estimated between consecutive frames

Missing

bull occlusion ndash disocclusion handling pixels visible in one image only

- in the standard formulation ldquodonrsquot knowrdquo is not an answer

bull considering the 3D nature of the world

bull large displacement handling - only recently addressed (EpicFlow 2015)

Practical issues hindering progress in optic flow

bull is the ground truth ever known

- learning and performance evaluation problematic (synthetic sequences )

bull requires generic regularization (smoothing)

bull failure (assumption validity) not easy to detect

In certain applications tracking is motion estimation on a part of the image

with specific constraints augmented reality sports analysis 18150

Formulation (1) Tracking

Establishing point-to-point correspondences

in consecutive frames of an image sequence

Notes

bull The concept of an ldquoobjectrdquo in FampP definition disappeared

bull If an algorithm correctly established such correspondences

would that be a perfect tracker

bull tracking = motion estimation

Consider the Bolt sequence

19150

Definition (2) Tracking

Given an initial estimate of its position

locate X in a sequence of images

Where X may mean

bull A (rectangular) region

bull An ldquointerest pointrdquo and its neighbourhood

bull An ldquoobjectrdquo

This definition is adopted eg in a recent book by

Maggio and Cavallaro Video Tracking 2011

Smeulders T-PAMI13

Tracking is the analysis of video sequences for the

purpose of establishing the location of the target

over a sequence of frames (time) starting from

the bounding box given in the first frame

20150

Formulation (3) Tracking as Segmentation

J Fan et al Closed-Loop Adaptation for Robust Tracking ECCV 2010

21150

Tracking as model-based segmentation

22150

Tracking as segmentation

httpvisionucsdedu~kbransonresearchcvpr2005html

httpwww2immdtudk~aamtracking

bull heart

23150

A ldquostandardrdquo CV tracking method output

24150

Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required

Formulation (4) Tracking

Given an initial estimate of the pose and state of X

In all images in a sequence (in a causal manner)

1 estimate the pose and state of X

2 (optionally) update the model of X

bull Pose any geometric parameter (position scale hellip)

bull State appearance shapesegmentation visibility articulations

bull Model update essentially a semi-supervised learning problem

ndash a priori information (appearance shape dynamics hellip)

ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences

bull Causal for estimation at T use information from time t middot T

25150

Tracking for Black Mirror Blocking

26150

Tracking in 6D

27150

Tracking-Learning-Detection (TLD)

28150

A ldquomiraclerdquo Tracking a Transparent Object

video credit

Helmut

Grabner

H Grabner H Bischof On-line boosting and vision CVPR 2006

29150

Tracking the ldquoInvisiblerdquo

H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010

30150

video

Other Tracking Problems

helliphellip multiple object tracking hellip

32150

Multi-object Tracking

Other Tracking Problems

Cell division

httpwwwyoutubecomwatchv=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster

httpwwwyoutubecomwatchv=YFKA647w4Jg

splitting and merging events hellip

34150

So I want to track

35150

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

What is here in libraries OpenCV

bull KLT tracker (1981)

bull CAMshift and Meanshift (1998)

bull TLD (2011)

bull MedianFLow (2010)

bull Boosting (2006)

bull MIL (2009)

and KCF (2012) in opencv_contrib38150

What is here in libraries BoofCV

bull SparseFlow (KLT tracker) (1991)

bull MeanShift (1998)

bull TLD (2011)

bull KCF (2012)

39150

httpsgithubcomlessthanoptimalBoofCV

Computer vision librar

Bad news OpenCV

40150httpwwwvotchallengenet

Reference implementation

41150

But authors often publish their own implementation on githubhellip

git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit

Good news

42150

Good news Not so

43150

Good news Not so

44150

Good newsEasy to contribute to open source

Just implement some modern tracking algorithm

45150

So we need to understand how tracking works

Classic

KLT tracker

46150

The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

slide creditTomas Svoboda 50150

Original Lucas-Kanade algorithm II

slide creditTomas Svoboda 51150

Original Lucas-Kanade algorithm III

slide creditTomas Svoboda 52150

Original Lucas-Kanade algorithm IV

slide creditTomas Svoboda 53150

Original Lucas-Kanade summary

slide credit Tomas Svoboda 54150

KLT Tracker

For good conditioning patch must be

texturedstructured enough

bull Uniform patch no information

bull Contour element aperture problem (one dimensional

information)

bull Corners blobs and texture best estimate

[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]

slide credit

Patrick Perez

55150

Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

ndash Translation is usually sufficient for small fragments but

bull Perspective transforms and occlusions cause drift and loss

ndash Two complementary options

bull Kill tracklets when minimum SSD too large

bull Compare as well with initial patch under affine transform (warp) assumption

slide credit

Patrick Perez

58150

Characteristics of KLT

bull cost function sum of squared intensity differences

between template and window

bull optimization technique gradient descent

bull model learning no update last frame convex

combination

bull attractive properties

ndashfast

ndasheasily extended to image-to-image transformations with

multiple parameters

59150

What about deep learning, or CNNs for tracking?

Recap: CaffeNet architecture

AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016

Recent trackers development

https://github.com/foolwood/benchmark_results

(figure labels: CNN, KCF)

Discriminative Tracking (Tracking by Detection)

• t = 0: samples with labels +1 +1 +1 −1 −1 −1 are used to train a classifier
• t > 0: … classify subwindows to find the target

Connection to Correlation

The Convolution Theorem:
"Cross-correlation is equivalent to an element-wise product in the Fourier domain"

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$

• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
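The statement can be checked numerically. The sketch below compares the cyclic cross-correlation computed directly with the one obtained from an element-wise product of DFTs. It uses a naive O(n²) DFT written in plain Python so that it runs without dependencies; a real tracker would use an FFT library. The function names are assumptions of this example.

```python
import cmath

def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out

def cyclic_xcorr(x, w):
    """Direct definition: y[i] = sum_t x[t] * w[(t + i) mod n]."""
    n = len(x)
    return [sum(x[t] * w[(t + i) % n] for t in range(n)) for i in range(n)]

def cyclic_xcorr_fourier(x, w):
    """Same result via y_hat = conj(x_hat) * w_hat, element-wise."""
    yf = [a.conjugate() * b for a, b in zip(dft(x), dft(w))]
    return [c.real for c in dft(yf, inverse=True)]

x = [1.0, 2.0, 3.0, 4.0]
w = [0.0, 1.0, 0.0, 0.0]
print(cyclic_xcorr(x, w))
print(cyclic_xcorr_fourier(x, w))
```

Both calls give the same vector up to rounding, and the wrap-around in the index `(t + i) % n` is exactly the cyclic behaviour noted on the slide.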

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression:

  $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$

  with $K$ the kernel matrix, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and $\boldsymbol{\alpha}$ the dual space representation
• For many kernels, circulant data $\Rightarrow$ circulant $K$ matrix:

  $K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields:

  $\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$

  $\Rightarrow$ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
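As a sanity check of the closed form, the sketch below trains on all cyclic shifts of a short 1D sample with a linear kernel (chosen here only to keep the example small) and verifies that the Fourier-domain solution satisfies (K + λI) α = y without ever inverting K. The naive DFT, the helper names and the toy data are assumptions of this example.

```python
import cmath

def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out

def train(x, y, lam):
    """Return the kernel auto-correlation k and the dual coefficients alpha."""
    n = len(x)
    # linear kernel: k[i] = <x, x shifted cyclically by i>
    k = [sum(x[t] * x[(t + i) % n] for t in range(n)) for i in range(n)]
    alpha_hat = [b / (a + lam) for a, b in zip(dft(k), dft(y))]
    return k, [c.real for c in dft(alpha_hat, inverse=True)]

def max_residual(k, alpha, y, lam):
    """Largest entry of |(K + lam I) alpha - y| with K[i][j] = k[(j - i) mod n]."""
    n = len(k)
    worst = 0.0
    for i in range(n):
        lhs = sum(k[(j - i) % n] * alpha[j] for j in range(n)) + lam * alpha[i]
        worst = max(worst, abs(lhs - y[i]))
    return worst

x = [1.0, 3.0, -2.0, 0.5, 4.0]          # hypothetical training sample
y = [1.0, 0.5, 0.0, 0.0, 0.5]           # regression target, one value per shift
k, alpha = train(x, y, lam=0.1)
print(max_residual(k, alpha, y, lam=0.1))
```

The residual is at rounding level, so the element-wise division in the Fourier domain gives the same α as solving the full n x n system.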

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$ ($P$ is the cyclic shift operator):

  $k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$

• For the Gaussian kernel it yields:

  $\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update the classifier $\boldsymbol{\alpha}$ and the model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features: multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the correlation term
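The training and evaluation steps can be tried end to end on a 1D toy signal. This is a hedged single-channel sketch with a Gaussian kernel and a naive DFT in plain Python; the signal, the label shape and the values sigma = 0.5 and lambda = 1e-4 are assumptions picked for the example.

```python
import cmath
import math

def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out

def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x2 and every cyclic shift of x1."""
    n = len(x1)
    prod = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    c = [v.real for v in dft(prod, inverse=True)]        # cyclic cross-correlation
    energy = sum(a * a for a in x1) + sum(b * b for b in x2)
    return [math.exp(-abs(energy - 2.0 * ci) / (sigma ** 2 * n)) for ci in c]

def train(x, y, sigma, lam):
    kf = dft(kernel_correlation(x, x, sigma))
    return [b / (a + lam) for a, b in zip(kf, dft(y))]   # alpha in the Fourier domain

def detect(alphaf, x, z, sigma):
    kf = dft(kernel_correlation(z, x, sigma))
    return [v.real for v in dft([a * b for a, b in zip(alphaf, kf)], inverse=True)]

n = 16
x = [0.0, 0.125, 0.5, 0.25, -0.125, 0.375, 0.0625, -0.25,
     0.1875, 0.0, 0.3125, -0.0625, 0.125, 0.4375, -0.1875, 0.0625]
y = [math.exp(-min(i, n - i) ** 2 / 2.0) for i in range(n)]   # label peaked at shift 0
alphaf = train(x, y, sigma=0.5, lam=1e-4)

shift = 3
z = [x[(t - shift) % n] for t in range(n)]                    # content moved right by 3
response = detect(alphaf, x, z, sigma=0.5)
peak = max(range(n), key=lambda i: response[i])
print(peak)    # the peak index encodes the cyclic shift: here (n - shift) % n
```

With this argument order the response for the unshifted sample peaks at index 0, and moving the content by `shift` moves the peak cyclically in the opposite direction; a tracker reads the new target position from that peak.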

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)

Training and detection (Matlab):

    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);      % element-wise division
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));          % response map
    end

    function k = kernel_correlation(x1, x2, sigma)
      % sum over the channel dimension (3) in the kernel computation
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end

From KCF to Discriminative CF trackers

Basic

• Henriques et al. - CSK
  - raw grayscale pixel values as features
• Henriques et al. - KCF
  - HoG multi-channel features

Further work

• Danelljan et al. - DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al. - SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows a much larger search region
  - more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? "The zero-order tracker"

Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution

• VOT2013 (ICCV 2013): 51 coauthors, 14 pages
• VOT2014 (ECCV 2014): 57 coauthors, 27 pages
• VOT2015 (ICCV 2015): 128 coauthors, 24 pages, + VOT-TIR paper (69 coauthors)
• VOT2016 (ECCV 2016): 141 coauthors, 44 pages, + VOT-TIR paper (70 coauthors)

[Figure: timeline with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked]

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

          Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013   ranks, A, R        16, s. manual    manual       per frame   27
VOT2014   ranks, A, R, EFO   25, s. manual    manual       per frame   38
VOT2015   EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016   EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR

Class of trackers tested

• Single-object, single-camera
• Short-term
  - trackers performing without re-detection
• Causality
  - the tracker is not allowed to use any future frames
• No prior knowledge about the target
  - only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
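Because the object state is a bounding box, per-frame quality is naturally measured by region overlap. The sketch below computes intersection over union for axis-aligned boxes and derives two simple summary numbers: the mean overlap over frames where the boxes still intersect, and the number of frames with zero overlap. It is a simplified illustration of accuracy- and robustness-style measures, assuming boxes given as (x, y, width, height); it does not reproduce the VOT toolkit, which among other things re-initialises the tracker after a failure.

```python
def overlap(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)

def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over tracked frames, and the count of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, len(overlaps) - len(tracked)

# hypothetical three-frame example
truth = [(0, 0, 2, 2), (1, 0, 2, 2), (2, 0, 2, 2)]
pred = [(0, 0, 2, 2), (1, 0, 2, 2), (10, 10, 2, 2)]
print(accuracy_and_failures(pred, truth))
```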

Construction (1/3): Sequence candidates

• ALOV (315 seq) [Smeulders et al. 2013] + PTR (~50 seq) [Vojir et al. 2013] + OTB (~50 seq) [Wu et al. 2013] + >30 new sequences from the VOT2015 committee = 443 sequences
• Filtered out:
  - grayscale sequences
  - targets of < 400 pixels
  - poorly defined targets
  - artificially created sequences
• Remaining: 356 sequences
• These enter the VOT Automatic Dataset Construction Protocol: cluster + sample

[Figure: examples of a poorly defined target and of an artificially created sequence]

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for the 356 sequences (e.g. blur, camera motion, object motion)
• Feature encoding: 11-dimensional
• Cluster similar sequences into K = 28 groups (automatic selection of K) using Affinity Propagation [Frey, Dueck 2007]

Construction (3/3): Sampling

• Requirement:
  - diverse visual attributes
  - challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
• Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves and AR-raw plot (accuracy axis), with C-COT, TCNN, SSAT and MLDF labelled]

VOT 2016 Tracking speed

• Top performers are the slowest
  - plausible cause: CNN
• Real-time bound: Staple+
  - decent accuracy
  - decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it also measured the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ labelled]

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please?

Tracking is commonly used in practice…

• Tracking has been a popular research topic for decades: see CVPR, ICCV, ECCV, …
• But there is no online course devoted to tracking…
• nor big coverage in computer vision courses
• nor is it well covered in textbooks

Is it clear what tracking is?

Video credit: Helmut Grabner
Slide credit: Jiri Matas

Tracking Formulation - Literature

Surprisingly little is said about tracking in standard textbooks.
Limited to optic flow, plus some basic trackers, e.g. Lucas-Kanade.

Definition (0):
[Forsyth and Ponce, Computer Vision: A Modern Approach, 2003]
"Tracking is the problem of generating an inference about the motion of an object given a sequence of images. Good solutions of this problem have a variety of applications…"

Slide credit: Jiri Matas

Formulation (1): Tracking

Establishing point-to-point correspondences in consecutive frames of an image sequence

Notes:
• The concept of an "object" in the F&P definition disappeared
• If an algorithm correctly established such correspondences, would that be a perfect tracker?
• tracking = motion estimation

Slide credit: Jiri Matas

Tracking is Motion Estimation / Optic Flow

Slide credit: Jiri Matas

Motion "pattern", camera tracking

Dense motion field
http://www.cs.cmu.edu/~saada/Projects/CrowdSegmentation/
http://www.youtube.com/watch?v=ckVQrwYIjAs
Sparse motion field estimate

Motion field

• The motion field is the projection of the 3D scene motion into the image

Slide credits: James Hays, Jason Corso

Optic Flow

Standard formulation:
• At every pixel, a 2D displacement is estimated between consecutive frames

Missing:
• occlusion/disocclusion handling: pixels visible in one image only
  - in the standard formulation, "don't know" is not an answer
• considering the 3D nature of the world
• large displacement handling: only recently addressed (EpicFlow, 2015)

Practical issues hindering progress in optic flow:
• is the ground truth ever known?
  - learning and performance evaluation are problematic (synthetic sequences?)
• requires generic regularization (smoothing)
• failure (assumption validity) is not easy to detect

In certain applications, tracking is motion estimation on a part of the image with specific constraints: augmented reality, sports analysis
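For reference, the standard formulation is usually derived from the brightness-constancy assumption, which the slide does not spell out; the notation below ($u$, $v$ for the per-pixel displacement, $I_x$, $I_y$, $I_t$ for the partial derivatives of the image) is assumed here for illustration.

```latex
% Brightness constancy: a pixel keeps its intensity while moving by (u, v)
I(x + u,\; y + v,\; t + 1) = I(x, y, t)

% First-order Taylor expansion gives the optic flow constraint
I_x\, u + I_y\, v + I_t = 0
```

This is a single equation in two unknowns per pixel, which is why generic regularization (smoothing) is required, and why a contour element alone only constrains the motion along the gradient direction.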

Formulation (1): Tracking, revisited

Establishing point-to-point correspondences in consecutive frames of an image sequence

Consider the Bolt sequence

Definition (2): Tracking

Given an initial estimate of its position, locate X in a sequence of images,
where X may mean:
• A (rectangular) region
• An "interest point" and its neighbourhood
• An "object"

This definition is adopted e.g. in a recent book by Maggio and Cavallaro, Video Tracking, 2011.

Smeulders, T-PAMI'13:
Tracking is the analysis of video sequences for the purpose of establishing the location of the target over a sequence of frames (time), starting from the bounding box given in the first frame.

Formulation (3): Tracking as Segmentation

J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010

Tracking as model-based segmentation

Tracking as segmentation

http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
http://www2.imm.dtu.dk/~aam/tracking/
• heart

A "standard" CV tracking method output

Approximate motion estimation, approximate segmentation.
Neither good optic flow nor precise segmentation is required.

Formulation (4): Tracking

Given an initial estimate of the pose and state of X:
In all images in a sequence (in a causal manner),
1. estimate the pose and state of X
2. (optionally) update the model of X

• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
  - a priori information (appearance, shape, dynamics, …)
  - labeled data ("track this") + unlabeled data = the sequences
• Causal: for estimation at T, use information from time t ≤ T
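The estimate/update loop of this formulation can be sketched with a toy tracker. The example below assumes 1D "frames", a pose consisting of a single position, an SSD template search in a small radius, and a model update by convex combination; the class and parameter names are hypothetical and chosen only for this illustration.

```python
def ssd(a, b):
    """Sum of squared differences between two equally long windows."""
    return sum((u - v) ** 2 for u, v in zip(a, b))

class TemplateTracker:
    def __init__(self, frame, position, size, radius=3, rate=0.1):
        self.position, self.size = position, size
        self.radius, self.rate = radius, rate
        self.model = frame[position:position + size]   # model of X from the first frame

    def estimate(self, frame):
        """Step 1: estimate the pose (here, the position) in the current frame."""
        lo = max(self.position - self.radius, 0)
        hi = min(self.position + self.radius, len(frame) - self.size)
        self.position = min(range(lo, hi + 1),
                            key=lambda p: ssd(self.model, frame[p:p + self.size]))
        return self.position

    def update(self, frame):
        """Step 2 (optional): blend the current window into the model."""
        window = frame[self.position:self.position + self.size]
        self.model = [(1 - self.rate) * m + self.rate * w
                      for m, w in zip(self.model, window)]

def track(frames, position, size):
    tracker = TemplateTracker(frames[0], position, size)
    trajectory = [position]
    for frame in frames[1:]:        # causal: frame t is processed using frames <= t only
        trajectory.append(tracker.estimate(frame))
        tracker.update(frame)
    return trajectory

def make_frame(p, n=20):
    """Hypothetical frame: a small pattern at position p on a zero background."""
    frame = [0.0] * n
    frame[p:p + 3] = [1.0, 5.0, 2.0]
    return frame

frames = [make_frame(p) for p in (4, 5, 7, 8)]
print(track(frames, position=4, size=3))
```

Setting `rate` to 0 gives the "no update" variant, and setting it to 1 replaces the model with the last frame; values in between are the convex combination.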

Tracking for Black Mirror Blocking

Tracking in 6D

Tracking-Learning-Detection (TLD)

A "miracle": Tracking a Transparent Object

Video credit: Helmut Grabner
H. Grabner, H. Bischof, On-line boosting and vision, CVPR 2006

Tracking the "Invisible"

H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010

Other Tracking Problems

… multiple object tracking …

Multi-object Tracking

Other Tracking Problems

Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg

splitting and merging events …

So I want to track

Just link to some computer vision lib:

    import cv2

or

    #include <opencv2/core.hpp>

or

    import org.opencv.core.Core;
    import org.opencv.core.Mat;


What is here in libraries: OpenCV

• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
• and KCF (2012) in opencv_contrib

What is here in libraries: BoofCV

• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)

https://github.com/lessthanoptimal/BoofCV

Computer vision librar

Bad news: OpenCV

http://www.votchallenge.net

Reference implementation

But authors often publish their own implementation on GitHub…

    git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news

Good news? Not so

Good news: easy to contribute to open source.
Just implement some modern tracking algorithm.

So we need to understand how tracking works

Classic KLT tracker

The KLT Tracker

Slide credit: Kris Kitani

KLT Tracker

Slide credit: Tomas Svoboda

Importance in Computer Vision

• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [54]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms

(1) http://www.ces.clemson.edu/~stb/klt/
(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment

Slide credit: Tomas Svoboda

The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]^T. I(x) could also be a small subwindow within an image.
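The alignment goal is commonly written as a least-squares problem over the parameters of a warp. The formulas below follow the usual Lucas-Kanade notation, which is assumed here since the equations of the slides themselves did not survive extraction: $W(\mathbf{x}; \mathbf{p})$ is the warp with parameters $\mathbf{p}$, and $\nabla I$ is the image gradient evaluated at the warped position.

```latex
% Objective: sum of squared differences between warped image and template
\min_{\mathbf{p}} \sum_{\mathbf{x}} \left[ I\big(W(\mathbf{x}; \mathbf{p})\big) - T(\mathbf{x}) \right]^2

% Gauss-Newton update, applied iteratively as p <- p + \Delta p
\Delta\mathbf{p} = H^{-1} \sum_{\mathbf{x}}
  \left[ \nabla I \frac{\partial W}{\partial \mathbf{p}} \right]^{\top}
  \left[ T(\mathbf{x}) - I\big(W(\mathbf{x}; \mathbf{p})\big) \right],
\qquad
H = \sum_{\mathbf{x}}
  \left[ \nabla I \frac{\partial W}{\partial \mathbf{p}} \right]^{\top}
  \left[ \nabla I \frac{\partial W}{\partial \mathbf{p}} \right]
```

For a pure translation the Jacobian of the warp is the identity, and $H$ reduces to the 2x2 matrix of summed gradient products, whose conditioning decides whether a patch can be tracked.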

Original Lucas-Kanade algorithm I

Slide credit: Tomas Svoboda


Is it clear what tracking is

video credit

Helmut

Grabner

10150Slide credit Jiri Matas

Tracking Formulation - Literature

Surprisingly little is said about tracking in standard textbooks. Coverage is limited to optic flow plus some basic trackers, e.g. Lucas-Kanade.

Definition (0) [Forsyth and Ponce, Computer Vision: A Modern Approach, 2003]:

“Tracking is the problem of generating an inference about the motion of an object given a sequence of images. Good solutions of this problem have a variety of applications…”

Slide credit: Jiri Matas

Formulation (1): Tracking

Establishing point-to-point correspondences in consecutive frames of an image sequence.

Notes:
• The concept of an “object” from the F&P definition has disappeared
• If an algorithm correctly established such correspondences, would that be a perfect tracker?
• tracking = motion estimation

Slide credit: Jiri Matas

Tracking is Motion Estimation / Optic Flow

Slide credit: Jiri Matas

Examples (figures not reproduced):
Motion “pattern”, camera tracking
Dense motion field
http://www.cs.cmu.edu/~saada/Projects/CrowdSegmentation
http://www.youtube.com/watch?v=ckVQrwYIjAs
Sparse motion field estimate

Motion field
• The motion field is the projection of the 3D scene motion into the image

Slide credit: James Hays, Jason Corso

Optic Flow

Standard formulation

• At every pixel, a 2D displacement is estimated between consecutive frames

Missing
• occlusion / disocclusion handling: pixels visible in one image only
  – in the standard formulation, “don’t know” is not an answer
• considering the 3D nature of the world
• large displacement handling: only recently addressed (EpicFlow, 2015)

Practical issues hindering progress in optic flow
• is the ground truth ever known?
  – learning and performance evaluation are problematic (synthetic sequences?)
• requires generic regularization (smoothing)
• failure (assumption validity) is not easy to detect

In certain applications, tracking is motion estimation on a part of the image with specific constraints: augmented reality, sports analysis.

Formulation (1), revisited: consider the Bolt sequence. If an algorithm correctly established point-to-point correspondences there, would that be a perfect tracker?

Definition (2): Tracking

Given an initial estimate of its position, locate X in a sequence of images, where X may mean:
• a (rectangular) region
• an “interest point” and its neighbourhood
• an “object”

This definition is adopted e.g. in a recent book by Maggio and Cavallaro, Video Tracking, 2011.

Smeulders, T-PAMI’13: tracking is the analysis of video sequences for the purpose of establishing the location of the target over a sequence of frames (time), starting from the bounding box given in the first frame.

Formulation (3): Tracking as Segmentation

J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010

Tracking as model-based segmentation

Tracking as segmentation
• http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
• http://www2.imm.dtu.dk/~aam/tracking
• heart

A “standard” CV tracking method output

Approximate motion estimation, approximate segmentation. Neither good optic flow nor precise segmentation is required.

Formulation (4): Tracking

Given an initial estimate of the pose and state of X, in all images in a sequence (in a causal manner):
1. estimate the pose and state of X
2. (optionally) update the model of X

• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
  – a priori information (appearance, shape, dynamics, …)
  – labeled data (“track this”) + unlabeled data = the sequences
• Causal: for estimation at T, use information from time t ≤ T
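The two-step loop of Formulation (4) can be sketched as follows. This is only a minimal illustration, not a reference implementation: the function name `track`, the callables `estimate` and `update`, and the toy 1D "frames" are all hypothetical and chosen for this example.

```python
from typing import Callable, List, Optional, Sequence


def track(frames: Sequence[Sequence[float]],
          init_pose: int,
          init_model: float,
          estimate: Callable[[Sequence[float], float, int], int],
          update: Optional[Callable[[float, Sequence[float], int], float]] = None
          ) -> List[int]:
    """Causal tracking loop: each frame is processed using only the
    current frame, the previous pose and the current model."""
    pose, model = init_pose, init_model
    poses = []
    for frame in frames:
        # Step 1: estimate the pose of X in the current frame.
        pose = estimate(frame, model, pose)
        # Step 2 (optional): update the model of X.
        if update is not None:
            model = update(model, frame, pose)
        poses.append(pose)
    return poses


def nearest_match(frame, model, prev_pose):
    """Toy estimator (assumption): pick the index whose value is closest to
    the model; ties are broken by distance to the previous pose."""
    return min(range(len(frame)),
               key=lambda i: (abs(frame[i] - model), abs(i - prev_pose)))


# A bright "target" (value 9) moving one pixel to the right per frame.
toy_frames = [[0, 9, 0, 0], [0, 0, 9, 0], [0, 0, 0, 9]]
toy_poses = track(toy_frames, init_pose=1, init_model=9.0, estimate=nearest_match)
```

Leaving `update` as `None` corresponds to a tracker with a fixed model; passing a function makes the semi-supervised "model update" step explicit.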

Tracking for Black Mirror Blocking

Tracking in 6D

Tracking-Learning-Detection (TLD)

A “miracle”: Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof, On-line boosting and vision, CVPR 2006

Tracking the “Invisible”
H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010

Other Tracking Problems

… multiple object tracking …

Multi-object Tracking

Other Tracking Problems

Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila Melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg
splitting and merging events …

So I want to track?

Just link to some computer vision lib:

from cv2 import tracker

or

#include <opencv2/core.hpp>

or

import org.opencv.core.Core;
import org.opencv.core.Mat;

What is here in libraries? OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib

What is here in libraries? BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)

https://github.com/lessthanoptimal/BoofCV

Computer vision librar

Bad news: OpenCV

http://www.votchallenge.net

Reference implementation

But authors often publish their own implementation on GitHub…

git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news?

Good news? Not so.

Good news: it is easy to contribute to open source. Just implement some modern tracking algorithm.

So we need to understand how tracking works.

Classic: KLT tracker

The KLT Tracker

Slide credit: Kris Kitani

KLT Tracker: Importance in Computer Vision

• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU was dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms

(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html

Slide credit: Tomas Svoboda

Image alignment

The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.

Original Lucas-Kanade algorithm I–IV and summary (slide equations not reproduced)

Slide credit: Tomas Svoboda
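Since the Lucas-Kanade slide equations did not survive extraction, here is a small sketch of the idea for the simplest case: a 1D signal and a pure translation d. It minimizes the sum of squared differences between T(x) and I(x + d) with Gauss-Newton-style updates. The test signal `bump`, the finite-difference gradient and all function names are assumptions made for this illustration; this is not the formulation from the original slides.

```python
import math


def bump(x: float) -> float:
    """Smooth 1D test signal (assumption): a Gaussian bump centred at x = 10."""
    return math.exp(-((x - 10.0) ** 2) / 8.0)


def lucas_kanade_1d(template, image, xs, d0=0.0, iters=10, h=1e-5):
    """Estimate the shift d that minimises sum_x (T(x) - I(x + d))^2.

    template: samples T(x) at positions xs
    image:    callable returning I at any real position
    """
    d = d0
    for _ in range(iters):
        num = 0.0
        den = 0.0
        for x, t in zip(xs, template):
            warped = image(x + d)
            # Image gradient at the warped position (central difference).
            grad = (image(x + d + h) - image(x + d - h)) / (2.0 * h)
            num += grad * (t - warped)
            den += grad * grad
        if den == 0.0:
            break  # no gradient anywhere: nothing to align
        d += num / den  # Gauss-Newton step for a single parameter
    return d


TRUE_SHIFT = 1.3
xs = [float(i) for i in range(21)]
template = [bump(x) for x in xs]


def shifted_image(x: float) -> float:
    # The input image is the template moved right by TRUE_SHIFT.
    return bump(x - TRUE_SHIFT)


estimated_shift = lucas_kanade_1d(template, shifted_image, xs)
```

The denominator `den` is the 1D counterpart of the 2x2 gradient matrix used in the image case; when it is (near) zero the update is ill-conditioned.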

KLT Tracker

For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]

Slide credit: Patrick Perez
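One common way to make "textured enough" concrete is to look at the eigenvalues of the 2x2 matrix of summed gradient products over the patch, and to trust a patch only when the smaller eigenvalue is well above zero. The sketch below is written under that assumption; the helper name `gradient_matrix_eigenvalues` and the three toy patches are hypothetical.

```python
import math


def gradient_matrix_eigenvalues(patch):
    """Return (min, max) eigenvalues of [[sum gx*gx, sum gx*gy],
    [sum gx*gy, sum gy*gy]] computed over the patch interior."""
    rows, cols = len(patch), len(patch[0])
    a = b = c = 0.0
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            gx = (patch[i][j + 1] - patch[i][j - 1]) / 2.0
            gy = (patch[i + 1][j] - patch[i - 1][j]) / 2.0
            a += gx * gx
            b += gx * gy
            c += gy * gy
    # Closed-form eigenvalues of a symmetric 2x2 matrix.
    root = math.sqrt((a - c) ** 2 + 4.0 * b * b)
    return (a + c - root) / 2.0, (a + c + root) / 2.0


uniform = [[1.0] * 5 for _ in range(5)]
edge = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]          # vertical contour
corner = [[1.0 if (i >= 2 and j >= 2) else 0.0 for j in range(5)]
          for i in range(5)]

uniform_eigs = gradient_matrix_eigenvalues(uniform)   # both zero: no information
edge_eigs = gradient_matrix_eigenvalues(edge)         # one zero: aperture problem
corner_eigs = gradient_matrix_eigenvalues(corner)     # both positive: well conditioned
```

For the edge patch only motion across the contour is observable, which is the aperture problem named in the list above.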

Aperture Problem Example

Image from Gary Bradski's slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade
– Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality
– Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
– Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez
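The first option above, killing a tracklet when its best SSD is too large, might look like the following sketch. The threshold `max_ssd` is a hypothetical tuning parameter, and patches are assumed to be flattened into equal-length lists of intensities.

```python
def ssd(template, window):
    """Sum of squared intensity differences between two equal-size patches."""
    if len(template) != len(window):
        raise ValueError("patches must have the same number of pixels")
    return sum((t - w) ** 2 for t, w in zip(template, window))


def keep_tracklet(template, best_window, max_ssd):
    """Keep the tracklet only if the best match is still close to the template."""
    return ssd(template, best_window) <= max_ssd


reference_patch = [1.0, 2.0, 3.0]
good_match = [1.0, 2.0, 3.5]   # SSD = 0.25
bad_match = [1.0, 2.0, 5.0]    # SSD = 4.0
```

The second option from the slide would apply the same test against the initial patch after an affine warp, which is not shown here.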

Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  – fast
  – easily extended to image-to-image transformations with multiple parameters
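The three "model learning" options listed above can be seen as one update rule with a single mixing weight. The sketch below assumes flattened patches; the name `update_template` and the parameter `rate` are illustrative, not taken from the slides.

```python
def update_template(template, observed, rate):
    """Convex combination of the old template and the newly observed patch.

    rate = 0.0     -> no update (keep the initial template)
    rate = 1.0     -> last frame (replace the template entirely)
    0 < rate < 1   -> convex combination of both
    """
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    return [(1.0 - rate) * t + rate * o for t, o in zip(template, observed)]


old_template = [1.0, 2.0]
new_patch = [3.0, 6.0]
no_update = update_template(old_template, new_patch, 0.0)
last_frame = update_template(old_template, new_patch, 1.0)
blended = update_template(old_template, new_patch, 0.5)
```

A small rate adapts slowly but resists drift; a large rate follows appearance changes quickly but can absorb background or occluders into the model.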

What about deep learning or CNN for tracking?

AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016

Recap: CaffeNet architecture

Recent trackers development
https://github.com/foolwood/benchmark_results
(figure labels: CNN, KCF)

Discriminative Tracking (Tracking by Detection)

• t = 0: samples with labels +1 +1 +1 −1 −1 −1 are used to train the classifier
• t > 0: … classify subwindows to find the target

Connection to Correlation

The Convolution Theorem: “Cross-correlation is equivalent to an element-wise product in the Fourier domain.”

$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$

• where
  – $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  – $\times$ is the element-wise product
  – $*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
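The identity on this slide can be checked numerically on a short 1D signal. The sketch below uses a naive O(n²) DFT from the standard library so that it stays self-contained (a real tracker would use an FFT); the function names and the sample vectors are illustrative only.

```python
import cmath


def dft(x):
    """Naive discrete Fourier transform."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * t / n)
                for f in range(n)) / n
            for t in range(n)]


def cyclic_cross_correlation(x, w):
    """y[k] = sum_t x[t] * w[(t + k) mod n]; indices wrap at the edges."""
    n = len(x)
    return [sum(x[t] * w[(t + k) % n] for t in range(n)) for k in range(n)]


x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]
w = [0.0, 1.0, 0.5, 0.0, -2.0, 1.0]

direct = cyclic_cross_correlation(x, w)
# Fourier route: conjugate the spectrum of x, multiply element-wise, invert.
via_fourier = [v.real for v in
               idft([a.conjugate() * b for a, b in zip(dft(x), dft(w))])]
```

Both routes give the same vector up to rounding, and the wrap-around in `cyclic_cross_correlation` is exactly the cyclic behaviour the slide warns about.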

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression, with kernel matrix $K$, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and dual-space representation:

$$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$$

• For many kernels, circulant data ⇒ circulant $K$ matrix:

$$K = C(\mathbf{k}^{\mathbf{xx}})$$

  where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields:

$$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$$

⇒ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:

$$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$$

• For the Gaussian kernel it yields:

$$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^{2}}\left(\|\mathbf{x}\|^{2} + \|\mathbf{x}'\|^{2} - 2\,\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$$

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}}\right)$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features: multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the term $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$

Kernelized Correlation Filters

KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)

Training and detection (Matlab):

    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      % sum(..., 3) adds up the per-channel correlations
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end

The sum over the channel dimension happens inside the kernel computation.
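To see the train/detect pair at work without Matlab, the sketch below ports the same three functions to plain Python for a single-channel 1D signal. It uses a naive DFT instead of an FFT and a toy signal, so it only illustrates the mechanics; the signal `x`, the Gaussian label vector `y`, and the parameter values are assumptions made for this example.

```python
import cmath
import math


def dft(x):
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * t / n)
                for f in range(n)) / n
            for t in range(n)]


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of x1 with all cyclic shifts of x2."""
    n = len(x1)
    cross = [v.real for v in
             idft([a.conjugate() * b for a, b in zip(dft(x1), dft(x2))])]
    energy = sum(v * v for v in x1) + sum(v * v for v in x2)
    return [math.exp(-abs(energy - 2.0 * c) / (sigma ** 2 * n)) for c in cross]


def train(x, y, sigma, lam):
    """Dual coefficients in the Fourier domain: y_hat / (k_hat + lambda)."""
    k_hat = dft(kernel_correlation(x, x, sigma))
    return [yf / (kf + lam) for yf, kf in zip(dft(y), k_hat)]


def detect(alphaf, x, z, sigma):
    """Response of the learned filter on all cyclic shifts of patch z."""
    k_hat = dft(kernel_correlation(z, x, sigma))
    return [v.real for v in idft([a * k for a, k in zip(alphaf, k_hat)])]


n = 16
x = [0.0, 0.0, 1.0, 3.0, 2.0, 0.0, 0.0, 1.0] + [0.0] * 8
# Desired response: a Gaussian peak at zero shift (cyclic distance).
y = [math.exp(-0.5 * min(i, n - i) ** 2) for i in range(n)]
alphaf = train(x, y, sigma=0.5, lam=1e-4)

shift = 5
z = x[shift:] + x[:shift]            # z[i] = x[(i + shift) % n]
response = detect(alphaf, x, z, sigma=0.5)
peak = max(range(n), key=lambda i: response[i])
self_peak = max(range(n), key=lambda i: detect(alphaf, x, x, 0.5)[i])
```

With the conjugate placed on the first argument, as in the slide code, the response peak lands at the index equal to the cyclic rotation applied to build `z`. Other implementations place the conjugate on the model instead, which flips the sign of the recovered shift.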

From KCF to Discriminative CF trackers

Basic
• Henriques et al.: CSK
  – raw grayscale pixel values as features
• Henriques et al.: KCF
  – HoG multi-channel features

Further work
• Danelljan et al.: DSST
  – PCA-HoG + grayscale pixel features
  – filters for translation and for scale (in the scale-space pyramid)
• Li et al.: SAMF
  – HoG, color-naming and grayscale pixel features
  – quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al.: SRDCF
  – spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  – allows a much larger search region to be used
  – more discriminative to background (more training data)

CNN-based Correlation Trackers
• Danelljan et al.: Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  – features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  – for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)
• Nam et al.: MDNet, T-CNN
  – CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: which methods work?

What works? “The zero-order tracker”

Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution

[Figure: timeline marking the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines; axis ticks 1500 and 3000]
• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages, + VOT-TIR paper (69 coauthors)
• ECCV 2016: 141 coauthors, 44 pages, + VOT-TIR paper (70 coauthors)

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

          Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013   ranks, A, R        16, s. manual    manual       per frame   27
VOT2014   ranks, A, R, EFO   25, s. manual    manual       per frame   38
VOT2015   EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016   EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR
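The table lists "A" and "R" without expanding them. Assuming the usual VOT reading (A as accuracy, the average bounding-box overlap while tracking succeeds, and R as robustness, counted through failures), a simplified sketch of both measures is given below. It ignores the re-initialization after failure that the actual VOT protocol uses, and the box format (x, y, width, height) and the function names are choices made for this example.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Average overlap over frames with non-zero overlap, plus the number of
    frames where the overlap is zero (counted here as failures)."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


half_overlap = iou((0, 0, 10, 10), (5, 5, 10, 10))   # 25 / 175
summary = accuracy_and_failures(
    [(0, 0, 10, 10), (50, 50, 5, 5)],
    [(0, 0, 10, 10), (0, 0, 10, 10)],
)
```

Reporting the two numbers separately matters: a tracker can score high overlap on the frames it handles while failing often, or the reverse.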

Class of trackers tested
• Single-object, single-camera
• Short-term
  – trackers performing without re-detection
• Causality
  – the tracker is not allowed to use any future frames
• No prior knowledge about the target
  – only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

Candidate pool (443 sequences):
• ALOV (315 seq.) [Smeulders et al., 2013]
• + PTR (~50 seq.) [Vojir et al., 2013]
• + OTB (~50 seq.) [Wu et al., 2013]
• + >30 new sequences from the VOT2015 committee

Filtered out:
• grayscale sequences
• targets of <400 pixels
• poorly defined targets
• artificially created sequences
(figure examples: a poorly defined target; an artificially created sequence)

356 sequences → VOT Automatic Dataset Construction Protocol (cluster + sample)

Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)

Pipeline (figure): feature encoding (11-dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences

Construction (3/3): Sampling
• Requirement:
  – diverse visual attributes
  – challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation
• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function

Kristan et al., VOT2016 results

Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty: better ground truth
• Each frame manually per-pixel segmented
• Bounding boxes automatically generated from the segmentation

VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF

Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (axis: accuracy) with C-COT, TCNN, SSAT and MLDF marked]

VOT 2016: Tracking speed
• Top performers are the slowest
  – plausible cause: CNN
• Real-time bound: Staple+
  – decent accuracy
  – decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure labels: C-COT, TCNN, SSAT, MLDF, Staple+]

VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers code/binaries
• All fully annotated datasets (2013–2016)
• Documentation, tutorials, forum


Tracking Formulation - Literature

Surprisingly little is said about tracking in standard textbooks

Limited to optic flow plus some basic trackers eg Lucas-Kanade

Definition (0)

[Forsyth and Ponce Computer Vision A modern approach 2003]

ldquoTracking is the problem of generating an inference about the

motion of an object given a sequence of images

Good solutions of this problem have a variety of applicationshelliprdquo

11150Slide credit Jiri Matas

Formulation (1) Tracking

Establishing point-to-point correspondences

in consecutive frames of an image sequence

Notes

bull The concept of an ldquoobjectrdquo in FampP definition disappeared

bull If an algorithm correctly established such correspondences

would that be a perfect tracker

bull tracking = motion estimation

12150Slide credit Jiri Matas

Tracking is Motion Estimation Optic Flow

13150Slide credit Jiri Matas

Tracking is Motion Estimation Optic Flow

14150

Tracking is Motion Estimation Optic Flow

Motion ldquopatternrdquo Camera tracking

Dense motion field

httpwwwcscmuedu~saadaProjectsCrowdSegmentation

httpwwwyoutubecomwatchv=ckVQrwYIjAs

Sparse motion field estimate

15150Slide credit Jiri Matas

Tracking is Motion Estimation Optic Flow

16150

Motion field

bull The motion field is the projection of the 3D scene

motion into the image

Slide credit James Hays

Tracking is Motion Estimation Optic Flow

17150Slide credit Jason Corso

Optic Flow

Standard formulation

bull At every pixel 2D displacement is estimated between consecutive frames

Missing

bull occlusion ndash disocclusion handling pixels visible in one image only

- in the standard formulation ldquodonrsquot knowrdquo is not an answer

bull considering the 3D nature of the world

bull large displacement handling - only recently addressed (EpicFlow 2015)

Practical issues hindering progress in optic flow

bull is the ground truth ever known

- learning and performance evaluation problematic (synthetic sequences )

bull requires generic regularization (smoothing)

bull failure (assumption validity) not easy to detect

In certain applications tracking is motion estimation on a part of the image

with specific constraints augmented reality sports analysis 18150

Formulation (1) Tracking

Establishing point-to-point correspondences

in consecutive frames of an image sequence

Notes

bull The concept of an ldquoobjectrdquo in FampP definition disappeared

bull If an algorithm correctly established such correspondences

would that be a perfect tracker

bull tracking = motion estimation

Consider the Bolt sequence

19150

Definition (2) Tracking

Given an initial estimate of its position

locate X in a sequence of images

Where X may mean

bull A (rectangular) region

bull An ldquointerest pointrdquo and its neighbourhood

bull An ldquoobjectrdquo

This definition is adopted eg in a recent book by

Maggio and Cavallaro Video Tracking 2011

Smeulders T-PAMI13

Tracking is the analysis of video sequences for the

purpose of establishing the location of the target

over a sequence of frames (time) starting from

the bounding box given in the first frame

20150

Formulation (3) Tracking as Segmentation

J Fan et al Closed-Loop Adaptation for Robust Tracking ECCV 2010

21150

Tracking as model-based segmentation

22150

Tracking as segmentation

httpvisionucsdedu~kbransonresearchcvpr2005html

httpwww2immdtudk~aamtracking

bull heart

23150

A ldquostandardrdquo CV tracking method output

24150

Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required

Formulation (4) Tracking

Given an initial estimate of the pose and state of X

In all images in a sequence (in a causal manner)

1 estimate the pose and state of X

2 (optionally) update the model of X

bull Pose any geometric parameter (position scale hellip)

bull State appearance shapesegmentation visibility articulations

bull Model update essentially a semi-supervised learning problem

ndash a priori information (appearance shape dynamics hellip)

ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences

bull Causal for estimation at T use information from time t middot T

25150

Tracking for Black Mirror Blocking

26150

Tracking in 6D

27150

Tracking-Learning-Detection (TLD)

28150

A ldquomiraclerdquo Tracking a Transparent Object

video credit

Helmut

Grabner

H Grabner H Bischof On-line boosting and vision CVPR 2006

29150

Tracking the ldquoInvisiblerdquo

H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010

30150

video

Other Tracking Problems

helliphellip multiple object tracking hellip

32150

Multi-object Tracking

Other Tracking Problems

Cell division

httpwwwyoutubecomwatchv=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster

httpwwwyoutubecomwatchv=YFKA647w4Jg

splitting and merging events hellip

34150

So I want to track

35150

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

What is here in libraries OpenCV

bull KLT tracker (1981)

bull CAMshift and Meanshift (1998)

bull TLD (2011)

bull MedianFLow (2010)

bull Boosting (2006)

bull MIL (2009)

and KCF (2012) in opencv_contrib38150

What is here in libraries BoofCV

bull SparseFlow (KLT tracker) (1991)

bull MeanShift (1998)

bull TLD (2011)

bull KCF (2012)

39150

httpsgithubcomlessthanoptimalBoofCV

Computer vision librar

Bad news OpenCV

40150httpwwwvotchallengenet

Reference implementation

41150

But authors often publish their own implementation on githubhellip

git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit

Good news

42150

Good news Not so

43150

Good news Not so

44150

Good newsEasy to contribute to open source

Just implement some modern tracking algorithm

45150

So we need to understand how tracking works

Classic

KLT tracker

46150

The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

slide creditTomas Svoboda 50150

Original Lucas-Kanade algorithm II

slide creditTomas Svoboda 51150

Original Lucas-Kanade algorithm III

slide creditTomas Svoboda 52150

Original Lucas-Kanade algorithm IV

slide creditTomas Svoboda 53150

Original Lucas-Kanade summary

slide credit Tomas Svoboda 54150

KLT Tracker

For good conditioning patch must be

texturedstructured enough

bull Uniform patch no information

bull Contour element aperture problem (one dimensional

information)

bull Corners blobs and texture best estimate

[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]

slide credit

Patrick Perez

55150

Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

ndash Translation is usually sufficient for small fragments but

bull Perspective transforms and occlusions cause drift and loss

ndash Two complementary options

bull Kill tracklets when minimum SSD too large

bull Compare as well with initial patch under affine transform (warp) assumption

slide credit

Patrick Perez

58150

Characteristics of KLT

bull cost function sum of squared intensity differences

between template and window

bull optimization technique gradient descent

bull model learning no update last frame convex

combination

bull attractive properties

ndashfast

ndasheasily extended to image-to-image transformations with

multiple parameters

59150

What about deep

learning or

CNN for tracking

60150

61150

AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo

Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016

Recap CaffeNet architecture

63

Recent trackers development

74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic
• Henriques et al. – CSK
  – raw grayscale pixel values as features
• Henriques et al. – KCF
  – HoG multi-channel features

Further work
• Danelljan et al. – DSST
  – PCA-HoG + grayscale pixel features
  – filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
  – HoG, color-naming and grayscale pixel features
  – quantize scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers
• Danelljan et al. – SRDCF
  – spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  – allows the use of a much larger search region
  – more discriminative to background (more training data)

CNN-based Correlation Trackers
• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  – features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  – for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)
• Nam et al. – MDNet, T-CNN
  – CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? “The zero-order tracker”

Slide credit: Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution

[Figure: timeline of the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines]
• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• + VOT-TIR papers (69 and 70 coauthors)

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

          Perf. measures     Dataset size      Target box   Property    Trackers tested
VOT2013   ranks, A, R        16, s. manual     manual       per frame   27
VOT2014   ranks, A, R, EFO   25, s. manual     manual       per frame   38
VOT2015   EAO, A, R, EFO     60, fully auto    manual       per frame   62 VOT, 24 VOT-TIR
VOT2016   EAO, A, R, EFO     60, fully auto    auto         per frame   70 VOT, 24 VOT-TIR
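The A and R columns in the table are commonly read as accuracy (average bounding-box overlap) and robustness (number of failures). A rough sketch of both, assuming boxes are given as (x, y, width, height) and a failure is a frame with zero overlap; the function names are hypothetical, and the re-initialisation after a failure used by the actual VOT protocol is left out.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over frames where the target is still covered, and the failure count."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures

# Toy example: one perfect frame, one frame where the tracker lost the target.
acc, fails = accuracy_and_failures(
    [(0, 0, 10, 10), (100, 100, 5, 5)],
    [(0, 0, 10, 10), (0, 0, 10, 10)],
)
```

Reporting the two numbers separately keeps a tracker that is precise but fragile distinguishable from one that is sloppy but rarely fails.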

Class of trackers tested

• Single-object, single-camera
• Short-term
  – Trackers performing without re-detection
• Causality
  – Tracker is not allowed to use any future frames
• No prior knowledge about the target
  – Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

VOT Automatic Dataset Construction Protocol: cluster + sample

Candidate pool: 443 sequences
• ALOV (315 seq.) [Smeulders et al. 2013]
• PTR (~50 seq.) [Vojir et al. 2013]
• OTB (~50 seq.) [Wu et al. 2013]
• >30 new sequences from the VOT2015 committee

Filtered out:
• Grayscale sequences
• Targets of <400 pixels
• Poorly defined targets
• Artificially created sequences
[Example images: a poorly defined target; an artificially created sequence]

Remaining: 356 sequences

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)

Feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences

Construction (3/3): Sampling

• Requirement:
  – Diverse visual attributes
  – Challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function

Kristan et al., VOT2016 results

Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
• Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot, vertical axis: accuracy]

VOT 2016: Tracking speed

• Top performers are the slowest
  – Plausible cause: CNN
• Real-time bound: Staple+
  – Decent accuracy
  – Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it also measured the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure labels: C-COT, TCNN, SSAT, MLDF, Staple+]

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers, code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  – target learning (modelling, “template update”)
  – integration of detection and temporal smoothness assumptions
  – representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)

Area of Expertise: Computer Vision, Image Processing, Pattern Recognition

People
• 12 academics: 3 full, 2 associate, 7 assistant professors
• 15 researchers
• 15 PhD students
• 15 MSc and BSc students

Impact & Excellence
• significant share of funding (>75%) from the EU and private companies
• collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell)
• numerous science prizes
• hundreds of ISI citations to our publications per year
• competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please?

Formulation (1): Tracking

Establishing point-to-point correspondences in consecutive frames of an image sequence

Notes
• The concept of an “object” in the F&P definition disappeared
• If an algorithm correctly established such correspondences, would that be a perfect tracker?
• tracking = motion estimation

Slide credit: Jiri Matas

Tracking is Motion Estimation / Optic Flow

Slide credit: Jiri Matas

[Image panels: motion “pattern”; camera tracking; dense motion field; sparse motion field estimate]
http://www.cs.cmu.edu/~saada/Projects/CrowdSegmentation
http://www.youtube.com/watch?v=ckVQrwYIjAs

Motion field
• The motion field is the projection of the 3D scene motion into the image

Slide credit: James Hays, Jason Corso

Optic Flow

Standard formulation
• At every pixel, a 2D displacement is estimated between consecutive frames

Missing
• occlusion/disocclusion handling: pixels visible in one image only
  – in the standard formulation, “don’t know” is not an answer
• considering the 3D nature of the world
• large displacement handling: only recently addressed (EpicFlow 2015)

Practical issues hindering progress in optic flow
• is the ground truth ever known?
  – learning and performance evaluation problematic (synthetic sequences?)
• requires generic regularization (smoothing)
• failure (assumption validity) not easy to detect

In certain applications, tracking is motion estimation on a part of the image with specific constraints: augmented reality, sports analysis.

Formulation (1): Tracking

Consider the Bolt sequence.

Definition (2): Tracking

Given an initial estimate of its position, locate X in a sequence of images.

Where X may mean:
• A (rectangular) region
• An “interest point” and its neighbourhood
• An “object”

This definition is adopted e.g. in a recent book by Maggio and Cavallaro, Video Tracking, 2011.

Smeulders, T-PAMI'13:
Tracking is the analysis of video sequences for the purpose of establishing the location of the target over a sequence of frames (time), starting from the bounding box given in the first frame.

Formulation (3): Tracking as Segmentation

J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010

Tracking as model-based segmentation

Tracking as segmentation
http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
http://www2.imm.dtu.dk/~aam/tracking/
• heart

A “standard” CV tracking method output

Approximate motion estimation, approximate segmentation. Neither good optic flow nor precise segmentation is required.

Formulation (4): Tracking

Given an initial estimate of the pose and state of X:
In all images in a sequence (in a causal manner)
1. estimate the pose and state of X
2. (optionally) update the model of X

• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
  – a priori information (appearance, shape, dynamics, …)
  – labeled data (“track this”) + unlabeled data = the sequences
• Causal: for estimation at T, use information from time t ≤ T
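The two-step loop of Formulation (4) can be written down as a skeleton. Everything here is a hypothetical illustration: `track`, `estimate` and `update` are made-up names, the "frames" are 1-D lists with one bright pixel, and the pose is just an index.

```python
def track(frames, initial_pose, estimate, update=None):
    """Causal loop: the estimate at frame T only sees frames up to T and the current model."""
    model = {"pose": initial_pose}
    poses = []
    for frame in frames:
        pose = estimate(model, frame)            # step 1: estimate pose (and state) of X
        if update is not None:
            model = update(model, frame, pose)   # step 2 (optional): update the model of X
        poses.append(pose)
    return poses

def estimate(model, frame):
    # Toy estimator: brightest pixel within +/- 2 positions of the last known pose.
    lo = max(0, model["pose"] - 2)
    hi = min(len(frame), model["pose"] + 3)
    return max(range(lo, hi), key=lambda i: frame[i])

def update(model, frame, pose):
    # Toy model update: remember only the latest pose.
    return {"pose": pose}

frames = [
    [0, 9, 0, 0, 0, 0],
    [0, 0, 9, 0, 0, 0],
    [0, 0, 0, 0, 9, 0],
]
poses = track(frames, initial_pose=1, estimate=estimate, update=update)
```

Passing `update=None` gives the "no update" variant, where the model stays fixed at its initial value for the whole sequence.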

Tracking for Black Mirror Blocking

Tracking in 6D

Tracking-Learning-Detection (TLD)

A “miracle”: Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof: On-line Boosting and Vision, CVPR 2006

Tracking the “Invisible”
H. Grabner, J. Matas, L. Van Gool, P. Cattin: Tracking the Invisible: Learning Where the Object Might Be, CVPR 2010
[video]

Other Tracking Problems

… multiple object tracking …

Multi-object Tracking

Other Tracking Problems

Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg

splitting and merging events …

So I want to track

Just link to some computer vision lib

from cv2 import tracker

or

#include <opencv2/core.hpp>

or

import org.opencv.core.Core;
import org.opencv.core.Mat;

What is here in libraries: OpenCV

• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib

What is here in libraries: BoofCV

• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV

Bad news: OpenCV
http://www.votchallenge.net

Reference implementation

But authors often publish their own implementation on GitHub…

git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news

Good news? Not so.

Good news: it is easy to contribute to open source. Just implement some modern tracking algorithm.

So we need to understand how tracking works

Classic: KLT tracker

The KLT Tracker

Slide credit: Kris Kitani, Tomas Svoboda

Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [54]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU dedicated to this single algorithm, and results published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment
Goal is to align a template image T(x) to an input image I(x). x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.

Original Lucas-Kanade algorithm I-IV

Original Lucas-Kanade summary

Slide credit: Tomas Svoboda

KLT Tracker

For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]

Slide credit: Patrick Perez
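One common way to make "textured enough" concrete is the smaller eigenvalue of the 2x2 gradient matrix summed over the patch; the sketch below assumes that criterion and uses three synthetic 9x9 patches. The function name `min_eigenvalue` and the patches are illustrative choices, not taken from the slides.

```python
import numpy as np

def min_eigenvalue(patch):
    """Smaller eigenvalue of the 2x2 matrix of summed gradient products over the patch."""
    gy, gx = np.gradient(patch.astype(float))
    G = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return float(np.linalg.eigvalsh(G)[0])   # eigvalsh returns eigenvalues in ascending order

uniform = np.ones((9, 9))                                     # no gradient at all
edge = np.tile((np.arange(9) >= 4).astype(float), (9, 1))     # vertical step edge
corner = np.zeros((9, 9))
corner[4:, 4:] = 1.0                                          # gradients in both directions

scores = {name: min_eigenvalue(p)
          for name, p in [("uniform", uniform), ("edge", edge), ("corner", corner)]}
```

The uniform patch and the edge both give a smallest eigenvalue of zero, so the displacement cannot be solved for in at least one direction; only the corner is well conditioned.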

Aperture Problem Example

Image from Gary Bradski slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade
– Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality
– Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
– Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update, last frame, or convex combination
• attractive properties:
  – fast
  – easily extended to image-to-image transformations with multiple parameters
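For the simplest case, a pure translation, one linearised least-squares step on the SSD cost can be sketched as follows. The setup is assumed for illustration: a synthetic Gaussian blob serves as the template, the "image" is the same blob moved by a known sub-pixel offset, and a single step is taken without warping or iterating, which a full Lucas-Kanade tracker would do.

```python
import numpy as np

def lucas_kanade_translation_step(template, image):
    """One least-squares step for p = (dx, dy) minimising sum (image(x + p) - template(x))^2."""
    gy, gx = np.gradient(image)          # image gradients along rows (y) and columns (x)
    error = template - image
    G = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    b = np.array([np.sum(gx * error), np.sum(gy * error)])
    return np.linalg.solve(G, b)

def blob(cx, cy, size=41, sigma=5.0):
    """Synthetic smooth test pattern: a Gaussian blob centred at (cx, cy)."""
    ys, xs = np.mgrid[0:size, 0:size].astype(float)
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

template = blob(20.0, 20.0)
image = blob(20.3, 19.8)                 # same blob moved by (+0.3, -0.2) pixels
p = lucas_kanade_translation_step(template, image)
```

Here `p` comes out close to the true offset because the motion is small and the pattern is smooth; for larger motions the step has to be repeated on a re-warped image, typically down an image pyramid.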

What about deep learning or CNNs for tracking?

Recap: CaffeNet architecture

AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016

Recent trackers development
https://github.com/foolwood/benchmark_results
[Figure labels: CNN, KCF]

Discriminative Tracking (Tracking by Detection)

• t = 0: samples with labels (+1, +1, +1, −1, −1, −1) → train the classifier
• t > 0: classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

“Cross-correlation is equivalent to an element-wise product in the Fourier domain”

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$

• where
  – $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  – $\times$ is the element-wise product
  – $*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
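The statement can be checked numerically in a few lines. This sketch assumes the convention that the correlation output at shift s is the sum over n of x[n] times w[n + s], with indices wrapping cyclically; the signal length and random seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x = rng.standard_normal(n)
w = rng.standard_normal(n)

# Direct cyclic cross-correlation: y[s] = sum_n x[n] * w[(n + s) mod N].
direct = np.array([np.sum(x * np.roll(w, -s)) for s in range(n)])

# Same result via the DFT: conjugate one spectrum, multiply element-wise, transform back.
fourier = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))

max_difference = float(np.max(np.abs(direct - fourier)))
```

The direct sum costs on the order of n squared operations, while the Fourier route costs on the order of n log n, which is where the speed of correlation-filter trackers comes from.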

Kernelized Correlation Filters

• Circulant matrices are a very general tool which allows replacing standard operations with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression:

$\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y}$

with $K$ the kernel matrix, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and $\boldsymbol{\alpha}$ the dual-space representation
• For many kernels: circulant data ⇒ circulant $K$ matrix

$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields

$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$

⇒ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
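That the Fourier-domain formula really solves the same problem as the dense kernel ridge regression can be verified on a small 1-D example. The sketch below builds the full kernel matrix over all cyclic shifts of a random signal and compares both solutions; the Gaussian kernel normalised by the signal length, and the values of `n`, `sigma` and `lam`, are assumptions chosen for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma, lam = 16, 1.0, 0.1
x = rng.standard_normal(n)   # base sample; the training set is all of its cyclic shifts
y = rng.standard_normal(n)   # one regression target per shift

def kappa(a, b):
    """Gaussian kernel, normalised by the signal length (an assumed convention)."""
    return np.exp(-np.sum((a - b) ** 2) / (sigma ** 2 * n))

# Dense route: build the full n x n kernel matrix and solve the linear system.
shifts = [np.roll(x, i) for i in range(n)]
K = np.array([[kappa(a, b) for b in shifts] for a in shifts])
alpha_dense = np.linalg.solve(K + lam * np.eye(n), y)

# Fourier route: only the first row of K is needed.
k = K[0]
alpha_fft = np.real(np.fft.ifft(np.fft.fft(y) / (np.fft.fft(k) + lam)))

max_difference = float(np.max(np.abs(alpha_dense - alpha_fft)))
```

The dense route stores and factorises an n x n matrix, while the Fourier route touches only vectors of length n.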


Tracking is Motion Estimation Optic Flow

13150Slide credit Jiri Matas

Tracking is Motion Estimation Optic Flow

14150

Tracking is Motion Estimation Optic Flow

Motion ldquopatternrdquo Camera tracking

Dense motion field

httpwwwcscmuedu~saadaProjectsCrowdSegmentation

httpwwwyoutubecomwatchv=ckVQrwYIjAs

Sparse motion field estimate

15150Slide credit Jiri Matas

Tracking is Motion Estimation Optic Flow

16150

Motion field

bull The motion field is the projection of the 3D scene

motion into the image

Slide credit James Hays

Tracking is Motion Estimation Optic Flow

17150Slide credit Jason Corso

Optic Flow

Standard formulation

bull At every pixel 2D displacement is estimated between consecutive frames

Missing

bull occlusion ndash disocclusion handling pixels visible in one image only

- in the standard formulation ldquodonrsquot knowrdquo is not an answer

bull considering the 3D nature of the world

bull large displacement handling - only recently addressed (EpicFlow 2015)

Practical issues hindering progress in optic flow

bull is the ground truth ever known

- learning and performance evaluation problematic (synthetic sequences )

bull requires generic regularization (smoothing)

bull failure (assumption validity) not easy to detect

In certain applications tracking is motion estimation on a part of the image

with specific constraints augmented reality sports analysis 18150

Formulation (1) Tracking

Establishing point-to-point correspondences

in consecutive frames of an image sequence

Notes

bull The concept of an ldquoobjectrdquo in FampP definition disappeared

bull If an algorithm correctly established such correspondences

would that be a perfect tracker

bull tracking = motion estimation

Consider the Bolt sequence

19150

Definition (2) Tracking

Given an initial estimate of its position

locate X in a sequence of images

Where X may mean

bull A (rectangular) region

bull An ldquointerest pointrdquo and its neighbourhood

bull An ldquoobjectrdquo

This definition is adopted eg in a recent book by

Maggio and Cavallaro Video Tracking 2011

Smeulders T-PAMI13

Tracking is the analysis of video sequences for the

purpose of establishing the location of the target

over a sequence of frames (time) starting from

the bounding box given in the first frame

20150

Formulation (3) Tracking as Segmentation

J Fan et al Closed-Loop Adaptation for Robust Tracking ECCV 2010

21150

Tracking as model-based segmentation

22150

Tracking as segmentation

httpvisionucsdedu~kbransonresearchcvpr2005html

httpwww2immdtudk~aamtracking

bull heart

23150

A ldquostandardrdquo CV tracking method output

24150

Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required

Formulation (4) Tracking

Given an initial estimate of the pose and state of X

In all images in a sequence (in a causal manner)

1 estimate the pose and state of X

2 (optionally) update the model of X

bull Pose any geometric parameter (position scale hellip)

bull State appearance shapesegmentation visibility articulations

bull Model update essentially a semi-supervised learning problem

ndash a priori information (appearance shape dynamics hellip)

ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences

bull Causal for estimation at T use information from time t middot T

25150

Tracking for Black Mirror Blocking

26150

Tracking in 6D

27150

Tracking-Learning-Detection (TLD)

28150

A ldquomiraclerdquo Tracking a Transparent Object

video credit

Helmut

Grabner

H Grabner H Bischof On-line boosting and vision CVPR 2006

29150

Tracking the ldquoInvisiblerdquo

H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010

30150

video

Other Tracking Problems

helliphellip multiple object tracking hellip

32150

Multi-object Tracking

Other Tracking Problems

Cell division

httpwwwyoutubecomwatchv=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster

httpwwwyoutubecomwatchv=YFKA647w4Jg

splitting and merging events hellip

34150

So I want to track

35150

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

What is there in libraries? OpenCV

• KLT tracker (1981)

• CAMshift and Meanshift (1998)

• TLD (2011)

• MedianFlow (2010)

• Boosting (2006)

• MIL (2009)

and KCF (2012) in opencv_contrib

What is there in libraries? BoofCV

• SparseFlow (KLT tracker) (1991)

• MeanShift (1998)

• TLD (2011)

• KCF (2012)

https://github.com/lessthanoptimal/BoofCV

Computer vision librar

Bad news: OpenCV

http://www.votchallenge.net

Reference implementation

But authors often publish their own implementation on GitHub…

git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news?

Good news? Not so

Good news: easy to contribute to open source

Just implement some modern tracking algorithm

So we need to understand how tracking works

Classic: KLT tracker

The KLT Tracker

Slide credit: Kris Kitani

KLT Tracker

Slide credit: Tomas Svoboda

Importance in Computer Vision

• First published in 1981 as an image registration method [3]

• Improved many times, most importantly by Carlo Tomasi [54]

• Free implementation(s) available (1)

• After more than two decades, a project (2) at CMU dedicated to this single algorithm, with results published in a premium journal [1]

• Part of a plethora of computer vision algorithms

(1) http://www.ces.clemson.edu/~stb/klt

(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment

Slide credit: Tomas Svoboda

The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.
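For reference, this alignment goal is usually written as a least-squares problem. The notation below (warp $W(\mathbf{x};\mathbf{p})$ with parameters $\mathbf{p}$, increment $\Delta\mathbf{p}$, matrix $H$) follows the common Lucas-Kanade formulation and is an assumption here, since the derivation slides did not survive in this transcript.

```latex
% Cost: sum of squared differences between the warped image and the template
\min_{\mathbf{p}} \sum_{\mathbf{x}} \bigl[ I(W(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \bigr]^{2}

% Linearise around the current p and solve for the increment (Gauss-Newton step)
\Delta\mathbf{p} = H^{-1} \sum_{\mathbf{x}}
  \Bigl[ \nabla I \,\tfrac{\partial W}{\partial \mathbf{p}} \Bigr]^{\top}
  \bigl[ T(\mathbf{x}) - I(W(\mathbf{x};\mathbf{p})) \bigr],
\qquad
H = \sum_{\mathbf{x}}
  \Bigl[ \nabla I \,\tfrac{\partial W}{\partial \mathbf{p}} \Bigr]^{\top}
  \Bigl[ \nabla I \,\tfrac{\partial W}{\partial \mathbf{p}} \Bigr]

% Update and iterate until the increment is small
\mathbf{p} \leftarrow \mathbf{p} + \Delta\mathbf{p}
```

For a pure translation, $W(\mathbf{x};\mathbf{p}) = \mathbf{x} + \mathbf{p}$ and the Jacobian $\partial W / \partial \mathbf{p}$ is the identity, so $H$ reduces to the 2×2 matrix of summed gradient products.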

Original Lucas-Kanade algorithm I, II, III, IV

Original Lucas-Kanade summary

Slide credit: Tomas Svoboda

KLT Tracker

For good conditioning, the patch must be textured/structured enough:

• Uniform patch: no information

• Contour element: aperture problem (one-dimensional information)

• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]

Slide credit: Patrick Perez
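The three cases can be illustrated by the eigenvalues of the 2×2 matrix of summed gradient products of a patch. The sketch below is an assumption-laden toy: the function name `structure_tensor_eigs`, the central-difference gradients and the 6×6 test patches are choices made for this example only.

```python
import math


def structure_tensor_eigs(patch):
    """Return (lambda_min, lambda_max) of the 2x2 gradient matrix of a patch."""
    h, w = len(patch), len(patch[0])
    gxx = gxy = gyy = 0.0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            ix = (patch[r][c + 1] - patch[r][c - 1]) / 2.0  # central differences
            iy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
    # Closed-form eigenvalues of the symmetric matrix [[gxx, gxy], [gxy, gyy]].
    mean = (gxx + gyy) / 2.0
    dev = math.sqrt(((gxx - gyy) / 2.0) ** 2 + gxy ** 2)
    return mean - dev, mean + dev


uniform = [[5] * 6 for _ in range(6)]                      # no texture at all
edge = [[0, 0, 0, 9, 9, 9] for _ in range(6)]              # vertical contour
corner = [[9 if (r >= 3 and c >= 3) else 0 for c in range(6)] for r in range(6)]

eigs_uniform = structure_tensor_eigs(uniform)   # both eigenvalues zero
eigs_edge = structure_tensor_eigs(edge)         # one zero, one large
eigs_corner = structure_tensor_eigs(corner)     # both clearly positive
```

A patch is well conditioned for translation estimation only when the smaller eigenvalue is well above zero, which is the case for the corner but not for the uniform patch or the edge.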

Aperture Problem Example

Image from Gary Bradski slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

– Arbitrary displacement

• Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

– Translation is usually sufficient for small fragments, but

• Perspective transforms and occlusions cause drift and loss

– Two complementary options

• Kill tracklets when the minimum SSD is too large

• Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez
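A minimal sketch of the two quality checks follows. It simplifies the slide in one important way, which is an assumption of this example: the comparison with the initial patch is done directly, without estimating the affine warp first. The function names and both thresholds are hypothetical.

```python
def ssd(a, b):
    """Sum of squared differences between two equally sized patches (flat lists)."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def keep_tracklet(current, previous, initial, max_ssd_prev, max_ssd_init):
    """Keep a tracklet only if it matches both the last and the initial patch."""
    ok_short_term = ssd(current, previous) <= max_ssd_prev   # frame-to-frame check
    ok_long_term = ssd(current, initial) <= max_ssd_init     # drift check (no warp here)
    return ok_short_term and ok_long_term


initial = [10, 20, 30, 40]
previous = [12, 22, 32, 42]
current = [14, 24, 34, 44]      # small step from the last frame, larger drift overall

strict = keep_tracklet(current, previous, initial, max_ssd_prev=20, max_ssd_init=50)
lenient = keep_tracklet(current, previous, initial, max_ssd_prev=20, max_ssd_init=100)
```

The example shows why both checks are useful: each frame-to-frame step looks fine, yet the accumulated drift relative to the initial patch is only caught by the second test.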

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window

• optimization technique: gradient descent

• model learning: no update / last frame / convex combination

• attractive properties

– fast

– easily extended to image-to-image transformations with multiple parameters
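To make the cost function and its optimisation concrete, here is a one-dimensional, translation-only sketch. It uses a Gauss-Newton step (the usual choice for this sum-of-squares cost) rather than plain gradient descent, and it samples continuous functions instead of interpolating pixels; the names `lucas_kanade_1d` and `bump` and all numeric settings are assumptions of this example.

```python
import math


def lucas_kanade_1d(template, image, xs, p=0.0, iters=10, h=1e-3):
    """Estimate the shift p minimising sum_x (image(x + p) - template(x))**2."""
    for _ in range(iters):
        num = den = 0.0
        for x in xs:
            residual = template(x) - image(x + p)
            # Numerical derivative of the image at the warped position.
            grad = (image(x + p + h) - image(x + p - h)) / (2 * h)
            num += grad * residual
            den += grad * grad
        if den == 0.0:          # uniform patch: no information, stop
            break
        p += num / den          # Gauss-Newton update for a pure translation
    return p


def bump(x, centre=20.0, width=4.0):
    """Smooth 1-D 'blob' used as the tracked pattern."""
    return math.exp(-((x - centre) ** 2) / (2 * width ** 2))


template = bump                          # T(x)
image = lambda x: bump(x - 1.5)          # I(x) = T(x - 1.5): pattern moved right by 1.5
shift = lucas_kanade_1d(template, image, range(40))
```

With real images the samples `image(x + p)` would come from interpolation, and larger motions would need the multi-resolution scheme, since the linearisation only holds for small displacements.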

What about deep learning or CNN for tracking?

AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012

CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014

Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"

Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016

Recap: CaffeNet architecture

Recent trackers development

https://github.com/foolwood/benchmark_results

CNN, KCF

Discriminative Tracking (Tracking by Detection)

t = 0: samples with labels +1 +1 +1 −1 −1 −1 → Classifier

t > 0: … classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

"Cross-correlation is equivalent to an element-wise product in Fourier domain"

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$

• where

– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)

– $\times$ is the element-wise product

– $^{*}$ is the complex conjugate (i.e. negate the imaginary part)

• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
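The statement can be checked numerically on a tiny example. The sketch below uses a naive $O(n^2)$ DFT written with the standard library only, purely for illustration; the helper names and the test vectors are assumptions of this example, and a real tracker would use an FFT routine.

```python
import cmath


def dft(v):
    """Naive O(n^2) discrete Fourier transform (an FFT would be used in practice)."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft(v_hat):
    """Inverse of dft()."""
    n = len(v_hat)
    return [sum(v_hat[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]


def cyclic_cross_correlation(x, w):
    """Direct definition: y[k] = sum_t x[t] * w[(t + k) mod n]."""
    n = len(x)
    return [sum(x[t] * w[(t + k) % n] for t in range(n)) for k in range(n)]


x = [1.0, 2.0, 0.0, -1.0]
w = [0.5, 0.0, 1.0, 3.0]

direct = cyclic_cross_correlation(x, w)
# Fourier route: conjugate of x_hat times w_hat, element-wise, then inverse DFT.
via_dft = [v.real for v in idft([a.conjugate() * b for a, b in zip(dft(x), dft(w))])]
```

Both routes give the same four numbers up to rounding, and the modulo in the direct version makes the cyclic wrap-around explicit.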

Kernelized Correlation Filters

• Circulant matrices are a very general tool which allows replacing standard operations with fast Fourier operations

• The same idea can be applied e.g. to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual space representation

$\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y}$

• For many kernels, circulant data ⇒ circulant K matrix

$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields

$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$

⇒ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
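The equivalence between the dense solution and the Fourier-domain solution can be verified on a small 1-D signal. This sketch assumes a linear kernel, so that the kernel auto-correlation is the plain cyclic auto-correlation of x; the helper names, the data and the naive DFT are all choices made for this example.

```python
import cmath


def dft(v):
    """Naive O(n^2) DFT; an FFT gives the O(n log n) cost quoted on the slide."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft(v_hat):
    n = len(v_hat)
    return [sum(v_hat[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]


def autocorrelation(x):
    """Linear-kernel k^{xx}: k[m] = sum_t x[t] * x[(t + m) mod n]."""
    n = len(x)
    return [sum(x[t] * x[(t + m) % n] for t in range(n)) for m in range(n)]


def train(x, y, lam):
    """alpha_hat = y_hat / (k_hat + lambda), returned in the signal domain."""
    k_hat = dft(autocorrelation(x))
    alpha_hat = [yf / (kf + lam) for yf, kf in zip(dft(y), k_hat)]
    return [a.real for a in idft(alpha_hat)]


x = [1.0, 3.0, 0.0, -2.0, 1.0, 0.5]
y = [1.0, 0.5, 0.0, 0.0, 0.0, 0.5]
lam = 0.1
alpha = train(x, y, lam)

# Dense check: (K + lam * I) alpha should reproduce y, with K[i][j] = k[(j - i) mod n].
k = autocorrelation(x)
n = len(x)
lhs = [sum(k[(j - i) % n] * alpha[j] for j in range(n)) + lam * alpha[i]
       for i in range(n)]
```

The dense system is never formed in a real tracker; it appears here only to confirm that the element-wise division solves the same equations.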

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:

$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$

• For the Gaussian kernel it yields

$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$

• Evaluation on subwindows of image z with classifier $\boldsymbol{\alpha}$ and model x:

1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$

2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$

• Update classifier $\boldsymbol{\alpha}$ and model x by linear interpolation from the location of maximum response f(z)

• Kernel allows integration of more complex and multi-channel features: multiple channels can be concatenated into the vector x and then summed over in the $\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')$ term
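The linear-interpolation update mentioned above amounts to a running average of the old and the newly estimated quantities. In this sketch the function name `interpolate` and the rate of 0.25 are hypothetical; the slide does not state the interpolation factor.

```python
def interpolate(old, new, rate):
    """Element-wise (1 - rate) * old + rate * new."""
    return [(1.0 - rate) * o + rate * n for o, n in zip(old, new)]


model_x = [1.0, 2.0, 3.0]      # current appearance model
new_x = [3.0, 2.0, 1.0]        # patch extracted at the location of maximum response
model_x = interpolate(model_x, new_x, rate=0.25)
```

The same rule would be applied to the classifier coefficients. A small rate makes the model slow to adapt but also slow to drift; a rate of zero corresponds to never updating.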

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters

• code fits on one slide of the presentation

• Uses HoG features (32 channels)

• ~300 FPS

• Open-Source (Matlab/Python/Java/C)

Training and detection (Matlab)

function alphaf = train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);
end

function y = detect(alphaf, x, z, sigma)
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));
end

function k = kernel_correlation(x1, x2, sigma)
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));  % sum over channel dimension in kernel computation
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end
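For readers without Matlab, the three routines above can be mirrored in plain Python for a 1-D, single-channel signal. This is a sketch under stated assumptions, not the authors' implementation: the naive DFT replaces `fft2`, and the test signal, the label shape, `sigma`, `lam` and the shift are made up for the example.

```python
import cmath
import math


def dft(v):
    """Naive O(n^2) DFT standing in for fft2 (illustration only)."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * f * t / n) for t in range(n))
            for f in range(n)]


def idft(v_hat):
    n = len(v_hat)
    return [sum(v_hat[f] * cmath.exp(2j * cmath.pi * f * t / n) for f in range(n)) / n
            for t in range(n)]


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of x1 and x2 over all cyclic shifts."""
    n = len(x1)
    c = [v.real for v in idft([a.conjugate() * b for a, b in zip(dft(x1), dft(x2))])]
    norms = sum(v * v for v in x1) + sum(v * v for v in x2)
    return [math.exp(-abs(norms - 2.0 * ci) / (sigma ** 2 * n)) for ci in c]


def train(x, y, sigma, lam):
    k_hat = dft(kernel_correlation(x, x, sigma))
    return [yf / (kf + lam) for yf, kf in zip(dft(y), k_hat)]   # alpha, Fourier domain


def detect(alpha_hat, x, z, sigma):
    k_hat = dft(kernel_correlation(z, x, sigma))
    return [v.real for v in idft([a * k for a, k in zip(alpha_hat, k_hat)])]


n, shift, sigma, lam = 16, 3, 0.5, 1e-4
x = [math.sin(0.7 * t) + 0.3 * math.cos(2.1 * t) for t in range(n)]   # training patch
y = [math.exp(-min(t, n - t) ** 2 / 2.0) for t in range(n)]           # labels, peak at 0
z = [x[(t - shift) % n] for t in range(n)]                            # target moved by 3

alpha_hat = train(x, y, sigma, lam)
response_same = detect(alpha_hat, x, x, sigma)    # response on the training patch
response_moved = detect(alpha_hat, x, z, sigma)   # response after the target moved
```

On the training patch the response equals the label vector minus `lam` times the coefficients, and moving the target cyclically shifts the whole response by the same amount. With the argument order used here the shift shows up at index `(-shift) mod n`, so the location of the maximum encodes the displacement.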

From KCF to Discriminative CF trackers

Basic

• Henriques et al. – CSK

– raw grayscale pixel values as features

• Henriques et al. – KCF

– HoG multi-channel features

Further work

• Danelljan et al. – DSST

– PCA-HoG + grayscale pixels features

– filters for translation and for scale (in the scale-space pyramid)

• Li et al. – SAMF

– HoG, color-naming and grayscale pixels features

– quantize scale space and normalize each scale to one size by bilinear interpolation → only one filter on normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. – SRDCF

– spatial regularization in the learning process

→ limits boundary effect

→ penalizes filter coefficients depending on their spatial location

– allows using a much larger search region

– more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)

• Ma et al.

– features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layer

– for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al. – MDNet, T-CNN

– CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? "The zero-order tracker"

Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution

• ICCV 2013: 51 coauthors, 14 pages

• ECCV 2014: 57 coauthors, 27 pages

• ICCV 2015: 128 coauthors, 24 pages

• ECCV 2016: 141 coauthors, 44 pages

• plus two VOT-TIR papers (69 and 70 coauthors)

[Figure: timeline with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked; vertical axis ticks at 1500 and 3000]

VOT challenge evolution

• Gradual increase of dataset size

• Gradual refinement of dataset construction

• Gradual refinement of performance measures

• Gradual increase of tested trackers

Challenge | Perf. measures | Dataset size | Target box | Property | Trackers tested
VOT2013 | ranks, A, R | 16, s. manual | manual | per frame | 27
VOT2014 | ranks, A, R, EFO | 25, s. manual | manual | per frame | 38
VOT2015 | EAO, A, R, EFO | 60, fully auto | manual | per frame | 62 VOT, 24 VOT-TIR
VOT2016 | EAO, A, R, EFO | 60, fully auto | auto | per frame | 70 VOT, 24 VOT-TIR
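The A (accuracy) and R (robustness) measures in the table are built on bounding-box overlap and on counting failures. The sketch below is a simplification and an assumption of this example: it treats every zero-overlap frame as a failure and averages overlap over the remaining frames, whereas the actual VOT protocol also re-initialises the tracker after a failure. The function names and boxes are hypothetical.

```python
def overlap(a, b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Average overlap over tracked frames and the number of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
```

Here the second prediction overlaps the target by one third and the third misses it completely, so the sequence scores an average overlap of two thirds with one failure.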

Class of trackers tested

• Single-object, single-camera

• Short-term

– Trackers performing without re-detection

• Causality

– Tracker is not allowed to use any future frames

• No prior knowledge about the target

– Only a single training example – BBox in the first frame

• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

ALOV (315 seq.) [Smeulders et al., 2013] + PTR (~50 seq.) [Vojir et al., 2013] + OTB (~50 seq.) [Wu et al., 2013] + >30 new sequences from the VOT2015 committee = 443 sequences

Filtered out:

• Grayscale sequences

• <400 pixels targets

• Poorly-defined targets

• Artificially created sequences

[Figure: example of a poorly defined target; example of an artificially created sequence]

→ 356 sequences → VOT Automatic Dataset Construction Protocol: cluster + sample

Construction (2/3): Clustering

• Approximately annotate targets

• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)

• Cluster into K = 28 groups (automatic selection of K)

Feature encoding (11 dim.) → Affinity Propagation [Frey, Dueck 2007] → Cluster similar sequences

Construction (3/3): Sampling

• Requirement:

– Diverse visual attributes

– Challenging subset

• Global visual attributes computed

• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers

• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement

1. Segment the target (semi-automatic)

2. Automatically fit a bounding box by optimizing a cost function
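As a rough illustration of step 2, the simplest possible "fit" is the tightest axis-aligned box around the segmented pixels. This is an assumption of the example and not the VOT procedure, whose cost function is not given in this transcript; the function name and the mask are hypothetical.

```python
def bounding_box(mask):
    """Tightest axis-aligned box (x, y, width, height) around the non-zero pixels."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        return None                      # empty segmentation: target not visible
    return (cols[0], rows[0], cols[-1] - cols[0] + 1, rows[-1] - rows[0] + 1)


mask = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
box = bounding_box(mask)
```

A cost-based fit would trade off target pixels left outside the box against background pixels included in it, which matters for thin or articulated objects where the tight box is mostly background.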

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)

• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty – better ground truth

• Each frame manually per-pixel segmented

• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters

• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)

2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)

2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN

• Most accurate: SSAT

• Most robust: C-COT and MLDF

Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (vertical axis: accuracy) with C-COT, TCNN, SSAT and MLDF marked]

VOT 2016: Tracking speed

• Top-performers slowest

• Plausible cause: CNN

• Real-time bound: Staple+

• Decent accuracy

• Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked]

VOT public resources

• Resources publicly available: VOT page

• Raw results of all tested trackers

• Relevant methodology papers

• 2016: submitted trackers code/binaries

• All fully annotated datasets (2013-2016)

• Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems

• The area is just starting to be affected by CNNs

• Robustness at all levels is the road to reliable performance

• Key components of trackers

– target learning (modelling, "template update")

– integration of detection and temporal smoothness assumptions

– representation of the image and target

• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)

• http://cs.brown.edu/courses/cs143

• https://www.udacity.com/course/introduction-to-computer-vision--ud810

• http://cs231n.stanford.edu

• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception

Department of Cybernetics

Faculty of Electrical Engineering

Czech Technical University, Prague (established 1707)

Area of Expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: a significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

Tracking is Motion Estimation / Optic Flow

Motion "pattern", camera tracking

Dense motion field

http://www.cs.cmu.edu/~saada/Projects/CrowdSegmentation

http://www.youtube.com/watch?v=ckVQrwYIjAs

Sparse motion field estimate

Slide credit: Jiri Matas

Motion field

• The motion field is the projection of the 3D scene motion into the image

Slide credit: James Hays

Slide credit: Jason Corso

Optic Flow

Standard formulation

• At every pixel, 2D displacement is estimated between consecutive frames

Missing

• occlusion – disocclusion handling: pixels visible in one image only

- in the standard formulation, "don't know" is not an answer

• considering the 3D nature of the world

• large displacement handling - only recently addressed (EpicFlow 2015)

Practical issues hindering progress in optic flow

• is the ground truth ever known?

- learning and performance evaluation problematic (synthetic sequences?)

• requires generic regularization (smoothing)

• failure (assumption validity) not easy to detect

In certain applications, tracking is motion estimation on a part of the image with specific constraints: augmented reality, sports analysis

Formulation (1): Tracking

Establishing point-to-point correspondences in consecutive frames of an image sequence

Notes

• The concept of an "object" in the F&P definition disappeared

• If an algorithm correctly established such correspondences, would that be a perfect tracker?

• tracking = motion estimation?

Consider the Bolt sequence

Definition (2): Tracking

Given an initial estimate of its position, locate X in a sequence of images

Where X may mean:

• A (rectangular) region

• An "interest point" and its neighbourhood

• An "object"

This definition is adopted e.g. in a recent book by Maggio and Cavallaro, Video Tracking, 2011

Smeulders, T-PAMI'13: "Tracking is the analysis of video sequences for the purpose of establishing the location of the target over a sequence of frames (time) starting from the bounding box given in the first frame"

Formulation (3): Tracking as Segmentation

J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010

Tracking as model-based segmentation

Tracking as segmentation

http://vision.ucsd.edu/~kbranson/research/cvpr2005.html

http://www2.imm.dtu.dk/~aam/tracking

• heart

A "standard" CV tracking method output

Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required

Formulation (4) Tracking

Given an initial estimate of the pose and state of X

In all images in a sequence (in a causal manner)

1 estimate the pose and state of X

2 (optionally) update the model of X

bull Pose any geometric parameter (position scale hellip)

bull State appearance shapesegmentation visibility articulations

bull Model update essentially a semi-supervised learning problem

ndash a priori information (appearance shape dynamics hellip)

ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences

bull Causal for estimation at T use information from time t middot T

25150

Tracking for Black Mirror Blocking

26150

Tracking in 6D

27150

Tracking-Learning-Detection (TLD)

28150

A ldquomiraclerdquo Tracking a Transparent Object

video credit

Helmut

Grabner

H Grabner H Bischof On-line boosting and vision CVPR 2006

29150

Tracking the ldquoInvisiblerdquo

H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010

30150

video

Other Tracking Problems

helliphellip multiple object tracking hellip

32150

Multi-object Tracking

Other Tracking Problems

Cell division

httpwwwyoutubecomwatchv=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster

httpwwwyoutubecomwatchv=YFKA647w4Jg

splitting and merging events hellip

34150

So I want to track

35150

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

What is here in libraries OpenCV

bull KLT tracker (1981)

bull CAMshift and Meanshift (1998)

bull TLD (2011)

bull MedianFLow (2010)

bull Boosting (2006)

bull MIL (2009)

and KCF (2012) in opencv_contrib38150

What is here in libraries BoofCV

bull SparseFlow (KLT tracker) (1991)

bull MeanShift (1998)

bull TLD (2011)

bull KCF (2012)

39150

httpsgithubcomlessthanoptimalBoofCV

Computer vision librar

Bad news OpenCV

40150httpwwwvotchallengenet

Reference implementation

41150

But authors often publish their own implementation on githubhellip

git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit

Good news

42150

Good news Not so

43150

Good news Not so

44150

Good newsEasy to contribute to open source

Just implement some modern tracking algorithm

45150

So we need to understand how tracking works

Classic

KLT tracker

46150

The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

slide creditTomas Svoboda 50150

Original Lucas-Kanade algorithm II

slide creditTomas Svoboda 51150

Original Lucas-Kanade algorithm III

slide creditTomas Svoboda 52150

Original Lucas-Kanade algorithm IV

slide creditTomas Svoboda 53150

Original Lucas-Kanade summary

slide credit Tomas Svoboda 54150

KLT Tracker

For good conditioning patch must be

texturedstructured enough

bull Uniform patch no information

bull Contour element aperture problem (one dimensional

information)

bull Corners blobs and texture best estimate

[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]

slide credit

Patrick Perez

55150

Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

ndash Translation is usually sufficient for small fragments but

bull Perspective transforms and occlusions cause drift and loss

ndash Two complementary options

bull Kill tracklets when minimum SSD too large

bull Compare as well with initial patch under affine transform (warp) assumption

slide credit

Patrick Perez

58150

Characteristics of KLT

bull cost function sum of squared intensity differences

between template and window

bull optimization technique gradient descent

bull model learning no update last frame convex

combination

bull attractive properties

ndashfast

ndasheasily extended to image-to-image transformations with

multiple parameters

59150

What about deep

learning or

CNN for tracking

60150

61150

AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo

Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016

Recap CaffeNet architecture

63

Recent trackers development

74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic

bull Henriques et al ndash CSK

ndash raw grayscale pixel values as features

bull Henriques et al ndash KCF

ndash HoG multi-channel features

Further work

bull Danelljan et al ndash DSST

ndash PCA-HoG + grayscale pixels features

ndash filters for translation and for scale (in the scale-space pyramid)

bull Li et al ndash SAMF

ndash HoG color-naming and grayscale pixels features

ndash quantize scale space and normalize each scale to one size by bilinear

inter rarr only one filter on normalized size

80

Discriminative Correlation Filters Trackers

bull Danelljan et al ndashSRDCF

ndash spatial regularization in the learning process

rarr limits boundary effect

rarr penalize filter coefficients depending on their spatial location

ndash allows to use much larger search region

ndash more discriminative to background (more training data)

CNN-based Correlation Trackers

bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT

2016)

bull Ma et al

ndash features VGG-Net pretrained on ImageNet dataset extracted from

third fourth and fifth convolution layer

ndash for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)

bull Nam et al ndash MDNet T-CNN

ndash CNN classification (3 convolution layers and 2 fully connected layers)

81

Evaluation of Trackers

Tracking Which methods work

83150

Tracking Which methods work

84150

What works ldquoThe zero-order trackerrdquo

85150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

3000

1500

51 Coauthors 14pgs

ICCV2013

57 Coauthors 27pgs

ECCV2014

128 Coauthors 24pgs

ICCV2015

141 Coauthors 44pgs

ECCV2016

+ VOT-TIRpaper

(69 coauth)

+ VOT-TIRpaper

(70 coauth)

VOT2014submission deadline

VOT2015submission deadline

VOT2016submission deadline

VOT2013submission deadline

8639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT challenge evolution

bull Gradual increase of dataset size

bull Gradual refinement of dataset construction

bull Gradual refinement of performance measures

bull Gradual increase of tested trackers

Perf Measures Dataset size Target box Property Trackers tested

VOT2013 ranks A R 16 s manual manual per frame 27

VOT2014 ranks A R EFO 25 s manual manual per frame 38

VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR

VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR

8739

Class of trackers tested

bull Single-object single-camera

bull Short-term

ndashTrackers performing without re-detection

bull Causality

ndashTracker is not allowed to use any future frames

bull No prior knowledge about the target

ndashOnly a single training example ndash BBox in the first frame

bull Object state encoded by a bounding box

88150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (13) Sequence candidates

ALOV (315 seq)[Smeulders et al2013]

Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences

Example Poorly defined target Example Artificially created

356 sequences

PTR (~50 seq)[Vojir et al2013]

+OTB (~50 seq)

[Wu et al2013]

+

gt30 new sequencesfrom VOT2015

committee

+

443sequences

VOT Automatic Dataset Construction Protocol

cluster + sample

8939

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (23) Clustering

bull Approximately annotate targets

bull 11 global attributes estimated

automatically for 356 sequences (eg blur camera motion object motion)

bull Cluster into K = 28 groups (automatic selection of K)

Feature encoding

11 dim

Affinity Propagation [Frey Dueck 2007]

Cluster similar sequences

9039

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (33) Sampling

bull Requirement

bull Diverse visual attributes

bull Challenging subset

bull Global visual attributes computed

bull Tracking difficulty attribute Applied FoT ASMS KCF trackers

bull Developed a sampling strategy that sampled

challenging sequences while keeping the global

attributes diverse

9139

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT201516 dataset 60 sequences

9239

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Object annotation

bull Automatic bounding box placement

1 Segment the target (semi-automatic)

2 Automatically fit a bounding box by optimizing a cost function

9339

Kristan et al VOT2016 results

Sequence ranking

bull Among the most challenging sequences

bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)

9442

Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)

Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT Results Realtime

bull Flow-based Mean Shift-based Correlation filters

bull Engineering use of basic features

2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)

2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)

ASMSBDF

FoT

2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)

PLTFoT

CCMS

9639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Results

bull C-COT slightly ahead of TCNN

bull Most accurate SSAT

bull Most robust C-COT and MLDF

Overlap curves

9742

(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple

C-COT

TCNNSSAT

MLDF

AR-raw plot

Acc

ura

cy

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Tracking speed

bull Top-performers slowest bull Plausible cause CNN

bull Real-time bound Staple+

bull Decent accuracy

bull Decent robustness

Note the speed in some Matlab trackers has been significantly underestimated by the toolkit since it was measuring also the Matlabrestart time The EFOs of Matlab trackers are in fact higher than stated in this figure

9842

C-COT TCNNSSAT MLDF

Staple+

Staple+

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
– target learning (modelling, “template update”)
– integration of detection and temporal smoothness assumptions
– representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University, Prague (established 1707)

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

Tracking is Motion Estimation / Optic Flow

Motion “pattern”; camera tracking

Dense motion field

http://www.cs.cmu.edu/~saada/Projects/CrowdSegmentation

http://www.youtube.com/watch?v=ckVQrwYIjAs

Sparse motion field estimate

Slide credit: Jiri Matas

Tracking is Motion Estimation / Optic Flow

Motion field

• The motion field is the projection of the 3D scene motion into the image

Slide credit: James Hays

Tracking is Motion Estimation / Optic Flow

Slide credit: Jason Corso

Optic Flow

Standard formulation:

• At every pixel, a 2D displacement is estimated between consecutive frames

Missing:

• occlusion – disocclusion handling: pixels visible in one image only
– in the standard formulation, “don’t know” is not an answer
• considering the 3D nature of the world
• large displacement handling – only recently addressed (EpicFlow 2015)

Practical issues hindering progress in optic flow:

• is the ground truth ever known?
– learning and performance evaluation problematic (synthetic sequences?)
• requires generic regularization (smoothing)
• failure (assumption validity) not easy to detect

In certain applications, tracking is motion estimation on a part of the image with specific constraints: augmented reality, sports analysis

Formulation (1): Tracking

Establishing point-to-point correspondences in consecutive frames of an image sequence

Notes:

• The concept of an “object” in the F&P definition disappeared
• If an algorithm correctly established such correspondences, would that be a perfect tracker?
• tracking = motion estimation

Consider the Bolt sequence

Definition (2): Tracking

Given an initial estimate of its position, locate X in a sequence of images

Where X may mean:

• A (rectangular) region
• An “interest point” and its neighbourhood
• An “object”

This definition is adopted e.g. in a recent book by Maggio and Cavallaro, Video Tracking, 2011

Smeulders, T-PAMI13:

Tracking is the analysis of video sequences for the purpose of establishing the location of the target over a sequence of frames (time), starting from the bounding box given in the first frame

Formulation (3): Tracking as Segmentation

J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010

Tracking as model-based segmentation

Tracking as segmentation

http://vision.ucsd.edu/~kbranson/research/cvpr2005.html

http://www2.imm.dtu.dk/~aam/tracking

• heart

A “standard” CV tracking method output

Approximate motion estimation, approximate segmentation. Neither good optic flow nor precise segmentation is required.

Formulation (4): Tracking

Given an initial estimate of the pose and state of X:

In all images in a sequence (in a causal manner)

1. estimate the pose and state of X

2. (optionally) update the model of X

• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
– a priori information (appearance, shape, dynamics, …)
– labeled data (“track this”) + unlabeled data = the sequences
• Causal: for estimation at T, use information from time t ≤ T
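The two-step loop of this formulation can be sketched on toy data. Everything below is an illustrative assumption rather than a method from the slides: frames are 1-D lists, the pose is a single integer position, the model is an intensity template, pose estimation is an exhaustive SSD search around the previous pose, and the model update is a linear interpolation with an assumed learning rate `lr`.

```python
def ssd(a, b):
    """Sum of squared differences between two equally long sequences."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def track(frames, init_pos, size, search=2, lr=0.1):
    """Causal tracking loop on 1-D 'frames' (lists of intensities).

    Only the current frame and the model built from past frames are used,
    so the estimate at time T depends on information from t <= T only.
    """
    # "Track this": the labeled example given in the first frame.
    model = list(frames[0][init_pos:init_pos + size])
    pos, poses = init_pos, [init_pos]
    for frame in frames[1:]:
        # 1. Estimate the pose: best SSD match near the previous pose.
        candidates = [p for p in range(pos - search, pos + search + 1)
                      if 0 <= p <= len(frame) - size]
        pos = min(candidates, key=lambda p: ssd(frame[p:p + size], model))
        poses.append(pos)
        # 2. Optionally update the model from the newly found (unlabeled) window.
        window = frame[pos:pos + size]
        model = [(1 - lr) * m + lr * v for m, v in zip(model, window)]
    return poses
```

Setting `lr=0` gives the "no update" variant; a larger `lr` adapts faster but also lets localization errors leak into the model, which is the drift risk behind treating model update as a semi-supervised learning problem.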

Tracking for Black Mirror Blocking

Tracking in 6D

Tracking-Learning-Detection (TLD)

A “miracle”: Tracking a Transparent Object

Video credit: Helmut Grabner

H. Grabner, H. Bischof, On-line boosting and vision, CVPR 2006

Tracking the “Invisible”

H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010

Other Tracking Problems

… multiple object tracking …

Multi-object Tracking

Other Tracking Problems

Cell division

http://www.youtube.com/watch?v=rgLJrvoX_qo

Three rounds of cell division in Drosophila melanogaster

http://www.youtube.com/watch?v=YFKA647w4Jg

splitting and merging events …

So I want to track

Just link to some computer vision lib

from cv2 import tracker

or

#include <opencv2/core.hpp>

or

import org.opencv.core.Core;

import org.opencv.core.Mat;


What is there in libraries? OpenCV

• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)

and KCF (2012) in opencv_contrib

What is there in libraries? BoofCV

• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)

https://github.com/lessthanoptimal/BoofCV

Bad news: OpenCV

http://www.votchallenge.net

Reference implementation

But authors often publish their own implementation on GitHub…

git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news

Good news? Not so

Good news: easy to contribute to open source

Just implement some modern tracking algorithm

So we need to understand how tracking works

Classic: KLT tracker

The KLT Tracker

Slide credit: Kris Kitani

KLT Tracker

Slide credit: Tomas Svoboda

Importance in Computer Vision

• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [54]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms

(1) http://www.ces.clemson.edu/~stb/klt

(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment

Slide credit: Tomas Svoboda

The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.
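In the usual Lucas-Kanade notation, which is assumed here rather than taken from the slides (a warp $W(\mathbf{x};\mathbf{p})$ with parameters $\mathbf{p}$ and an increment $\Delta\mathbf{p}$), this alignment goal can be written as a least-squares problem solved by repeated Gauss-Newton steps:

```latex
% Alignment objective: sum of squared differences between warped image and template
\min_{\mathbf{p}} \sum_{\mathbf{x}} \bigl[ I(W(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \bigr]^2

% First-order expansion around the current estimate p
I(W(\mathbf{x};\mathbf{p}+\Delta\mathbf{p})) \approx
  I(W(\mathbf{x};\mathbf{p})) + \nabla I \,\frac{\partial W}{\partial \mathbf{p}}\, \Delta\mathbf{p}

% Gauss-Newton update
H = \sum_{\mathbf{x}} \Bigl[\nabla I \,\frac{\partial W}{\partial \mathbf{p}}\Bigr]^{\top}
                      \Bigl[\nabla I \,\frac{\partial W}{\partial \mathbf{p}}\Bigr],
\qquad
\Delta\mathbf{p} = H^{-1} \sum_{\mathbf{x}}
  \Bigl[\nabla I \,\frac{\partial W}{\partial \mathbf{p}}\Bigr]^{\top}
  \bigl[ T(\mathbf{x}) - I(W(\mathbf{x};\mathbf{p})) \bigr],
\qquad
\mathbf{p} \leftarrow \mathbf{p} + \Delta\mathbf{p}
```

For a pure translation, $W(\mathbf{x};\mathbf{p}) = \mathbf{x} + \mathbf{p}$ and $\partial W / \partial \mathbf{p}$ is the identity, so $H$ reduces to the 2×2 matrix of summed gradient products over the patch.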

Original Lucas-Kanade algorithm I

Original Lucas-Kanade algorithm II

Original Lucas-Kanade algorithm III

Original Lucas-Kanade algorithm IV

Original Lucas-Kanade summary

KLT Tracker

For good conditioning, the patch must be textured/structured enough:

• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]

Slide credit: Patrick Perez
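The three cases above can be checked numerically with the 2×2 matrix of summed gradient products over the patch (often called the structure tensor). This is a minimal sketch using NumPy; the function name and the synthetic patches are assumptions made for illustration.

```python
import numpy as np


def structure_tensor_eigenvalues(patch):
    """Eigenvalues (ascending) of the 2x2 matrix of summed gradient products.

    Both eigenvalues near zero  -> uniform patch, no information.
    One eigenvalue near zero    -> edge, aperture problem.
    Both clearly above zero     -> corner/texture, well-conditioned estimate.
    """
    gy, gx = np.gradient(patch.astype(float))  # derivatives along rows, columns
    g = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return np.linalg.eigvalsh(g)


uniform = np.ones((16, 16))
edge = np.zeros((16, 16))
edge[:, 8:] = 1.0            # vertical step edge
corner = np.zeros((16, 16))
corner[8:, 8:] = 1.0         # bright quadrant, i.e. a corner

for name, patch in [("uniform", uniform), ("edge", edge), ("corner", corner)]:
    print(name, structure_tensor_eigenvalues(patch))
```

Thresholding the smaller eigenvalue is the patch-selection idea associated with the Tomasi and Shi reference cited above.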

Aperture Problem Example

Image from Gary Bradski's slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

– Arbitrary displacement
• Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

– Translation is usually sufficient for small fragments, but:
• Perspective transforms and occlusions cause drift and loss

– Two complementary options:
• Kill tracklets when the minimum SSD is too large
• Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
– fast
– easily extended to image-to-image transformations with multiple parameters

What about deep learning or CNN for tracking?

AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012. CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014. Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”

Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016

Recap: CaffeNet architecture

Recent trackers development

https://github.com/foolwood/benchmark_results

CNN, KCF

Discriminative Tracking (T. by Detection)

t = 0

samples

labels: +1 +1 +1 −1 −1 −1

Classifier

t > 0

… Classify subwindows to find target

Connection to Correlation

The Convolution Theorem

“Cross-correlation is equivalent to an element-wise product in Fourier domain”

$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$

• where
– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
– $\times$ is the element-wise product
– $^{*}$ is the complex conjugate (i.e. negate the imaginary part)

• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
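The identity can be verified numerically in a few lines. This is a sketch for 1-D real signals using NumPy; the convention assumed for the cyclic cross-correlation is $y_i = \sum_n x_n\, w_{(n+i) \bmod N}$, which is the one that matches conjugating $\hat{\mathbf{x}}$.

```python
import numpy as np


def circular_xcorr_direct(x, w):
    """Cyclic cross-correlation by definition: y[i] = sum_n x[n] * w[(n + i) mod N]."""
    n = len(x)
    return np.array([sum(x[j] * w[(j + i) % n] for j in range(n))
                     for i in range(n)])


def circular_xcorr_fft(x, w):
    """Same quantity through the DFT: y_hat = conj(x_hat) * w_hat (element-wise)."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))


rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w = rng.standard_normal(8)
print(np.allclose(circular_xcorr_direct(x, w), circular_xcorr_fft(x, w)))
```

The direct version costs $\mathcal{O}(n^2)$ operations and the FFT version $\mathcal{O}(n \log n)$, which is the speed-up that correlation filter trackers rely on.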

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations

• The same idea can be applied, e.g., to Kernel Ridge Regression:

$$\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y}$$

with $K$ the kernel matrix, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and $\boldsymbol{\alpha}$ the dual space representation

• For many kernels, circulant data $\Rightarrow$ circulant $K$ matrix:

$$K = C(\mathbf{k}^{\mathbf{xx}})$$

where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields

$$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$$

Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:

$$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$$

• For the Gaussian kernel it yields

$$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$$

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:

1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$

2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$

• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$

• The kernel allows integration of more complex and multi-channel features: multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the correlation term $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)

Training and detection (Matlab):

    % x: training patch, y: desired response, sigma: kernel width, lambda: regularization
    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    % z: test patch; returns the response map over all cyclic shifts
    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    % Gaussian kernel correlation of x1 and x2 for all cyclic shifts
    function k = kernel_correlation(x1, x2, sigma)
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end

Sum over the channel dimension (third argument of sum) in the kernel computation
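For readers working in Python, the same three functions can be sketched with NumPy. This is an illustrative port under stated assumptions, not a reference implementation: patches are arrays of shape (H, W, C), the real part of the correlation is taken before forming the distance (the Matlab version applies `abs` to a complex value instead), and the helper `gaussian_labels` with its width `s` is a hypothetical way to build the desired response.

```python
import numpy as np


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of two (H, W, C) patches over all cyclic shifts."""
    x1f = np.fft.fft2(x1, axes=(0, 1))
    x2f = np.fft.fft2(x2, axes=(0, 1))
    # Cross-correlation in the Fourier domain, summed over the channel axis.
    c = np.real(np.fft.ifft2(np.sum(np.conj(x1f) * x2f, axis=2)))
    d = np.sum(x1 ** 2) + np.sum(x2 ** 2) - 2.0 * c
    return np.exp(-np.abs(d) / (sigma ** 2 * d.size))


def train(x, y, sigma, lam):
    """Kernel ridge regression solved element-wise in the Fourier domain."""
    k = kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)


def detect(alphaf, x, z, sigma):
    """Response map of the learned classifier on test patch z."""
    k = kernel_correlation(z, x, sigma)
    return np.real(np.fft.ifft2(alphaf * np.fft.fft2(k)))


def gaussian_labels(h, w, s=2.0):
    """Hypothetical desired response: Gaussian peak at (0, 0), wrapping cyclically."""
    ys = np.minimum(np.arange(h), h - np.arange(h))[:, None]
    xs = np.minimum(np.arange(w), w - np.arange(w))[None, :]
    return np.exp(-(ys ** 2 + xs ** 2) / (2.0 * s ** 2))


rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 2))          # toy two-channel "feature" patch
alphaf = train(x, gaussian_labels(32, 32), sigma=0.5, lam=1e-4)
z = np.roll(x, (3, 5), axis=(0, 1))           # cyclically shifted copy of the patch
response = detect(alphaf, x, z, sigma=0.5)
peak = np.unravel_index(np.argmax(response), response.shape)
print(peak)
```

The position of the response peak encodes the cyclic shift between `z` and the model `x`; its sign convention follows from which argument is conjugated in `kernel_correlation`, so a tracker built on this sketch has to map the peak back to a displacement accordingly.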

From KCF to Discriminative CF trackers

Basic

• Henriques et al. – CSK
– raw grayscale pixel values as features
• Henriques et al. – KCF
– HoG multi-channel features

Further work

• Danelljan et al. – DSST
– PCA-HoG + grayscale pixel features
– filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
– HoG, color-naming and grayscale pixel features
– quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. – SRDCF
– spatial regularization in the learning process
→ limits the boundary effect
→ penalizes filter coefficients depending on their spatial location
– allows the use of a much larger search region
– more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
– features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
– for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al. – MDNet, T-CNN
– CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? “The zero-order tracker”

VOT community evolution

[Figure: timeline with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines; axis ticks 1500 and 3000]

• ICCV2013: 51 coauthors, 14 pgs
• ECCV2014: 57 coauthors, 27 pgs
• ICCV2015: 128 coauthors, 24 pgs
• ECCV2016: 141 coauthors, 44 pgs
• + VOT-TIR paper (69 coauthors)
• + VOT-TIR paper (70 coauthors)

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

| | Perf. measures | Dataset size | Target box | Property | Trackers tested |
| VOT2013 | ranks, A, R | 16, s. manual | manual | per frame | 27 |
| VOT2014 | ranks, A, R, EFO | 25, s. manual | manual | per frame | 38 |
| VOT2015 | EAO, A, R, EFO | 60, fully auto | manual | per frame | 62 VOT, 24 VOT-TIR |
| VOT2016 | EAO, A, R, EFO | 60, fully auto | auto | per frame | 70 VOT, 24 VOT-TIR |

Class of trackers tested

• Single-object, single-camera
• Short-term
– Trackers performing without re-detection
• Causality
– Tracker is not allowed to use any future frames
• No prior knowledge about the target
– Only a single training example – BBox in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

ALOV (315 seq.) [Smeulders et al. 2013] + PTR (~50 seq.) [Vojir et al. 2013] + OTB (~50 seq.) [Wu et al. 2013] + >30 new sequences from the VOT2015 committee: 443 sequences

Filtered out:

• Grayscale sequences
• <400 pixels targets
• Poorly-defined targets
• Artificially created sequences

Examples: poorly defined target; artificially created sequence

Remaining: 356 sequences

VOT Automatic Dataset Construction Protocol: cluster + sample

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)

Feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → Cluster similar sequences

Construction (3/3): Sampling

• Requirement:
– Diverse visual attributes
– Challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

bull Automatic bounding box placement

1 Segment the target (semi-automatic)

2 Automatically fit a bounding box by optimizing a cost function

9339

Kristan et al VOT2016 results

Sequence ranking

bull Among the most challenging sequences

bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)

9442

Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)

Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT Results Realtime

bull Flow-based Mean Shift-based Correlation filters

bull Engineering use of basic features

2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)

2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)

ASMSBDF

FoT

2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)

PLTFoT

CCMS

9639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Results

bull C-COT slightly ahead of TCNN

bull Most accurate SSAT

bull Most robust C-COT and MLDF

Overlap curves

9742

(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple

C-COT

TCNNSSAT

MLDF

AR-raw plot

Acc

ura

cy

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Tracking speed

bull Top-performers slowest bull Plausible cause CNN

bull Real-time bound Staple+

bull Decent accuracy

bull Decent robustness

Note the speed in some Matlab trackers has been significantly underestimated by the toolkit since it was measuring also the Matlabrestart time The EFOs of Matlab trackers are in fact higher than stated in this figure

9842

C-COT TCNNSSAT MLDF

Staple+

Staple+

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT public resources

bull Resources publicly available VOT page

bull Raw results of all tested trackers

bull Relevant methodology papers

bull 2016 Submitted trackers codebinaries

bull All fully annotated datasets (2013-2016)

bull Documentation tutorials forum

9939

Summary

bull ldquoVisual Trackingrdquo may refer to quite different problems

bull The area is just starting to be affected by CNNs

bull Robustness at all levels is the road to reliable performance

bull Key components of trackers

ndash target learning (modelling ldquotemplate updaterdquo)

ndash integration of detection and temporal smoothness assumptions

ndash representation of the image and target

bull Be careful when evaluating tracking results

100150

Computer vision courses online

bull httpscwfelcvutczwikicoursesucuws17start UCU Winter

school course

bull httpcsbrowneducoursescs143

bull httpswwwudacitycomcourseintroduction-to-computer-vision-

-ud810

bull httpcs231nstanfordedu

bull httpvisionstanfordeduteachingcs131_fall1415indexhtml

101150

A bit of self-PR

Center for Machine Perception

Department of Cybernetics

Faculty of Electrical Engineering

Czech Technical University Prague

established 1707

Area of Expertise Computer Vision Image Processing Pattern Recognition

People12 Academics 3 full 2 associate 7 assistant professors15 Researchers15 Phd students 15 MSc and BSc students

Impact amp Excellencesignificant of funding (gt75) from EU and private companies collaboration with high-tech companies (Microsoft Google Toyota VW Daimler Samsung Boeing Hitachi Honeywell) numerous science prizes hundreds of ISI citations to our publications per year competitive results in contests

Visual Recognition Group (headJiri Matas)

We are looking for 1-3

good students for PhD

THANK YOU

Questions please

105150

Tracking is Motion Estimation Optic Flow

16150

Motion field

bull The motion field is the projection of the 3D scene

motion into the image

Slide credit James Hays

Tracking is Motion Estimation Optic Flow

17150Slide credit Jason Corso

Optic Flow

Standard formulation

bull At every pixel 2D displacement is estimated between consecutive frames

Missing

bull occlusion ndash disocclusion handling pixels visible in one image only

- in the standard formulation ldquodonrsquot knowrdquo is not an answer

bull considering the 3D nature of the world

bull large displacement handling - only recently addressed (EpicFlow 2015)

Practical issues hindering progress in optic flow

bull is the ground truth ever known

- learning and performance evaluation problematic (synthetic sequences )

bull requires generic regularization (smoothing)

bull failure (assumption validity) not easy to detect

In certain applications tracking is motion estimation on a part of the image

with specific constraints augmented reality sports analysis 18150

Formulation (1) Tracking

Establishing point-to-point correspondences

in consecutive frames of an image sequence

Notes

bull The concept of an ldquoobjectrdquo in FampP definition disappeared

bull If an algorithm correctly established such correspondences

would that be a perfect tracker

bull tracking = motion estimation

Consider the Bolt sequence

19150

Definition (2) Tracking

Given an initial estimate of its position

locate X in a sequence of images

Where X may mean

bull A (rectangular) region

bull An ldquointerest pointrdquo and its neighbourhood

bull An ldquoobjectrdquo

This definition is adopted eg in a recent book by

Maggio and Cavallaro Video Tracking 2011

Smeulders T-PAMI13

Tracking is the analysis of video sequences for the

purpose of establishing the location of the target

over a sequence of frames (time) starting from

the bounding box given in the first frame

20150

Formulation (3) Tracking as Segmentation

J Fan et al Closed-Loop Adaptation for Robust Tracking ECCV 2010

21150

Tracking as model-based segmentation

22150

Tracking as segmentation

httpvisionucsdedu~kbransonresearchcvpr2005html

httpwww2immdtudk~aamtracking

bull heart

23150

A ldquostandardrdquo CV tracking method output

24150

Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required

Formulation (4) Tracking

Given an initial estimate of the pose and state of X

In all images in a sequence (in a causal manner)

1 estimate the pose and state of X

2 (optionally) update the model of X

bull Pose any geometric parameter (position scale hellip)

bull State appearance shapesegmentation visibility articulations

bull Model update essentially a semi-supervised learning problem

ndash a priori information (appearance shape dynamics hellip)

ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences

bull Causal for estimation at T use information from time t middot T

25150

Tracking for Black Mirror Blocking

26150

Tracking in 6D

27150

Tracking-Learning-Detection (TLD)

28150

A ldquomiraclerdquo Tracking a Transparent Object

video credit

Helmut

Grabner

H Grabner H Bischof On-line boosting and vision CVPR 2006

29150

Tracking the ldquoInvisiblerdquo

H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010

30150

video

Other Tracking Problems

helliphellip multiple object tracking hellip

32150

Multi-object Tracking

Other Tracking Problems

Cell division

httpwwwyoutubecomwatchv=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster

httpwwwyoutubecomwatchv=YFKA647w4Jg

splitting and merging events hellip

34150

So I want to track

35150

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

What is here in libraries OpenCV

bull KLT tracker (1981)

bull CAMshift and Meanshift (1998)

bull TLD (2011)

bull MedianFLow (2010)

bull Boosting (2006)

bull MIL (2009)

and KCF (2012) in opencv_contrib38150

What is here in libraries BoofCV

bull SparseFlow (KLT tracker) (1991)

bull MeanShift (1998)

bull TLD (2011)

bull KCF (2012)

39150

httpsgithubcomlessthanoptimalBoofCV

Computer vision librar

Bad news OpenCV

40150httpwwwvotchallengenet

Reference implementation

41150

But authors often publish their own implementation on githubhellip

git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit

Good news

42150

Good news Not so

43150

Good news Not so

44150

Good newsEasy to contribute to open source

Just implement some modern tracking algorithm

45150

So we need to understand how tracking works

Classic

KLT tracker

46150

The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

slide creditTomas Svoboda 50150

Original Lucas-Kanade algorithm II

slide creditTomas Svoboda 51150

Original Lucas-Kanade algorithm III

slide creditTomas Svoboda 52150

Original Lucas-Kanade algorithm IV

slide creditTomas Svoboda 53150

Original Lucas-Kanade summary

slide credit Tomas Svoboda 54150

KLT Tracker

For good conditioning patch must be

texturedstructured enough

bull Uniform patch no information

bull Contour element aperture problem (one dimensional

information)

bull Corners blobs and texture best estimate

[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]

slide credit

Patrick Perez

55150

Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

ndash Translation is usually sufficient for small fragments but

bull Perspective transforms and occlusions cause drift and loss

ndash Two complementary options

bull Kill tracklets when minimum SSD too large

bull Compare as well with initial patch under affine transform (warp) assumption

slide credit

Patrick Perez

58150

Characteristics of KLT

bull cost function sum of squared intensity differences

between template and window

bull optimization technique gradient descent

bull model learning no update last frame convex

combination

bull attractive properties

ndashfast

ndasheasily extended to image-to-image transformations with

multiple parameters

59150

What about deep

learning or

CNN for tracking

60150

61150

AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo

Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016

Recap CaffeNet architecture

63

Recent trackers development

74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic
• Henriques et al. - CSK
  - raw grayscale pixel values as features
• Henriques et al. - KCF
  - HoG multi-channel features

Further work
• Danelljan et al. - DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al. - SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers
• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows the use of a much larger search region
  - more discriminative to background (more training data)

CNN-based Correlation Trackers
• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)
• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? “The zero-order tracker”

Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution

• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• plus two VOT-TIR papers (69 and 70 coauthors)

[Figure: timeline with the VOT2013 to VOT2016 submission deadlines; vertical axis ticks at 1500 and 3000]

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

Challenge | Perf. measures   | Dataset size    | Target box | Property  | Trackers tested
VOT2013   | ranks, A, R      | 16, s. manual   | manual     | per frame | 27
VOT2014   | ranks, A, R, EFO | 25, s. manual   | manual     | per frame | 38
VOT2015   | EAO, A, R, EFO   | 60, fully auto  | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016   | EAO, A, R, EFO   | 60, fully auto  | auto       | per frame | 70 VOT, 24 VOT-TIR
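The A (accuracy) and R (robustness) measures listed above build on bounding-box overlap and on counting tracking failures. The sketch below is a simplified illustration under stated assumptions: boxes are axis-aligned `(x, y, width, height)` tuples, a failure is a frame with zero overlap, and the re-initialization that the VOT protocol performs after a failure is ignored. The function names are hypothetical and are not part of the VOT toolkit.

```python
def overlap(box_a, box_b):
    """Intersection-over-union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Average overlap on tracked frames, and the number of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures

ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
```

Here the second prediction overlaps the ground truth by one third and the third misses it entirely, so the sketch reports one failure and averages the accuracy over the two tracked frames.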

Class of trackers tested

• Single-object, single-camera
• Short-term
  - trackers performing without re-detection
• Causality
  - tracker is not allowed to use any future frames
• No prior knowledge about the target
  - only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

• Candidate pool: ALOV (315 seq.) [Smeulders et al. 2013] + PTR (~50 seq.) [Vojir et al. 2013] + OTB (~50 seq.) [Wu et al. 2013] + >30 new sequences from the VOT2015 committee = 443 sequences
• Filtered out:
  - grayscale sequences
  - targets smaller than 400 pixels
  - poorly defined targets
  - artificially created sequences
• Remaining: 356 sequences
• VOT Automatic Dataset Construction Protocol: cluster + sample

[Figure: examples of a poorly defined target and of an artificially created sequence]

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for the 356 sequences (e.g. blur, camera motion, object motion)
• Feature encoding: 11-dimensional
• Cluster similar sequences into K = 28 groups (automatic selection of K) with Affinity Propagation [Frey, Dueck 2007]

Construction (3/3): Sampling

• Requirements:
  - diverse visual attributes
  - challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function

Kristan et al., VOT2016 results

Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
• Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves and AR-raw plot (accuracy axis), with C-COT, TCNN, SSAT and MLDF marked]

VOT 2016: Tracking speed

• Top performers are the slowest
  - plausible cause: CNN
• Real-time bound: Staple+
  - decent accuracy
  - decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it also measured the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked]

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, “template update”)
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: a significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

Tracking is Motion Estimation / Optic Flow

Slide credit: Jason Corso

Optic Flow

Standard formulation
• At every pixel, a 2D displacement is estimated between consecutive frames

Missing
• occlusion/disocclusion handling: pixels visible in one image only
  - in the standard formulation, “don’t know” is not an answer
• considering the 3D nature of the world
• large displacement handling, only recently addressed (EpicFlow 2015)

Practical issues hindering progress in optic flow
• is the ground truth ever known?
  - learning and performance evaluation are problematic (synthetic sequences?)
• requires generic regularization (smoothing)
• failure (assumption validity) is not easy to detect

In certain applications, tracking is motion estimation on a part of the image with specific constraints: augmented reality, sports analysis.

Formulation (1): Tracking

Establishing point-to-point correspondences in consecutive frames of an image sequence

Notes
• The concept of an “object” in the F&P definition disappeared
• If an algorithm correctly established such correspondences, would that be a perfect tracker?
• tracking = motion estimation

Consider the Bolt sequence

Definition (2): Tracking

Given an initial estimate of its position, locate X in a sequence of images.

Where X may mean:
• A (rectangular) region
• An “interest point” and its neighbourhood
• An “object”

This definition is adopted e.g. in a recent book by Maggio and Cavallaro, Video Tracking, 2011.

Smeulders, T-PAMI13:
Tracking is the analysis of video sequences for the purpose of establishing the location of the target over a sequence of frames (time), starting from the bounding box given in the first frame.

Formulation (3): Tracking as Segmentation

J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010

Tracking as model-based segmentation

Tracking as segmentation
• http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
• http://www2.imm.dtu.dk/~aam/tracking/
• heart

A “standard” CV tracking method output

Approximate motion estimation, approximate segmentation. Neither good optic flow nor precise segmentation is required.

Formulation (4): Tracking

Given an initial estimate of the pose and state of X, in all images in a sequence (in a causal manner):
1. estimate the pose and state of X
2. (optionally) update the model of X

• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
  - a priori information (appearance, shape, dynamics, …)
  - labeled data (“track this”) + unlabeled data = the sequences
• Causal: for estimation at time T, use information from times t ≤ T
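The two-step formulation above maps onto a simple causal loop. The following is a toy sketch, not a real tracker: the "frames" are 1-D lists, the model is just the last known position, and `run_tracker`, `estimate_peak` and `follow` are hypothetical names chosen for the illustration.

```python
def run_tracker(frames, initial_model, estimate, update=None):
    """Causal loop: each pose estimate uses only the current and past frames."""
    model = initial_model
    trajectory = []
    for frame in frames:                  # frames arrive in temporal order
        pose = estimate(model, frame)     # step 1: estimate the pose of X
        trajectory.append(pose)
        if update is not None:            # step 2 (optional): update the model
            model = update(model, frame, pose)
    return trajectory

def estimate_peak(model, frame, radius=1):
    """Toy estimator: brightest pixel within `radius` of the last known position."""
    lo = max(0, model - radius)
    hi = min(len(frame), model + radius + 1)
    return max(range(lo, hi), key=lambda i: frame[i])

def follow(model, frame, pose):
    """Toy model update: remember the latest position."""
    return pose

frames = [
    [0, 0, 9, 0, 0, 0],
    [0, 0, 0, 9, 0, 0],
    [0, 0, 0, 0, 9, 0],
]
trajectory = run_tracker(frames, initial_model=2, estimate=estimate_peak, update=follow)
```

Passing `update=None` keeps the initial model fixed; in this toy setup the target then leaves the search window by the third frame, which is one way to see why the optional model update matters.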

Tracking for Black Mirror Blocking

Tracking in 6D

Tracking-Learning-Detection (TLD)

A “miracle”: Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof, On-line boosting and vision, CVPR 2006

Tracking the “Invisible”
H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010

Other Tracking Problems
… multiple object tracking …

Multi-object Tracking

Other Tracking Problems
Cell division
• http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila melanogaster
• http://www.youtube.com/watch?v=YFKA647w4Jg
splitting and merging events …

So I want to track

Just link to some computer vision lib

    from cv2 import tracker

or

    #include <opencv2/core.hpp>

or

    import org.opencv.core.Core;
    import org.opencv.core.Mat;


What is here in libraries: OpenCV

• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib

What is here in libraries: BoofCV

• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)

https://github.com/lessthanoptimal/BoofCV

Computer vision librar

Bad news: OpenCV

http://www.votchallenge.net

Reference implementation

But authors often publish their own implementation on GitHub…

    git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news

Good news? Not so

Good news: Easy to contribute to open source

Just implement some modern tracking algorithm

So we need to understand how tracking works

Classic: KLT tracker

The KLT Tracker
Slide credit: Kris Kitani

KLT Tracker
Slide credit: Tomas Svoboda

Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms

(1) http://www.ces.clemson.edu/~stb/klt/
(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment
The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.

Original Lucas-Kanade algorithm I

Original Lucas-Kanade algorithm II

Original Lucas-Kanade algorithm III

Original Lucas-Kanade algorithm IV

Original Lucas-Kanade summary

KLT Tracker

For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]

Slide credit: Patrick Perez
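One common way to quantify "textured enough" is the smaller eigenvalue of the 2x2 structure tensor of the patch, in the spirit of the Tomasi and Shi reference above. The sketch below is an illustration under assumptions: it uses central differences for the gradients, unweighted sums over the patch interior, and tiny hand-made 5x5 patches; `min_eigenvalue` is a hypothetical helper name.

```python
import math

def min_eigenvalue(patch):
    """Smaller eigenvalue of the 2x2 structure tensor of a grayscale patch."""
    height, width = len(patch), len(patch[0])
    gxx = gxy = gyy = 0.0
    for r in range(1, height - 1):
        for c in range(1, width - 1):
            gx = (patch[r][c + 1] - patch[r][c - 1]) / 2.0   # horizontal gradient
            gy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0   # vertical gradient
            gxx += gx * gx
            gxy += gx * gy
            gyy += gy * gy
    half_trace = (gxx + gyy) / 2.0
    det = gxx * gyy - gxy * gxy
    return half_trace - math.sqrt(max(0.0, half_trace ** 2 - det))

uniform = [[1.0] * 5 for _ in range(5)]
edge = [[1.0 if c >= 2 else 0.0 for c in range(5)] for r in range(5)]
corner = [[1.0 if r >= 2 and c >= 2 else 0.0 for c in range(5)] for r in range(5)]

scores = {name: min_eigenvalue(p)
          for name, p in (("uniform", uniform), ("edge", edge), ("corner", corner))}
```

The uniform patch and the straight edge both score zero (no information, and information in one dimension only), while the corner patch gets a positive score, matching the three cases on the slide.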

Aperture Problem Example
Image from Gary Bradski slides

If we look through a small hole…
Video by Olha Mishkina

Multi-resolution Lucas-Kanade
- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality
- Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters
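The cost function and the iterative optimization can be illustrated with a one-dimensional, translation-only version of Lucas-Kanade. This is a sketch under assumptions, not the KLT implementation: the "image" is an analytic Gaussian bump (so no interpolation is needed), gradients come from central differences, and the update is a Gauss-Newton step on the sum of squared differences. All names and numbers are hypothetical.

```python
import math

def image(x):
    """Hypothetical smooth 1-D 'image': a Gaussian bump centred at x = 20."""
    return math.exp(-((x - 20.0) ** 2) / (2.0 * 4.0 ** 2))

def lucas_kanade_shift(template, image_fn, positions, p=0.0, iterations=20, eps=1e-3):
    """Find the shift p that minimises sum_x (image_fn(x + p) - template(x))^2."""
    for _ in range(iterations):
        numerator = denominator = 0.0
        for x, t in zip(positions, template):
            warped = image_fn(x + p)
            grad = (image_fn(x + p + eps) - image_fn(x + p - eps)) / (2.0 * eps)
            numerator += grad * (t - warped)
            denominator += grad * grad
        if denominator == 0.0:
            break                      # uniform patch: no information to use
        step = numerator / denominator
        p += step
        if abs(step) < 1e-8:
            break
    return p

positions = list(range(10, 31))
true_shift = 2.5
template = [image(x + true_shift) for x in positions]   # template = shifted image
estimated = lucas_kanade_shift(template, image, positions)
```

The zero denominator branch corresponds to the uniform-patch case from the conditioning slide: without gradient energy there is nothing to align.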

What about deep learning or CNN for tracking?

AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”

Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016

Recap: CaffeNet architecture

Recent trackers development
https://github.com/foolwood/benchmark_results

CNN KCF

Discriminative Tracking (T. by Detection)

• t = 0: samples with labels +1 +1 +1 -1 -1 -1 are used to train a classifier
• t > 0: classify subwindows to find the target

Connection to Correlation

The Convolution Theorem
“Cross-correlation is equivalent to an element-wise product in Fourier domain”

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^* \times \hat{\mathbf{w}}$

• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $^*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
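The identity can be checked numerically on a short signal. The sketch below compares the direct cyclic cross-correlation with the Fourier-domain route; it assumes real-valued inputs and uses a naive O(n^2) DFT helper (`dft`, hypothetical) only to stay dependency-free, whereas a real implementation would call an FFT routine.

```python
import cmath

def dft(values, inverse=False):
    """Naive DFT; stands in for an FFT so the sketch needs no extra packages."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t, v in enumerate(values)) for k in range(n)]
    return [c / n for c in out] if inverse else out

def cyclic_cross_correlation(x, w):
    """Direct definition: y[s] = sum_t x[t] * w[(t + s) mod n]."""
    n = len(x)
    return [sum(x[t] * w[(t + s) % n] for t in range(n)) for s in range(n)]

def cross_correlation_via_dft(x, w):
    """Fourier route: y = F^-1(conj(x_hat) * w_hat), element-wise."""
    y_hat = [xh.conjugate() * wh for xh, wh in zip(dft(x), dft(w))]
    return [c.real for c in dft(y_hat, inverse=True)]

x = [1.0, 2.0, 0.0, -1.0, 3.0]
w = [0.5, -1.0, 2.0, 0.0, 1.0]
direct = cyclic_cross_correlation(x, w)
fourier = cross_correlation_via_dft(x, w)
```

The index arithmetic `(t + s) % n` is the cyclic wrap-around mentioned in the note above; both routes agree up to floating-point error.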

Kernelized Correlation Filters

• Circulant matrices are a very general tool which allows replacing standard operations with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression:

$\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y}$

  with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual space representation
• For many kernels, circulant data ⇒ circulant K matrix:

$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields

$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$

Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
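To see that the Fourier-domain formula really solves the ridge regression system, one can compute the classifier with element-wise division and then check that it satisfies (K + λI)α = y for the circulant K. The sketch assumes a linear kernel, for which the kernel auto-correlation is the plain cyclic auto-correlation of x, and again uses a naive DFT helper; all names and values are illustrative assumptions.

```python
import cmath

def dft(values, inverse=False):
    """Naive O(n^2) DFT; a real implementation would use an FFT here."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t, v in enumerate(values)) for k in range(n)]
    return [c / n for c in out] if inverse else out

def kernel_autocorrelation(x):
    """Linear-kernel auto-correlation: k[s] = sum_t x[t] * x[(t + s) mod n]."""
    n = len(x)
    return [sum(x[t] * x[(t + s) % n] for t in range(n)) for s in range(n)]

def train_fourier(x, y, lam):
    """alpha_hat = y_hat / (k_hat + lambda), then back to the signal domain."""
    k_hat = dft(kernel_autocorrelation(x))
    alpha_hat = [yh / (kh + lam) for yh, kh in zip(dft(y), k_hat)]
    return [c.real for c in dft(alpha_hat, inverse=True)]

def circulant_times(first_row, v):
    """Multiply the circulant matrix K[i][j] = first_row[(j - i) mod n] by v."""
    n = len(v)
    return [sum(first_row[(j - i) % n] * v[j] for j in range(n)) for i in range(n)]

x = [0.2, 1.0, 0.3, -0.5, 0.1, 0.0]
y = [1.0, 0.6, 0.1, 0.0, 0.1, 0.6]
lam = 0.1

alpha = train_fourier(x, y, lam)
k = kernel_autocorrelation(x)
residual = [kv + lam * a - target
            for kv, a, target in zip(circulant_times(k, alpha), alpha, y)]
```

The residual of the linear system is zero up to rounding, without ever forming or inverting the n x n matrix in the training step.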


Optic Flow

Standard formulation

bull At every pixel 2D displacement is estimated between consecutive frames

Missing

bull occlusion ndash disocclusion handling pixels visible in one image only

- in the standard formulation ldquodonrsquot knowrdquo is not an answer

bull considering the 3D nature of the world

bull large displacement handling - only recently addressed (EpicFlow 2015)

Practical issues hindering progress in optic flow

bull is the ground truth ever known

- learning and performance evaluation problematic (synthetic sequences )

bull requires generic regularization (smoothing)

bull failure (assumption validity) not easy to detect

In certain applications tracking is motion estimation on a part of the image

with specific constraints augmented reality sports analysis 18150

Formulation (1) Tracking

Establishing point-to-point correspondences

in consecutive frames of an image sequence

Notes

bull The concept of an ldquoobjectrdquo in FampP definition disappeared

bull If an algorithm correctly established such correspondences

would that be a perfect tracker

bull tracking = motion estimation

Consider the Bolt sequence

19150

Definition (2) Tracking

Given an initial estimate of its position

locate X in a sequence of images

Where X may mean

bull A (rectangular) region

bull An ldquointerest pointrdquo and its neighbourhood

bull An ldquoobjectrdquo

This definition is adopted eg in a recent book by

Maggio and Cavallaro Video Tracking 2011

Smeulders T-PAMI13

Tracking is the analysis of video sequences for the

purpose of establishing the location of the target

over a sequence of frames (time) starting from

the bounding box given in the first frame

20150

Formulation (3) Tracking as Segmentation

J Fan et al Closed-Loop Adaptation for Robust Tracking ECCV 2010

21150

Tracking as model-based segmentation

22150

Tracking as segmentation

httpvisionucsdedu~kbransonresearchcvpr2005html

httpwww2immdtudk~aamtracking

bull heart

23150

A ldquostandardrdquo CV tracking method output

24150

Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required

Formulation (4) Tracking

Given an initial estimate of the pose and state of X

In all images in a sequence (in a causal manner)

1 estimate the pose and state of X

2 (optionally) update the model of X

bull Pose any geometric parameter (position scale hellip)

bull State appearance shapesegmentation visibility articulations

bull Model update essentially a semi-supervised learning problem

ndash a priori information (appearance shape dynamics hellip)

ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences

bull Causal for estimation at T use information from time t middot T

25150

Tracking for Black Mirror Blocking

26150

Tracking in 6D

27150

Tracking-Learning-Detection (TLD)

28150

A ldquomiraclerdquo Tracking a Transparent Object

video credit

Helmut

Grabner

H Grabner H Bischof On-line boosting and vision CVPR 2006

29150

Tracking the ldquoInvisiblerdquo

H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010

30150

video

Other Tracking Problems

helliphellip multiple object tracking hellip

32150

Multi-object Tracking

Other Tracking Problems

Cell division

httpwwwyoutubecomwatchv=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster

httpwwwyoutubecomwatchv=YFKA647w4Jg

splitting and merging events hellip

34150

So I want to track

35150

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

What is here in libraries OpenCV

bull KLT tracker (1981)

bull CAMshift and Meanshift (1998)

bull TLD (2011)

bull MedianFLow (2010)

bull Boosting (2006)

bull MIL (2009)

and KCF (2012) in opencv_contrib38150

What is here in libraries BoofCV

bull SparseFlow (KLT tracker) (1991)

bull MeanShift (1998)

bull TLD (2011)

bull KCF (2012)

39150

httpsgithubcomlessthanoptimalBoofCV

Computer vision librar

Bad news OpenCV

40150httpwwwvotchallengenet

Reference implementation

41150

But authors often publish their own implementation on githubhellip

git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit

Good news

42150

Good news Not so

43150

Good news Not so

44150

Good newsEasy to contribute to open source

Just implement some modern tracking algorithm

45150

So we need to understand how tracking works

Classic

KLT tracker

46150

The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm II

Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm III

Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm IV

Slide credit: Tomas Svoboda

Original Lucas-Kanade summary

Slide credit: Tomas Svoboda
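The Lucas-Kanade slides survive in this transcript only as titles, so the following is a minimal sketch of the idea rather than the code from the slides. It handles the translation-only case in one dimension: each iteration linearizes I(x + p) around the current shift p and solves the least-squares problem for the update Δp = Σ g·e / Σ g², where e = T(x) − I(x + p) and g is the image gradient at the warped position. The names `sample` and `lucas_kanade_1d`, the linear interpolation and the central-difference gradient are all assumptions made for this illustration.

```python
import math


def sample(signal, pos):
    """Linearly interpolate `signal` at a real-valued position (clamped to the ends)."""
    pos = min(max(pos, 0.0), len(signal) - 1.0)
    i = min(int(pos), len(signal) - 2)
    frac = pos - i
    return (1.0 - frac) * signal[i] + frac * signal[i + 1]


def lucas_kanade_1d(template, image, start, p=0.0, iterations=20, tol=1e-4):
    """Estimate the shift p such that image(start + x + p) ~= template(x)."""
    for _ in range(iterations):
        num = den = 0.0
        for x, t in enumerate(template):
            pos = start + x + p
            warped = sample(image, pos)
            # Central-difference gradient of the image at the warped position.
            grad = 0.5 * (sample(image, pos + 1.0) - sample(image, pos - 1.0))
            num += grad * (t - warped)
            den += grad * grad
        if den < 1e-12:
            break  # flat window: the shift cannot be observed
        dp = num / den  # Gauss-Newton update for a single parameter
        p += dp
        if abs(dp) < tol:
            break
    return p
```

With a smooth bump as the image and a template cut out two samples to the right of the starting window, the estimate should settle close to 2. On a constant signal the denominator vanishes and the function returns the initial guess unchanged.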

KLT Tracker

For good conditioning, the patch must be textured/structured enough:

• Uniform patch: no information

• Contour element: aperture problem (one-dimensional information)

• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]

Slide credit: Patrick Perez
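One common way to make "textured enough" measurable is the smaller eigenvalue of the 2×2 structure tensor (the sums of gₓ², gₓg_y and g_y² over the patch), as in the Tomasi and Shi reference. The sketch below is an illustration under assumptions: the helper name `min_eigenvalue`, the central-difference gradients and the three toy patches are not from the slides.

```python
import math


def min_eigenvalue(patch):
    """Smaller eigenvalue of the 2x2 structure tensor of a 2-D patch (list of rows)."""
    gxx = gxy = gyy = 0.0
    for r in range(1, len(patch) - 1):
        for c in range(1, len(patch[0]) - 1):
            gx = 0.5 * (patch[r][c + 1] - patch[r][c - 1])
            gy = 0.5 * (patch[r + 1][c] - patch[r - 1][c])
            gxx += gx * gx
            gxy += gx * gy
            gyy += gy * gy
    trace = gxx + gyy
    det = gxx * gyy - gxy * gxy
    # Closed-form eigenvalues of a symmetric 2x2 matrix; keep the smaller one.
    return 0.5 * (trace - math.sqrt(max(trace * trace - 4.0 * det, 0.0)))


uniform = [[1.0] * 7 for _ in range(7)]
edge = [[1.0 if c >= 3 else 0.0 for c in range(7)] for _ in range(7)]
corner = [[1.0 if (r >= 3 and c >= 3) else 0.0 for c in range(7)] for r in range(7)]
```

The uniform patch and the straight edge both give a smallest eigenvalue of zero (no information, or information in one direction only), while the corner gives a clearly positive value, so only the corner constrains both components of the motion.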

Aperture Problem Example

Image from Gary Bradski's slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

– Arbitrary displacement

• Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

– Translation is usually sufficient for small fragments, but

• Perspective transforms and occlusions cause drift and loss

– Two complementary options

• Kill tracklets when the minimum SSD is too large

• Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez
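The two monitoring options can be sketched as a single check per frame. This is a simplified illustration, not a reference implementation: the function names, the thresholds `step_limit` and `drift_limit`, and the use of a plain (unwarped) comparison against the initial patch instead of an affine warp are all assumptions.

```python
def mean_ssd(patch_a, patch_b):
    """Mean squared intensity difference between two equally sized flat patches."""
    return sum((a - b) ** 2 for a, b in zip(patch_a, patch_b)) / len(patch_a)


def track_is_alive(initial_patch, previous_patch, current_patch,
                   step_limit=0.01, drift_limit=0.04):
    """Return False when the tracklet should be killed."""
    if mean_ssd(previous_patch, current_patch) > step_limit:
        return False  # frame-to-frame match too poor, e.g. an occlusion
    if mean_ssd(initial_patch, current_patch) > drift_limit:
        return False  # slow drift away from the original appearance
    return True
```

The second test is what catches drift: every single step can look fine while the patch has gradually slid off the original target.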

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window

• optimization technique: gradient descent

• model learning: no update / last frame / convex combination

• attractive properties:

– fast

– easily extended to image-to-image transformations with multiple parameters

What about deep learning, or CNNs for tracking?

AlexNet (original): Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, 2012

CaffeNet: Jia et al., “Caffe: Convolutional Architecture for Fast Feature Embedding”, 2014

Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”

Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016

Recap: CaffeNet architecture

Recent trackers development

https://github.com/foolwood/benchmark_results

[Figure labels: CNN, KCF]

Discriminative Tracking (Tracking by Detection)

t = 0: samples with labels +1 +1 +1 −1 −1 −1 → Classifier

t > 0: … classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

“Cross-correlation is equivalent to an element-wise product in the Fourier domain”

$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$

• where

– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)

– $\times$ is the element-wise product

– ${}^{*}$ is the complex conjugate (i.e. negate the imaginary part)

• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
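The identity can be checked numerically on a short signal. The sketch below is only an illustration: it uses a naive O(n²) DFT written out by hand so that it needs nothing beyond the standard library, and the function names are assumptions. A real tracker would use an FFT routine instead.

```python
import cmath


def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse) of a sequence."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t, v in enumerate(signal))
        out.append(total / n if inverse else total)
    return out


def cyclic_cross_correlation(x, w):
    """Direct definition: y[i] = sum_n x[n] * w[(n + i) mod N]."""
    n = len(x)
    return [sum(x[t] * w[(t + i) % n] for t in range(n)) for i in range(n)]


def cross_correlation_dft(x, w):
    """Same result via y_hat = conj(x_hat) * w_hat, for real-valued x and w."""
    x_hat, w_hat = dft(x), dft(w)
    y_hat = [a.conjugate() * b for a, b in zip(x_hat, w_hat)]
    return [v.real for v in dft(y_hat, inverse=True)]
```

Both functions return the same values up to floating-point error; the index `(n + i) mod N` in the direct version is exactly the cyclic wrap-around mentioned in the last bullet.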

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations

• The same idea can be applied e.g. to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and the dual-space representation

$$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$$

• For many kernels, circulant data ⇒ circulant K matrix:

$$K = C(\mathbf{k}^{\mathbf{x}\mathbf{x}})$$

where $\mathbf{k}^{\mathbf{x}\mathbf{x}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields

$$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{x}\mathbf{x}} + \lambda}$$

⇒ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
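To make the Fourier-domain solution concrete, here is a small sketch for the simplest case: a one-dimensional signal and a linear kernel, for which $\hat{\mathbf{k}}^{\mathbf{x}\mathbf{x}} = \hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}$. It is an illustration under assumptions, not the authors' code: the names `train`, `detect` and `gaussian_labels`, the toy signal and the hand-written O(n²) DFT (used only to stay within the standard library) are choices made for this example.

```python
import cmath
import math


def dft(signal, inverse=False):
    """Naive O(n^2) discrete Fourier transform (or its inverse) of a sequence."""
    n = len(signal)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(v * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t, v in enumerate(signal))
        out.append(total / n if inverse else total)
    return out


def gaussian_labels(n, sigma=1.0):
    """Regression target peaked at shift 0, wrapping cyclically."""
    return [math.exp(-0.5 * (min(i, n - i) / sigma) ** 2) for i in range(n)]


def train(x, y, lam):
    """alpha_hat = y_hat / (k_hat_xx + lambda), with the linear kernel."""
    x_hat, y_hat = dft(x), dft(y)
    return [yh / (xh.conjugate() * xh + lam) for xh, yh in zip(x_hat, y_hat)]


def detect(alpha_hat, x, z):
    """Response over all cyclic shifts: F^-1(k_hat_xz * alpha_hat)."""
    x_hat, z_hat = dft(x), dft(z)
    f_hat = [xh.conjugate() * zh * ah
             for xh, zh, ah in zip(x_hat, z_hat, alpha_hat)]
    return [v.real for v in dft(f_hat, inverse=True)]
```

A usage example: train on a signal `x`, then evaluate on a copy of `x` shifted cyclically by three samples. The maximum of the response then sits at index 3, which is how the tracker reads off the displacement of the target.

```python
x = [5.0, 2.0, 1.0, 0.5, 0.0, 0.0, 0.0, 0.0]
alpha_hat = train(x, gaussian_labels(len(x)), lam=1e-2)
z = x[-3:] + x[:-3]  # x shifted cyclically to the right by 3 samples
response = detect(alpha_hat, x, z)
shift = response.index(max(response))
```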

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{x}\mathbf{x}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:

$$k_i^{\mathbf{x}\mathbf{x}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$$

• For the Gaussian kernel it yields

$$\mathbf{k}^{\mathbf{x}\mathbf{x}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$$

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:

1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{x}\mathbf{z}})$

2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{x}\mathbf{z}} \odot \hat{\boldsymbol{\alpha}})$

• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$

• The kernel allows integration of more complex and multi-channel features: multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the term $\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')$
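The update by linear interpolation is a running average of the old and the newly estimated values. A minimal sketch follows; the function name `interpolate` and the parameter name `rate` are assumptions, and the slides do not state which interpolation factor is used.

```python
def interpolate(old, new, rate):
    """Blend a stored model with a new estimate: (1 - rate) * old + rate * new."""
    return [(1.0 - rate) * o + rate * n for o, n in zip(old, new)]
```

The same function can be applied to the model x and to the classifier coefficients (it also works on complex values). A small `rate` makes the tracker adapt slowly and resist occlusions; a large `rate` follows appearance changes quickly but drifts more easily.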

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters

• code fits on one slide of the presentation

• uses HoG features (32 channels)

• ~300 FPS

• open-source (Matlab/Python/Java/C)

Training and detection (Matlab)

function alphaf = train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);  % ridge regression solved per frequency
end

function y = detect(alphaf, x, z, sigma)
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));  % response for every cyclic shift of z
end

function k = kernel_correlation(x1, x2, sigma)
  % Gaussian kernel correlation; sum(..., 3) is the sum over the channel dimension
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end

From KCF to Discriminative CF trackers

Basic

• Henriques et al. – CSK

– raw grayscale pixel values as features

• Henriques et al. – KCF

– HoG multi-channel features

Further work

• Danelljan et al. – DSST

– PCA-HoG + grayscale pixel features

– filters for translation and for scale (in the scale-space pyramid)

• Li et al. – SAMF

– HoG, color-naming and grayscale pixel features

– quantize scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. – SRDCF

– spatial regularization in the learning process

→ limits the boundary effect

→ penalizes filter coefficients depending on their spatial location

– allows the use of a much larger search region

– more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)

• Ma et al.

– features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers

– for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al. – MDNet, T-CNN

– CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? “The zero-order tracker”

Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution

• VOT2013 (ICCV 2013): 51 coauthors, 14 pages

• VOT2014 (ECCV 2014): 57 coauthors, 27 pages

• VOT2015 (ICCV 2015): 128 coauthors, 24 pages

• VOT2016 (ECCV 2016): 141 coauthors, 44 pages

• + VOT-TIR papers (69 and 70 coauthors)

[Figure: timeline plot with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked]

VOT challenge evolution

• Gradual increase of dataset size

• Gradual refinement of dataset construction

• Gradual refinement of performance measures

• Gradual increase of tested trackers

Challenge | Perf. measures | Dataset size | Target box | Property | Trackers tested

VOT2013 | ranks, A, R | 16, s manual | manual | per frame | 27

VOT2014 | ranks, A, R, EFO | 25, s manual | manual | per frame | 38

VOT2015 | EAO, A, R, EFO | 60, fully auto | manual | per frame | 62 VOT, 24 VOT-TIR

VOT2016 | EAO, A, R, EFO | 60, fully auto | auto | per frame | 70 VOT, 24 VOT-TIR
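The measures A (accuracy) and R (robustness) in the table are built on the overlap between the predicted and the ground-truth bounding box. The sketch below shows that building block in a simplified form and is not the VOT toolkit: the function names and the (x, y, width, height) box layout are assumptions, and the re-initialization of the tracker after a failure, which the actual protocol performs, is left out.

```python
def overlap(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Average overlap on frames with non-zero overlap, and the zero-overlap count."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures
```

Reporting the two numbers separately matters: a tracker can score a high average overlap on the frames where it is still on target while losing the target often.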

Class of trackers tested

• Single-object, single-camera

• Short-term

– Trackers performing without re-detection

• Causality

– The tracker is not allowed to use any future frames

• No prior knowledge about the target

– Only a single training example: the bounding box in the first frame

• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

Sources: ALOV (315 seq.) [Smeulders et al. 2013] + PTR (~50 seq.) [Vojir et al. 2013] + OTB (~50 seq.) [Wu et al. 2013] + >30 new sequences from the VOT2015 committee = 443 sequences

Filtered out:

• Grayscale sequences

• Targets of <400 pixels

• Poorly-defined targets

• Artificially created sequences

[Figure: example of a poorly defined target; example of an artificially created sequence]

Result: 356 sequences, passed to the VOT Automatic Dataset Construction Protocol (cluster + sample)

Construction (2/3): Clustering

• Approximately annotate targets

• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)

• Cluster into K = 28 groups (automatic selection of K)

Pipeline: feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences

Construction (3/3): Sampling

• Requirement:

– Diverse visual attributes

– Challenging subset

• Global visual attributes: computed

• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers

• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement

1. Segment the target (semi-automatic)

2. Automatically fit a bounding box by optimizing a cost function

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)

• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty – better ground truth

• Each frame manually per-pixel segmented

• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters

• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)

2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)

2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN

• Most accurate: SSAT

• Most robust: C-COT and MLDF

Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (accuracy axis) with C-COT, TCNN, SSAT and MLDF labelled]

VOT 2016 Tracking speed

• Top performers are the slowest

– Plausible cause: CNN

• Real-time bound: Staple+

– Decent accuracy

– Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it also measured the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure labels: C-COT, TCNN, SSAT, MLDF, Staple+]

VOT public resources

• Resources publicly available: VOT page

• Raw results of all tested trackers

• Relevant methodology papers

• 2016: submitted trackers' code/binaries

• All fully annotated datasets (2013-2016)

• Documentation, tutorials, forum

Summary

• “Visual Tracking” may refer to quite different problems

• The area is just starting to be affected by CNNs

• Robustness at all levels is the road to reliable performance

• Key components of trackers:

– target learning (modelling, “template update”)

– integration of detection and temporal smoothness assumptions

– representation of the image and target

• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)

• http://cs.brown.edu/courses/cs143

• https://www.udacity.com/course/introduction-to-computer-vision--ud810

• http://cs231n.stanford.edu

• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception

Department of Cybernetics

Faculty of Electrical Engineering

Czech Technical University, Prague (established 1707)

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: a significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

Formulation (1): Tracking

Establishing point-to-point correspondences in consecutive frames of an image sequence

Notes:

• The concept of an “object” in the F&P definition disappeared

• If an algorithm correctly established such correspondences, would that be a perfect tracker?

• tracking = motion estimation

Consider the Bolt sequence

Definition (2): Tracking

Given an initial estimate of its position, locate X in a sequence of images

Where X may mean:

• A (rectangular) region

• An “interest point” and its neighbourhood

• An “object”

This definition is adopted e.g. in a recent book by Maggio and Cavallaro, Video Tracking, 2011

Smeulders, T-PAMI’13:

Tracking is the analysis of video sequences for the purpose of establishing the location of the target over a sequence of frames (time), starting from the bounding box given in the first frame

Formulation (3): Tracking as Segmentation

J. Fan et al., Closed-Loop Adaptation for Robust Tracking, ECCV 2010

Tracking as model-based segmentation

Tracking as segmentation

http://vision.ucsd.edu/~kbranson/research/cvpr2005.html

http://www2.imm.dtu.dk/~aam/tracking

• heart

A “standard” CV tracking method output

Approximate motion estimation, approximate segmentation. Neither good optic flow nor precise segmentation is required.

Formulation (4): Tracking

Given an initial estimate of the pose and state of X:

In all images in a sequence (in a causal manner)

1. estimate the pose and state of X

2. (optionally) update the model of X

• Pose: any geometric parameter (position, scale, …)

• State: appearance, shape/segmentation, visibility, articulations

• Model update: essentially a semi-supervised learning problem

– a priori information (appearance, shape, dynamics, …)

– labeled data (“track this”) + unlabeled data = the sequences

• Causal: for estimation at T, use information from time t ≤ T
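The two steps of this formulation map directly onto a loop over frames. The sketch below is a toy illustration under assumptions: frames are one-dimensional lists, the pose is a single integer position, the model is a short template, and `track`, `estimate` and `update` are hypothetical names with a simple SSD search and a convex-combination update standing in for a real tracker.

```python
def estimate(frame, pose, model, radius=2):
    """Toy pose estimate: best SSD match of the template near the previous pose."""
    first = max(0, pose - radius)
    last = min(len(frame) - len(model), pose + radius)
    return min(range(first, last + 1),
               key=lambda s: sum((frame[s + i] - m) ** 2
                                 for i, m in enumerate(model)))


def update(model, frame, pose, rate=0.1):
    """Toy model update: convex combination of old template and new observation."""
    return [(1 - rate) * m + rate * frame[pose + i] for i, m in enumerate(model)]


def track(frames, initial_pose, model):
    """Causal loop: the estimate for frame t uses only frames 0..t."""
    pose, poses = initial_pose, []
    for frame in frames:
        pose = estimate(frame, pose, model)  # step 1: pose and state of X
        model = update(model, frame, pose)   # step 2: optional model update
        poses.append(pose)
    return poses
```

Because the loop never looks ahead in `frames`, it satisfies the causality requirement by construction; an offline method that smooths over future frames would not fit this signature.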

Tracking for Black Mirror Blocking

Tracking in 6D

Tracking-Learning-Detection (TLD)

A “miracle”: Tracking a Transparent Object

Video credit: Helmut Grabner

H. Grabner, H. Bischof, On-line boosting and vision, CVPR 2006

Tracking the “Invisible”

H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010

Other Tracking Problems

… multiple object tracking …

Multi-object Tracking

Other Tracking Problems

Cell division

http://www.youtube.com/watch?v=rgLJrvoX_qo

Three rounds of cell division in Drosophila melanogaster

http://www.youtube.com/watch?v=YFKA647w4Jg

splitting and merging events …

So I want to track

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

What is here in libraries OpenCV

bull KLT tracker (1981)

bull CAMshift and Meanshift (1998)

bull TLD (2011)

bull MedianFLow (2010)

bull Boosting (2006)

bull MIL (2009)

and KCF (2012) in opencv_contrib38150

What is here in libraries BoofCV

bull SparseFlow (KLT tracker) (1991)

bull MeanShift (1998)

bull TLD (2011)

bull KCF (2012)

39150

httpsgithubcomlessthanoptimalBoofCV

Computer vision librar

Bad news OpenCV

40150httpwwwvotchallengenet

Reference implementation

41150

But authors often publish their own implementation on githubhellip

git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit

Good news

42150

Good news Not so

43150

Good news Not so

44150

Good newsEasy to contribute to open source

Just implement some modern tracking algorithm

45150

So we need to understand how tracking works

Classic

KLT tracker

46150

The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

slide creditTomas Svoboda 50150

Original Lucas-Kanade algorithm II

slide creditTomas Svoboda 51150

Original Lucas-Kanade algorithm III

slide creditTomas Svoboda 52150

Original Lucas-Kanade algorithm IV

slide creditTomas Svoboda 53150

Original Lucas-Kanade summary

slide credit Tomas Svoboda 54150

KLT Tracker

For good conditioning patch must be

texturedstructured enough

bull Uniform patch no information

bull Contour element aperture problem (one dimensional

information)

bull Corners blobs and texture best estimate

[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]

slide credit

Patrick Perez

55150

Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

ndash Translation is usually sufficient for small fragments but

bull Perspective transforms and occlusions cause drift and loss

ndash Two complementary options

bull Kill tracklets when minimum SSD too large

bull Compare as well with initial patch under affine transform (warp) assumption

slide credit

Patrick Perez

58150

Characteristics of KLT

bull cost function sum of squared intensity differences

between template and window

bull optimization technique gradient descent

bull model learning no update last frame convex

combination

bull attractive properties

ndashfast

ndasheasily extended to image-to-image transformations with

multiple parameters

59150

What about deep

learning or

CNN for tracking

60150

61150

AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo

Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016

Recap CaffeNet architecture

63

Recent trackers development

74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic

bull Henriques et al ndash CSK

ndash raw grayscale pixel values as features

bull Henriques et al ndash KCF

ndash HoG multi-channel features

Further work

bull Danelljan et al ndash DSST

ndash PCA-HoG + grayscale pixels features

ndash filters for translation and for scale (in the scale-space pyramid)

bull Li et al ndash SAMF

ndash HoG color-naming and grayscale pixels features

ndash quantize scale space and normalize each scale to one size by bilinear

inter rarr only one filter on normalized size

80

Discriminative Correlation Filters Trackers

bull Danelljan et al ndashSRDCF

ndash spatial regularization in the learning process

rarr limits boundary effect

rarr penalize filter coefficients depending on their spatial location

ndash allows to use much larger search region

ndash more discriminative to background (more training data)

CNN-based Correlation Trackers

bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT

2016)

bull Ma et al

ndash features VGG-Net pretrained on ImageNet dataset extracted from

third fourth and fifth convolution layer

ndash for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)

bull Nam et al ndash MDNet T-CNN

ndash CNN classification (3 convolution layers and 2 fully connected layers)

81

Evaluation of Trackers

Tracking Which methods work

83150

Tracking Which methods work

84150

What works ldquoThe zero-order trackerrdquo

85150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

3000

1500

51 Coauthors 14pgs

ICCV2013

57 Coauthors 27pgs

ECCV2014

128 Coauthors 24pgs

ICCV2015

141 Coauthors 44pgs

ECCV2016

+ VOT-TIRpaper

(69 coauth)

+ VOT-TIRpaper

(70 coauth)

VOT2014submission deadline

VOT2015submission deadline

VOT2016submission deadline

VOT2013submission deadline

8639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT challenge evolution

bull Gradual increase of dataset size

bull Gradual refinement of dataset construction

bull Gradual refinement of performance measures

bull Gradual increase of tested trackers

Perf Measures Dataset size Target box Property Trackers tested

VOT2013 ranks A R 16 s manual manual per frame 27

VOT2014 ranks A R EFO 25 s manual manual per frame 38

VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR

VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR

8739

Class of trackers tested

bull Single-object single-camera

bull Short-term

ndashTrackers performing without re-detection

bull Causality

ndashTracker is not allowed to use any future frames

bull No prior knowledge about the target

ndashOnly a single training example ndash BBox in the first frame

bull Object state encoded by a bounding box

88150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (13) Sequence candidates

ALOV (315 seq)[Smeulders et al2013]

Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences

Example Poorly defined target Example Artificially created

356 sequences

PTR (~50 seq)[Vojir et al2013]

+OTB (~50 seq)

[Wu et al2013]

+

gt30 new sequencesfrom VOT2015

committee

+

443sequences

VOT Automatic Dataset Construction Protocol

cluster + sample

8939

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (23) Clustering

bull Approximately annotate targets

bull 11 global attributes estimated

automatically for 356 sequences (eg blur camera motion object motion)

bull Cluster into K = 28 groups (automatic selection of K)

Feature encoding

11 dim

Affinity Propagation [Frey Dueck 2007]

Cluster similar sequences

9039

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (33) Sampling

bull Requirement

bull Diverse visual attributes

bull Challenging subset

bull Global visual attributes computed

bull Tracking difficulty attribute Applied FoT ASMS KCF trackers

bull Developed a sampling strategy that sampled

challenging sequences while keeping the global

attributes diverse

9139

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT201516 dataset 60 sequences

9239

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Object annotation

bull Automatic bounding box placement

1 Segment the target (semi-automatic)

2 Automatically fit a bounding box by optimizing a cost function

9339

Kristan et al VOT2016 results

Sequence ranking

bull Among the most challenging sequences

bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)

9442

Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)

Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT Results Realtime

bull Flow-based Mean Shift-based Correlation filters

bull Engineering use of basic features

2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)

2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)

ASMSBDF

FoT

2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)

PLTFoT

CCMS

9639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Results

bull C-COT slightly ahead of TCNN

bull Most accurate SSAT

bull Most robust C-COT and MLDF

Overlap curves

9742

(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple

C-COT

TCNNSSAT

MLDF

AR-raw plot

Acc

ura

cy

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Tracking speed

bull Top-performers slowest bull Plausible cause CNN

bull Real-time bound Staple+

bull Decent accuracy

bull Decent robustness

Note the speed in some Matlab trackers has been significantly underestimated by the toolkit since it was measuring also the Matlabrestart time The EFOs of Matlab trackers are in fact higher than stated in this figure

9842

C-COT TCNNSSAT MLDF

Staple+

Staple+

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT public resources

bull Resources publicly available VOT page

bull Raw results of all tested trackers

bull Relevant methodology papers

bull 2016 Submitted trackers codebinaries

bull All fully annotated datasets (2013-2016)

bull Documentation tutorials forum

9939

Summary

bull ldquoVisual Trackingrdquo may refer to quite different problems

bull The area is just starting to be affected by CNNs

bull Robustness at all levels is the road to reliable performance

bull Key components of trackers

ndash target learning (modelling ldquotemplate updaterdquo)

ndash integration of detection and temporal smoothness assumptions

ndash representation of the image and target

bull Be careful when evaluating tracking results

100150

Computer vision courses online

bull httpscwfelcvutczwikicoursesucuws17start UCU Winter

school course

bull httpcsbrowneducoursescs143

bull httpswwwudacitycomcourseintroduction-to-computer-vision-

-ud810

bull httpcs231nstanfordedu

bull httpvisionstanfordeduteachingcs131_fall1415indexhtml

101150

A bit of self-PR

Center for Machine Perception

Department of Cybernetics

Faculty of Electrical Engineering

Czech Technical University Prague

established 1707

Area of Expertise Computer Vision Image Processing Pattern Recognition

People12 Academics 3 full 2 associate 7 assistant professors15 Researchers15 Phd students 15 MSc and BSc students

Impact amp Excellencesignificant of funding (gt75) from EU and private companies collaboration with high-tech companies (Microsoft Google Toyota VW Daimler Samsung Boeing Hitachi Honeywell) numerous science prizes hundreds of ISI citations to our publications per year competitive results in contests

Visual Recognition Group (headJiri Matas)

We are looking for 1-3

good students for PhD

THANK YOU

Questions please

105150

Definition (2) Tracking

Given an initial estimate of its position

locate X in a sequence of images

Where X may mean

bull A (rectangular) region

bull An ldquointerest pointrdquo and its neighbourhood

bull An ldquoobjectrdquo

This definition is adopted eg in a recent book by

Maggio and Cavallaro Video Tracking 2011

Smeulders T-PAMI13

Tracking is the analysis of video sequences for the

purpose of establishing the location of the target

over a sequence of frames (time) starting from

the bounding box given in the first frame

20150

Formulation (3) Tracking as Segmentation

J Fan et al Closed-Loop Adaptation for Robust Tracking ECCV 2010

21150

Tracking as model-based segmentation

22150

Tracking as segmentation

httpvisionucsdedu~kbransonresearchcvpr2005html

httpwww2immdtudk~aamtracking

bull heart

23150

A ldquostandardrdquo CV tracking method output

24150

Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required

Formulation (4): Tracking

Given an initial estimate of the pose and state of X, in all images in a sequence (in a causal manner):
1. estimate the pose and state of X
2. (optionally) update the model of X

• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
  - a priori information (appearance, shape, dynamics, …)
  - labeled data (“track this”) + unlabeled data = the sequences
• Causal: for estimation at T, use information from time t ≤ T
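The formulation above can be sketched as a generic causal loop. This is only a minimal illustration, not a specific tracker: the function names `track`, `estimate_nearby` and `update_intensity`, the 1-D "frames", the search radius of 2 pixels and the update rate are all hypothetical choices made for the example.

```python
def track(frames, initial_pose, model, estimate, update=None):
    """Causal loop: the estimate for frame t uses only frames 0..t."""
    pose = initial_pose
    trajectory = []
    for frame in frames:
        pose = estimate(frame, pose, model)      # 1. estimate pose/state of X
        if update is not None:
            model = update(model, frame, pose)   # 2. (optionally) update the model
        trajectory.append(pose)
    return trajectory


def estimate_nearby(frame, previous_pose, model, radius=2):
    """Toy estimator: best-matching pixel within `radius` of the last pose."""
    lo = max(0, previous_pose - radius)
    hi = min(len(frame), previous_pose + radius + 1)
    return min(range(lo, hi), key=lambda i: abs(frame[i] - model))


def update_intensity(model, frame, pose, rate=0.1):
    """Toy model update: convex combination of old model and new observation."""
    return (1.0 - rate) * model + rate * frame[pose]


def make_frame(target_position, length=10, brightness=9.0):
    """A dark 1-D frame with a single bright target pixel."""
    frame = [0.0] * length
    frame[target_position] = brightness
    return frame


# The target moves one pixel per frame; the initial estimate is slightly off.
frames = [make_frame(p) for p in (3, 4, 5, 6)]
trajectory = track(frames, initial_pose=2, model=9.0,
                   estimate=estimate_nearby, update=update_intensity)
print(trajectory)
```

Passing `update=None` gives a tracker with a fixed model, which corresponds to skipping the optional step 2.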

Tracking for Black Mirror Blocking

Tracking in 6D

Tracking-Learning-Detection (TLD)

A “miracle”: Tracking a Transparent Object
Video credit: Helmut Grabner
H. Grabner, H. Bischof, On-line boosting and vision, CVPR 2006

Tracking the “Invisible”
H. Grabner, J. Matas, L. Van Gool, P. Cattin, Tracking the invisible: learning where the object might be, CVPR 2010
(video)

Other Tracking Problems
… multiple object tracking …

Multi-object Tracking

Other Tracking Problems
Cell division: http://www.youtube.com/watch?v=rgLJrvoX_qo
Three rounds of cell division in Drosophila melanogaster: http://www.youtube.com/watch?v=YFKA647w4Jg
splitting and merging events …

So I want to track…

Just link to some computer vision lib:

    import cv2                      # Python
or
    #include <opencv2/core.hpp>     // C++
or
    import org.opencv.core.Core;    // Java
    import org.opencv.core.Mat;

What is there in the libraries: OpenCV
• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib

What is there in the libraries: BoofCV
• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)
https://github.com/lessthanoptimal/BoofCV

Computer vision librar

Bad news: OpenCV
http://www.votchallenge.net

Reference implementation

But authors often publish their own implementation on GitHub…
git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news

Good news: Not so

Good news: Easy to contribute to open source
Just implement some modern tracking algorithm

So we need to understand how tracking works

Classic: KLT tracker

The KLT Tracker
Slide credit: Kris Kitani

KLT Tracker
Slide credit: Tomas Svoboda

Importance in Computer Vision
• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU was dedicated to this single algorithm, and the results were published in a premium journal [1]
• Part of a plethora of computer vision algorithms
(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment
The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]^T. I(x) could also be a small subwindow within an image.
Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm I
Original Lucas-Kanade algorithm II
Original Lucas-Kanade algorithm III
Original Lucas-Kanade algorithm IV
Original Lucas-Kanade summary
Slide credit: Tomas Svoboda
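As a rough illustration of the Lucas-Kanade idea in its simplest setting, the sketch below estimates a pure 1-D translation p by repeatedly linearising I(x + p) and solving the resulting least-squares problem for the update. It is a sketch under stated assumptions, not the algorithm exactly as presented in the slides: the "image" is a hypothetical continuous Gaussian bump (so it can be evaluated at non-integer positions without interpolation), and the names `image`, `image_gradient` and `lucas_kanade_shift` are made up for the example.

```python
import math


def image(x):
    """Continuous 1-D 'image': a Gaussian bump centred at x = 32 (assumed)."""
    return math.exp(-((x - 32.0) ** 2) / (2.0 * 5.0 ** 2))


def image_gradient(x, step=1e-3):
    """Central-difference approximation of dI/dx."""
    return (image(x + step) - image(x - step)) / (2.0 * step)


def lucas_kanade_shift(template, coords, p=0.0, iterations=10):
    """Estimate the shift p that minimises sum_x (I(x + p) - T(x))^2."""
    for _ in range(iterations):
        numerator = 0.0
        denominator = 0.0
        for x, t_value in zip(coords, template):
            gradient = image_gradient(x + p)       # dI/dx at the warped position
            residual = t_value - image(x + p)      # T(x) - I(x + p)
            numerator += gradient * residual
            denominator += gradient * gradient
        p += numerator / denominator               # least-squares update of p
    return p


# The template is the image content displaced by a known amount.
true_shift = 2.3
coords = list(range(12, 53))
template = [image(x + true_shift) for x in coords]
estimated_shift = lucas_kanade_shift(template, coords)
print(estimated_shift)
```

The denominator is the 1-D analogue of the gradient matrix whose conditioning decides whether a patch can be tracked: on a flat signal it is zero and the update is undefined.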

KLT Tracker
For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate
[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]
Slide credit: Patrick Perez
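The three cases above can be checked numerically with the eigenvalues of the 2x2 gradient matrix (structure tensor) summed over a patch: two zero eigenvalues for a uniform patch, one for an edge, none for a corner. The following is a minimal sketch on synthetic 8x8 patches; the function name `structure_tensor_eigenvalues` and the patch layout are assumptions made for illustration.

```python
import math


def structure_tensor_eigenvalues(patch):
    """Eigenvalues (small, large) of the 2x2 gradient matrix summed over a patch."""
    rows, cols = len(patch), len(patch[0])
    a = b = c = 0.0   # the matrix is [[a, b], [b, c]]
    for r in range(1, rows - 1):
        for col in range(1, cols - 1):
            ix = (patch[r][col + 1] - patch[r][col - 1]) / 2.0   # horizontal gradient
            iy = (patch[r + 1][col] - patch[r - 1][col]) / 2.0   # vertical gradient
            a += ix * ix
            b += ix * iy
            c += iy * iy
    mean = (a + c) / 2.0
    spread = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return mean - spread, mean + spread


size = 8
uniform = [[1.0] * size for _ in range(size)]
edge = [[1.0 if col >= 4 else 0.0 for col in range(size)] for _ in range(size)]
corner = [[1.0 if (r >= 4 and col >= 4) else 0.0 for col in range(size)]
          for r in range(size)]

for name, patch in (("uniform", uniform), ("edge", edge), ("corner", corner)):
    small, large = structure_tensor_eigenvalues(patch)
    print(name, round(small, 3), round(large, 3))
```

A feature selector in this spirit would keep only patches whose smaller eigenvalue exceeds a threshold, which rejects both the uniform patch and the edge.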

Aperture Problem Example
Image from Gary Bradski's slides

If we look through a small hole…
Video by Olha Mishkina

Multi-resolution Lucas-Kanade
- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid
Slide credit: James Hays

Monitoring quality
- Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption
Slide credit: Patrick Perez
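A minimal sketch of the two checks, assuming patches are given as flat lists of intensities: the frame-to-frame SSD catches sudden loss (e.g. occlusion), and the SSD against the initial patch catches slow drift. The function names and thresholds are hypothetical, and the affine warp of the initial patch mentioned above is deliberately left out to keep the example short.

```python
def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches."""
    return sum((a - b) ** 2 for a, b in zip(patch_a, patch_b))


def tracklet_is_reliable(initial_patch, previous_patch, current_patch,
                         max_step_ssd, max_drift_ssd):
    """Keep a tracklet only if it passes both the step check and the drift check."""
    if ssd(previous_patch, current_patch) > max_step_ssd:
        return False                      # sudden change: kill the tracklet
    return ssd(initial_patch, current_patch) <= max_drift_ssd


initial = [10, 20, 30, 40]

# Small change from the previous frame and still close to the initial patch.
stable = tracklet_is_reliable(initial, [11, 21, 29, 41], [12, 22, 28, 42], 10, 50)

# The patch suddenly goes dark, as under occlusion.
occluded = tracklet_is_reliable(initial, [11, 21, 29, 41], [0, 0, 0, 0], 10, 50)

# Each step is small, but the patch has wandered far from the initial appearance.
drifted = tracklet_is_reliable(initial, [17, 27, 37, 47], [18, 28, 38, 48], 10, 50)

print(stable, occluded, drifted)
```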

Characteristics of KLT
• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters

What about deep learning or CNNs for tracking?

AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016

Recap: CaffeNet architecture

Recent trackers development
https://github.com/foolwood/benchmark_results
(figure labels: CNN, KCF)

Discriminative Tracking (Tracking by Detection)
t = 0: samples with labels +1 +1 +1 −1 −1 −1 → Classifier
t > 0: … classify subwindows to find the target

Connection to Correlation

The Convolution Theorem: “Cross-correlation is equivalent to an element-wise product in the Fourier domain”

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$

• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
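The statement above can be verified numerically on a short 1-D signal. The sketch below compares a direct cyclic cross-correlation with the Fourier-domain product; it uses a naive O(n²) DFT written out with the standard library so that the example is self-contained (a real implementation would use an FFT). The helper names `dft`, `idft`, `circular_cross_correlation` and `correlation_via_dft` are made up for the example.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform."""
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    """Naive inverse discrete Fourier transform."""
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def circular_cross_correlation(x, w):
    """y[i] = sum_j x[j] * w[(i + j) mod n]; indices wrap around (cyclic)."""
    n = len(x)
    return [sum(x[j] * w[(i + j) % n] for j in range(n)) for i in range(n)]


def correlation_via_dft(x, w):
    """Same result via y_hat = conj(x_hat) * w_hat (element-wise)."""
    x_hat, w_hat = dft(x), dft(w)
    y_hat = [xk.conjugate() * wk for xk, wk in zip(x_hat, w_hat)]
    return [value.real for value in idft(y_hat)]


x = [1.0, 2.0, 0.0, -1.0, 3.0, 0.5]
w = [0.5, -1.0, 2.0, 0.0, 1.0, 4.0]
direct = circular_cross_correlation(x, w)
fast = correlation_via_dft(x, w)
print(max(abs(a - b) for a, b in zip(direct, fast)))
```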

Kernelized Correlation Filters
• Circulant matrices are a very general tool which allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied, e.g., to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual space representation:
  $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$
• For many kernels: circulant data ⇒ circulant $K$ matrix
  $K = C(\mathbf{k}^{\mathbf{x}\mathbf{x}})$, where $\mathbf{k}^{\mathbf{x}\mathbf{x}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields:
  $\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{x}\mathbf{x}} + \lambda}$
  Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
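To make the claim concrete, the sketch below computes the classifier in the Fourier domain and then checks that the result satisfies the dense system (K + λI)α = y built from the explicit circulant kernel matrix. It is an illustration under assumptions: a 1-D single-channel signal, a Gaussian kernel evaluated directly on cyclic shifts, a naive DFT instead of an FFT, and arbitrary example values for `x`, `sigma` and `lam`.

```python
import cmath
import math


def dft(signal):
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def gaussian_kernel_autocorrelation(x, sigma):
    """k[m] = exp(-||x - shift(x, m)||^2 / sigma^2) for every cyclic shift m."""
    n = len(x)
    return [math.exp(-sum((x[t] - x[(t + m) % n]) ** 2 for t in range(n)) / sigma ** 2)
            for m in range(n)]


def train_fourier(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat + lambda), returned in the signal domain."""
    k_hat = dft(gaussian_kernel_autocorrelation(x, sigma))
    y_hat = dft(y)
    alpha_hat = [yk / (kk + lam) for yk, kk in zip(y_hat, k_hat)]
    return [value.real for value in idft(alpha_hat)]


n = 8
x = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0, -1.0, 0.2]
y = [math.exp(-min(t, n - t) ** 2 / 2.0) for t in range(n)]   # peak at shift 0
sigma, lam = 2.0, 1e-2
alpha = train_fourier(x, y, sigma, lam)

# Dense check: K is circulant with first row k, so K[i][j] = k[(j - i) mod n].
k = gaussian_kernel_autocorrelation(x, sigma)
K = [[k[(j - i) % n] for j in range(n)] for i in range(n)]
residual = max(abs(sum(K[i][j] * alpha[j] for j in range(n)) + lam * alpha[i] - y[i])
               for i in range(n))
print(residual)
```

The dense check costs O(n²) memory and time, which is exactly what the Fourier-domain solution avoids.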

Kernelized Correlation Filters
• $\mathbf{k}^{\mathbf{x}\mathbf{x}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:
  $k_i^{\mathbf{x}\mathbf{x}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$
• For the Gaussian kernel it yields:
  $\mathbf{k}^{\mathbf{x}\mathbf{x}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{x}\mathbf{z}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{x}\mathbf{z}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features: multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term

Kernelized Correlation Filters: KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)

Training and detection (Matlab):

    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      % sum over the channel dimension (3) in the kernel computation
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
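For readers without Matlab, the same three functions can be sketched in Python for the 1-D, single-channel case. This is an illustrative port under assumptions, not the authors' released code: it uses a naive DFT from the standard library instead of `fft2`, drops the channel sum, and the test signal, `sigma` and `lam` values are arbitrary. With the argument order used on the slide, `kernel_correlation(z, x, ...)`, a target displaced by `shift` samples shows up as a response peak at index `-shift` modulo `n`.

```python
import cmath
import math


def dft(signal):
    n = len(signal)
    return [sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel for every relative cyclic shift of x1 and x2."""
    n = len(x1)
    product = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    c = [value.real for value in idft(product)]           # cyclic cross-correlation
    energy = sum(v * v for v in x1) + sum(v * v for v in x2)
    return [math.exp(-abs(energy - 2.0 * ci) / (sigma ** 2 * n)) for ci in c]


def train(x, y, sigma, lam):
    """Classifier in the Fourier domain: alpha_hat = y_hat / (k_hat + lambda)."""
    k_hat = dft(kernel_correlation(x, x, sigma))
    return [yk / (kk + lam) for yk, kk in zip(dft(y), k_hat)]


def detect(alpha_hat, x, z, sigma):
    """Response for all cyclic shifts of the new window z."""
    k_hat = dft(kernel_correlation(z, x, sigma))
    return [value.real for value in idft([a * k for a, k in zip(alpha_hat, k_hat)])]


n, shift = 32, 5
x = [math.exp(-(t - 10) ** 2 / 8.0) + 0.5 * math.exp(-(t - 20) ** 2 / 4.5)
     for t in range(n)]
y = [math.exp(-min(t, n - t) ** 2 / 8.0) for t in range(n)]   # desired response, peak at 0
z = [x[(t - shift) % n] for t in range(n)]                    # target moved by `shift`

alpha_hat = train(x, y, sigma=0.5, lam=1e-4)
response = detect(alpha_hat, x, z, sigma=0.5)
peak = max(range(n), key=lambda i: response[i])
print(peak)
```

A full tracker would convert the peak index back into a displacement, then blend `alpha_hat` and the model `x` with the values from the new frame by linear interpolation.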

From KCF to Discriminative CF trackers

Basic
• Henriques et al.: CSK
  - raw grayscale pixel values as features
• Henriques et al.: KCF
  - HoG multi-channel features

Further work
• Danelljan et al.: DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al.: SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers
• Danelljan et al.: SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows a much larger search region to be used
  - more discriminative to background (more training data)

CNN-based Correlation Trackers
• Danelljan et al.: Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)
• Nam et al.: MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? “The zero-order tracker”

Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution
(Timeline figure with axis ticks 1500 and 3000 and markers at the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines.)
• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• + VOT-TIR papers (69 and 70 coauthors)

VOT challenge evolution
• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

Challenge | Perf. measures   | Dataset size   | Target box | Property  | Trackers tested
VOT2013   | ranks, A, R      | 16, s. manual  | manual     | per frame | 27
VOT2014   | ranks, A, R, EFO | 25, s. manual  | manual     | per frame | 38
VOT2015   | EAO, A, R, EFO   | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016   | EAO, A, R, EFO   | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
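In the VOT methodology the accuracy measure (A) is based on the overlap between the predicted and the ground-truth bounding box, and the robustness measure (R) on the number of tracking failures. The sketch below shows a simplified version of both for axis-aligned boxes given as (x, y, width, height). It is an illustration only: the function names are hypothetical, a failure is taken to be a frame with zero overlap, and the re-initialisation of the tracker after a failure that the actual protocol performs is not modelled.

```python
def overlap(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Average overlap over tracked frames, and count of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


ground_truth = [(0, 0, 2, 2), (0, 0, 2, 2), (0, 0, 2, 2)]
predicted = [(0, 0, 2, 2), (1, 0, 2, 2), (5, 5, 1, 1)]   # perfect, half off, lost
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
print(accuracy, failures)
```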

Class of trackers tested
• Single-object, single-camera
• Short-term
  - Trackers performing without re-detection
• Causality
  - Tracker is not allowed to use any future frames
• No prior knowledge about the target
  - Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

VOT Automatic Dataset Construction Protocol: cluster + sample

Candidate pool: ALOV (315 seq.) [Smeulders et al., 2013] + PTR (~50 seq.) [Vojir et al., 2013] + OTB (~50 seq.) [Wu et al., 2013] + >30 new sequences from the VOT2015 committee = 443 sequences

Filtered out:
• Grayscale sequences
• Targets of <400 pixels
• Poorly-defined targets
• Artificially created sequences
(Examples shown: a poorly defined target; an artificially created sequence)

Remaining: 356 sequences

Construction (2/3): Clustering
• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)
Pipeline: feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences

Construction (3/3): Sampling
• Requirement:
  • Diverse visual attributes
  • Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation
• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function

Kristan et al., VOT2016 results

Sequence ranking
• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation

VOT Results: Realtime
• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results
• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF

Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple
(Figures: overlap curves; AR-raw plot with an accuracy axis)

VOT 2016 Tracking speed
• Top performers are the slowest
  • Plausible cause: CNN
• Real-time bound: Staple+
  • Decent accuracy
  • Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.
(Figure labels: C-COT, TCNN, SSAT, MLDF, Staple+)

VOT public resources
• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary
• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, “template update”)
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online
• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics, Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

Formulation (3) Tracking as Segmentation

J Fan et al Closed-Loop Adaptation for Robust Tracking ECCV 2010

21150

Tracking as model-based segmentation

22150

Tracking as segmentation

httpvisionucsdedu~kbransonresearchcvpr2005html

httpwww2immdtudk~aamtracking

bull heart

23150

A ldquostandardrdquo CV tracking method output

24150

Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required

Formulation (4) Tracking

Given an initial estimate of the pose and state of X

In all images in a sequence (in a causal manner)

1 estimate the pose and state of X

2 (optionally) update the model of X

bull Pose any geometric parameter (position scale hellip)

bull State appearance shapesegmentation visibility articulations

bull Model update essentially a semi-supervised learning problem

ndash a priori information (appearance shape dynamics hellip)

ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences

bull Causal for estimation at T use information from time t middot T

25150

Tracking for Black Mirror Blocking

26150

Tracking in 6D

27150

Tracking-Learning-Detection (TLD)

28150

A ldquomiraclerdquo Tracking a Transparent Object

video credit

Helmut

Grabner

H Grabner H Bischof On-line boosting and vision CVPR 2006

29150

Tracking the ldquoInvisiblerdquo

H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010

30150

video

Other Tracking Problems

helliphellip multiple object tracking hellip

32150

Multi-object Tracking

Other Tracking Problems

Cell division

httpwwwyoutubecomwatchv=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster

httpwwwyoutubecomwatchv=YFKA647w4Jg

splitting and merging events hellip

34150

So I want to track

35150

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

What is here in libraries OpenCV

bull KLT tracker (1981)

bull CAMshift and Meanshift (1998)

bull TLD (2011)

bull MedianFLow (2010)

bull Boosting (2006)

bull MIL (2009)

and KCF (2012) in opencv_contrib38150

What is here in libraries BoofCV

bull SparseFlow (KLT tracker) (1991)

bull MeanShift (1998)

bull TLD (2011)

bull KCF (2012)

39150

httpsgithubcomlessthanoptimalBoofCV

Computer vision librar

Bad news OpenCV

40150httpwwwvotchallengenet

Reference implementation

41150

But authors often publish their own implementation on githubhellip

git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit

Good news

42150

Good news Not so

43150

Good news Not so

44150

Good newsEasy to contribute to open source

Just implement some modern tracking algorithm

45150

So we need to understand how tracking works

Classic

KLT tracker

46150

The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

slide creditTomas Svoboda 50150

Original Lucas-Kanade algorithm II

slide creditTomas Svoboda 51150

Original Lucas-Kanade algorithm III

slide creditTomas Svoboda 52150

Original Lucas-Kanade algorithm IV

slide creditTomas Svoboda 53150

Original Lucas-Kanade summary

slide credit Tomas Svoboda 54150

KLT Tracker

For good conditioning patch must be

texturedstructured enough

bull Uniform patch no information

bull Contour element aperture problem (one dimensional

information)

bull Corners blobs and texture best estimate

[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]

slide credit

Patrick Perez

55150

Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

ndash Translation is usually sufficient for small fragments but

bull Perspective transforms and occlusions cause drift and loss

ndash Two complementary options

bull Kill tracklets when minimum SSD too large

bull Compare as well with initial patch under affine transform (warp) assumption

slide credit

Patrick Perez

58150

Characteristics of KLT

bull cost function sum of squared intensity differences

between template and window

bull optimization technique gradient descent

bull model learning no update last frame convex

combination

bull attractive properties

ndashfast

ndasheasily extended to image-to-image transformations with

multiple parameters

59150

What about deep

learning or

CNN for tracking

60150

61150

AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo

Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016

Recap CaffeNet architecture

63

Recent trackers development

74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic

bull Henriques et al ndash CSK

ndash raw grayscale pixel values as features

bull Henriques et al ndash KCF

ndash HoG multi-channel features

Further work

bull Danelljan et al ndash DSST

ndash PCA-HoG + grayscale pixels features

ndash filters for translation and for scale (in the scale-space pyramid)

bull Li et al ndash SAMF

ndash HoG color-naming and grayscale pixels features

ndash quantize scale space and normalize each scale to one size by bilinear

inter rarr only one filter on normalized size

80

Discriminative Correlation Filters Trackers

bull Danelljan et al ndashSRDCF

ndash spatial regularization in the learning process

rarr limits boundary effect

rarr penalize filter coefficients depending on their spatial location

ndash allows to use much larger search region

ndash more discriminative to background (more training data)

CNN-based Correlation Trackers

bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT

2016)

bull Ma et al

ndash features VGG-Net pretrained on ImageNet dataset extracted from

third fourth and fifth convolution layer

ndash for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)

bull Nam et al ndash MDNet T-CNN

ndash CNN classification (3 convolution layers and 2 fully connected layers)

81

Evaluation of Trackers

Tracking Which methods work

83150

Tracking Which methods work

84150

What works ldquoThe zero-order trackerrdquo

85150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

3000

1500

51 Coauthors 14pgs

ICCV2013

57 Coauthors 27pgs

ECCV2014

128 Coauthors 24pgs

ICCV2015

141 Coauthors 44pgs

ECCV2016

+ VOT-TIRpaper

(69 coauth)

+ VOT-TIRpaper

(70 coauth)

VOT2014submission deadline

VOT2015submission deadline

VOT2016submission deadline

VOT2013submission deadline

8639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT challenge evolution

bull Gradual increase of dataset size

bull Gradual refinement of dataset construction

bull Gradual refinement of performance measures

bull Gradual increase of tested trackers

Perf Measures Dataset size Target box Property Trackers tested

VOT2013 ranks A R 16 s manual manual per frame 27

VOT2014 ranks A R EFO 25 s manual manual per frame 38

VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR

VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR

8739

Class of trackers tested

bull Single-object single-camera

bull Short-term

ndashTrackers performing without re-detection

bull Causality

ndashTracker is not allowed to use any future frames

bull No prior knowledge about the target

ndashOnly a single training example ndash BBox in the first frame

bull Object state encoded by a bounding box

88150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (13) Sequence candidates

ALOV (315 seq)[Smeulders et al2013]

Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences

Example Poorly defined target Example Artificially created

356 sequences

PTR (~50 seq)[Vojir et al2013]

+OTB (~50 seq)

[Wu et al2013]

+

gt30 new sequencesfrom VOT2015

committee

+

443sequences

VOT Automatic Dataset Construction Protocol

cluster + sample

8939

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (23) Clustering

bull Approximately annotate targets

bull 11 global attributes estimated

automatically for 356 sequences (eg blur camera motion object motion)

bull Cluster into K = 28 groups (automatic selection of K)

Feature encoding

11 dim

Affinity Propagation [Frey Dueck 2007]

Cluster similar sequences

9039

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (33) Sampling

bull Requirement

bull Diverse visual attributes

bull Challenging subset

bull Global visual attributes computed

bull Tracking difficulty attribute Applied FoT ASMS KCF trackers

bull Developed a sampling strategy that sampled

challenging sequences while keeping the global

attributes diverse

9139

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT201516 dataset 60 sequences

9239

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Object annotation

bull Automatic bounding box placement

1 Segment the target (semi-automatic)

2 Automatically fit a bounding box by optimizing a cost function

9339

Kristan et al VOT2016 results

Sequence ranking

bull Among the most challenging sequences

bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)

9442

Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)

Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT Results Realtime

bull Flow-based Mean Shift-based Correlation filters

bull Engineering use of basic features

2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)

2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)

ASMSBDF

FoT

2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)

PLTFoT

CCMS

9639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Results

bull C-COT slightly ahead of TCNN

bull Most accurate SSAT

bull Most robust C-COT and MLDF

Overlap curves

9742

(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple

C-COT

TCNNSSAT

MLDF

AR-raw plot

Acc

ura

cy

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Tracking speed

bull Top-performers slowest bull Plausible cause CNN

bull Real-time bound Staple+

bull Decent accuracy

bull Decent robustness

Note the speed in some Matlab trackers has been significantly underestimated by the toolkit since it was measuring also the Matlabrestart time The EFOs of Matlab trackers are in fact higher than stated in this figure

9842

C-COT TCNNSSAT MLDF

Staple+

Staple+

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT public resources

bull Resources publicly available VOT page

bull Raw results of all tested trackers

bull Relevant methodology papers

bull 2016 Submitted trackers codebinaries

bull All fully annotated datasets (2013-2016)

bull Documentation tutorials forum

9939

Summary

• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  – target learning (modelling, “template update”)
  – integration of detection and temporal smoothness assumptions
  – representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)

Area of Expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 Academics (3 full, 2 associate, 7 assistant professors), 15 Researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant % of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

Tracking as model-based segmentation

Tracking as segmentation

http://vision.ucsd.edu/~kbranson/research/cvpr2005.html
http://www2.imm.dtu.dk/~aam/tracking/

• heart

A “standard” CV tracking method output

Approximate motion estimation, approximate segmentation. Neither good optic flow nor precise segmentation is required.

Formulation (4): Tracking

Given an initial estimate of the pose and state of X, in all images in a sequence (in a causal manner):

1. estimate the pose and state of X
2. (optionally) update the model of X

• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
  – a priori information (appearance, shape, dynamics, …)
  – labeled data (“track this”) + unlabeled data = the sequences
• Causal: for estimation at T, use information from time t ≤ T
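The two-step loop above can be sketched on a toy problem. This is only an illustration under stated assumptions: the "images" are 1-D lists, the pose is a single integer position, the model is a raw intensity template, and the names `track`, `ssd`, `make_frame`, `search_radius` and `learning_rate` are hypothetical, not taken from the lecture.

```python
def ssd(a, b):
    """Sum of squared differences between two equally long sequences."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def track(frames, init_pos, width, search_radius=2, learning_rate=0.0):
    """Causal loop: frame t is processed using only frames 0..t."""
    model = list(frames[0][init_pos:init_pos + width])  # the "track this" label
    pos = init_pos
    trajectory = [pos]
    for frame in frames[1:]:
        # 1. Estimate the pose: best SSD match near the previous position.
        lo = max(0, pos - search_radius)
        hi = min(len(frame) - width, pos + search_radius)
        pos = min(range(lo, hi + 1),
                  key=lambda c: ssd(model, frame[c:c + width]))
        # 2. Optionally update the model (learning_rate = 0 means no update).
        patch = frame[pos:pos + width]
        model = [(1 - learning_rate) * m + learning_rate * p
                 for m, p in zip(model, patch)]
        trajectory.append(pos)
    return trajectory


def make_frame(target_pos, length=12, pattern=(1.0, 3.0, 2.0)):
    """Synthetic 1-D frame: zero background with the pattern at target_pos."""
    frame = [0.0] * length
    frame[target_pos:target_pos + len(pattern)] = pattern
    return frame


frames = [make_frame(p) for p in (2, 3, 4, 5)]
trajectory = track(frames, init_pos=2, width=3)
print(trajectory)
```

Setting `learning_rate` between 0 and 1 blends the old model with the newest observation, which is one simple way to realise the optional model update.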

Tracking for Black Mirror Blocking

Tracking in 6D

Tracking-Learning-Detection (TLD)

A “miracle”: Tracking a Transparent Object

Video credit: Helmut Grabner

H. Grabner, H. Bischof. On-line boosting and vision. CVPR 2006.

Tracking the “Invisible”

H. Grabner, J. Matas, L. Van Gool, P. Cattin. Tracking the invisible: learning where the object might be. CVPR 2010.

[video]

Other Tracking Problems

… multiple object tracking …

Multi-object Tracking

Other Tracking Problems

Cell division

http://www.youtube.com/watch?v=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster

http://www.youtube.com/watch?v=YFKA647w4Jg

splitting and merging events …

So I want to track

Just link to some computer vision lib

from cv2 import tracker

or

#include <opencv2/core.hpp>

or

import org.opencv.core.Core;
import org.opencv.core.Mat;


What is in the libraries: OpenCV

• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)

and KCF (2012) in opencv_contrib

What is in the libraries: BoofCV

• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)

https://github.com/lessthanoptimal/BoofCV

Computer vision librar

Bad news: OpenCV

http://www.votchallenge.net

Reference implementation

Good news

But authors often publish their own implementation on GitHub…

git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news? Not so

Good news: easy to contribute to open source

Just implement some modern tracking algorithm

So we need to understand how tracking works

Classic: KLT tracker

The KLT Tracker

Slide credit: Kris Kitani

KLT Tracker

Slide credit: Tomas Svoboda

Importance in Computer Vision

• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [54]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms

(1) http://www.ces.clemson.edu/~stb/klt/
(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment

Slide credit: Tomas Svoboda

The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.
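The formulas of the following Lucas-Kanade slides did not survive as text, so as a reminder here is the commonly used statement of the alignment problem. The warp notation $W(\mathbf{x};\mathbf{p})$ and the parameter vector $\mathbf{p}$ are assumptions introduced for this sketch; for a pure translation, $W(\mathbf{x};\mathbf{p}) = \mathbf{x} + \mathbf{p}$.

```latex
% Objective: sum of squared differences between the warped image and the template
\min_{\mathbf{p}} \sum_{\mathbf{x}} \bigl[ I(W(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \bigr]^2

% One Gauss-Newton step: linearize I around the current p and solve for the increment
\Delta\mathbf{p} = H^{-1} \sum_{\mathbf{x}}
  \Bigl[ \nabla I \, \tfrac{\partial W}{\partial \mathbf{p}} \Bigr]^{\top}
  \bigl[ T(\mathbf{x}) - I(W(\mathbf{x};\mathbf{p})) \bigr],
\qquad
H = \sum_{\mathbf{x}}
  \Bigl[ \nabla I \, \tfrac{\partial W}{\partial \mathbf{p}} \Bigr]^{\top}
  \Bigl[ \nabla I \, \tfrac{\partial W}{\partial \mathbf{p}} \Bigr]
```

The update $\mathbf{p} \leftarrow \mathbf{p} + \Delta\mathbf{p}$ is repeated until the increment becomes negligible.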

Original Lucas-Kanade algorithm I–IV

Slide credit: Tomas Svoboda

Original Lucas-Kanade summary

Slide credit: Tomas Svoboda

KLT Tracker

For good conditioning, the patch must be textured/structured enough:

• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]

Slide credit: Patrick Perez
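One common way to quantify "textured enough" is to look at the two eigenvalues of the 2×2 matrix of summed gradient products over the patch: both near zero for a uniform patch, one near zero for a straight edge, both large for a corner. The sketch below is a minimal illustration; the function name `structure_tensor_eigenvalues`, the 6×6 toy patches and the central-difference gradients are assumptions, not the lecture's code.

```python
import math


def structure_tensor_eigenvalues(patch):
    """Return (min, max) eigenvalues of the 2x2 gradient matrix of a patch.

    patch is a list of rows of floats; gradients use central differences
    on the interior pixels only.
    """
    rows, cols = len(patch), len(patch[0])
    gxx = gxy = gyy = 0.0
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            ix = (patch[r][c + 1] - patch[r][c - 1]) / 2.0
            iy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
    # Closed-form eigenvalues of the symmetric matrix [[gxx, gxy], [gxy, gyy]].
    mean = (gxx + gyy) / 2.0
    spread = math.sqrt(((gxx - gyy) / 2.0) ** 2 + gxy ** 2)
    return mean - spread, mean + spread


size = 6
uniform = [[5.0] * size for _ in range(size)]
edge = [[1.0 if c >= 3 else 0.0 for c in range(size)] for _ in range(size)]
corner = [[1.0 if (r >= 3 and c >= 3) else 0.0 for c in range(size)]
          for r in range(size)]

for name, patch in (("uniform", uniform), ("edge", edge), ("corner", corner)):
    print(name, structure_tensor_eigenvalues(patch))
```

A tracker would typically keep only patches whose smaller eigenvalue exceeds some threshold; the threshold value is application-dependent.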

Aperture Problem Example

Image from Gary Bradski's slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

– Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

– Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
– Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  – fast
  – easily extended to image-to-image transformations with multiple parameters
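To make the cost function and its iterative minimisation concrete, here is a minimal 1-D sketch that estimates a pure translation with Gauss-Newton style steps on the SSD cost. It is an illustration only: the "image" is an analytic Gaussian bump so that no interpolation is needed, and `image`, `image_gradient` and `lucas_kanade_1d` are hypothetical names.

```python
import math


def image(x):
    """Toy 1-D 'image': a smooth Gaussian bump (an assumption for the demo)."""
    return math.exp(-((x - 10.0) ** 2) / 8.0)


def image_gradient(x, h=1e-3):
    """Central-difference derivative of the toy image."""
    return (image(x + h) - image(x - h)) / (2.0 * h)


def lucas_kanade_1d(template, xs, p=0.0, iterations=20):
    """Find the shift p minimising sum_x (image(x + p) - template(x))^2."""
    for _ in range(iterations):
        numerator = denominator = 0.0
        for x, t in zip(xs, template):
            g = image_gradient(x + p)            # gradient of the warped image
            numerator += g * (t - image(x + p))  # gradient times residual
            denominator += g * g                 # 1x1 Gauss-Newton "Hessian"
        if denominator == 0.0:                   # uniform window: no information
            break
        step = numerator / denominator
        p += step
        if abs(step) < 1e-10:
            break
    return p


template_xs = [float(i) for i in range(4, 14)]
true_shift = 1.5
template = [image(x + true_shift) for x in template_xs]
estimated_shift = lucas_kanade_1d(template, template_xs)
print(round(estimated_shift, 6))
```

The zero-denominator check is the 1-D analogue of the conditioning requirement: a window without gradient carries no information about the shift.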

What about deep learning or CNN for tracking?

Recap: CaffeNet architecture

AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”

Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016

Recent trackers development

https://github.com/foolwood/benchmark_results

[Figure labels: CNN, KCF]

Discriminative Tracking (Tracking by Detection)

• t = 0: samples with labels +1, +1, +1, −1, −1, −1 are used to train the classifier
• t > 0: … classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

“Cross-correlation is equivalent to an element-wise product in the Fourier domain”

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$

• where
  – $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  – $\times$ is the element-wise product
  – $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
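The statement can be checked numerically on a short 1-D signal. The sketch below compares the direct cyclic cross-correlation with the Fourier-domain product; it uses a naive O(n²) DFT written with the standard library so that it stays self-contained (a real implementation would use an FFT). All function names are assumptions for this illustration.

```python
import cmath


def dft(v):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(v)
    return [sum(v[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]


def idft(spectrum):
    """Naive inverse DFT."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * k / n)
                for f in range(n)) / n
            for k in range(n)]


def cyclic_cross_correlation(x, w):
    """Direct definition: y[s] = sum_k x[k] * w[(k + s) mod n]."""
    n = len(x)
    return [sum(x[k] * w[(k + s) % n] for k in range(n)) for s in range(n)]


def cross_correlation_via_dft(x, w):
    """Same result through the Fourier domain: y_hat = conj(x_hat) * w_hat."""
    x_hat, w_hat = dft(x), dft(w)
    y_hat = [a.conjugate() * b for a, b in zip(x_hat, w_hat)]
    return [value.real for value in idft(y_hat)]


x = [1.0, 5.0, 2.0, 0.0, 0.0, 0.0]
shift = 2
w = [x[(k - shift) % len(x)] for k in range(len(x))]  # x moved right by 2, cyclically

direct = cyclic_cross_correlation(x, w)
fourier = cross_correlation_via_dft(x, w)
peak = max(range(len(x)), key=lambda s: fourier[s])
print(peak)
```

The peak of the correlation output sits at the cyclic shift between the two signals, which is exactly what correlation-filter trackers exploit.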

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied, e.g., to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual-space representation:

  $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$

• For many kernels, circulant data ⇒ circulant K matrix:

  $K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields

  $\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$

  ⇒ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
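A small 1-D sketch can make the Fourier-domain training formula tangible. It assumes a linear kernel, for which the kernel correlation reduces to plain cyclic cross-correlation, and it uses a naive O(n²) DFT from the standard library so that the example is self-contained (so it does not demonstrate the O(n log n) speed, only the algebra). The names `train`, `detect` and `linear_kernel_correlation`, the signal `x`, the label width and `lam` are all assumptions for this illustration.

```python
import cmath
import math


def dft(v):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(v)
    return [sum(v[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]


def idft(spectrum):
    """Naive inverse DFT."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * k / n)
                for f in range(n)) / n
            for k in range(n)]


def linear_kernel_correlation(x, z):
    """k[s] = sum_k x[k] * z[(k + s) mod n], computed as F^-1(conj(x_hat) * z_hat)."""
    x_hat, z_hat = dft(x), dft(z)
    return [v.real for v in idft([a.conjugate() * b for a, b in zip(x_hat, z_hat)])]


def train(x, y, lam):
    """alpha_hat = y_hat / (k_hat_xx + lambda), element-wise."""
    k_hat = dft(linear_kernel_correlation(x, x))
    y_hat = dft(y)
    return [yh / (kh + lam) for yh, kh in zip(y_hat, k_hat)]


def detect(alpha_hat, x, z):
    """Response over all cyclic shifts: F^-1(k_hat_xz * alpha_hat)."""
    k_hat = dft(linear_kernel_correlation(x, z))
    return [v.real for v in idft([kh * ah for kh, ah in zip(k_hat, alpha_hat)])]


n = 8
x = [0.0, 1.0, 3.0, 2.0, 0.5, -1.0, 0.2, 0.7]               # training patch
y = [math.exp(-0.5 * min(k, n - k) ** 2) for k in range(n)]  # labels peaked at shift 0
lam = 1e-3
alpha_hat = train(x, y, lam)

true_shift = 3
z = [x[(k - true_shift) % n] for k in range(n)]              # target moved by 3
response = detect(alpha_hat, x, z)
estimated_shift = max(range(n), key=lambda s: response[s])
print(estimated_shift)
```

The division in `train` is the whole "solver": because the kernel matrix is circulant, inverting $K + \lambda I$ becomes an element-wise division in the Fourier domain.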

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:

  $k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$

• For the Gaussian kernel it yields

  $\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$

  (multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term)

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features
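The Gaussian kernel correlation can be sanity-checked by comparing the Fourier-domain formula with a brute-force evaluation of the Gaussian kernel between x and every cyclic shift of x'. This is a sketch under assumptions: single-channel 1-D signals, a naive standard-library DFT instead of an FFT, and hypothetical function names. Unlike the Matlab code on the next slide, it does not divide the distance by the number of elements.

```python
import cmath
import math


def dft(v):
    """Naive O(n^2) discrete Fourier transform."""
    n = len(v)
    return [sum(v[k] * cmath.exp(-2j * cmath.pi * f * k / n) for k in range(n))
            for f in range(n)]


def idft(spectrum):
    """Naive inverse DFT."""
    n = len(spectrum)
    return [sum(spectrum[f] * cmath.exp(2j * cmath.pi * f * k / n)
                for f in range(n)) / n
            for k in range(n)]


def gaussian_kernel_correlation(x, z, sigma):
    """exp(-(|x|^2 + |z|^2 - 2 F^-1(conj(x_hat) * z_hat)) / sigma^2), per shift."""
    x_hat, z_hat = dft(x), dft(z)
    cross = [v.real for v in idft([a.conjugate() * b for a, b in zip(x_hat, z_hat)])]
    xx = sum(v * v for v in x)
    zz = sum(v * v for v in z)
    # max(..., 0) guards against tiny negative distances caused by rounding.
    return [math.exp(-max(xx + zz - 2.0 * c, 0.0) / sigma ** 2) for c in cross]


def gaussian_kernel_direct(x, z, sigma):
    """Brute force: Gaussian kernel between x and each cyclic shift of z."""
    n = len(x)
    return [math.exp(-sum((x[k] - z[(k + s) % n]) ** 2 for k in range(n)) / sigma ** 2)
            for s in range(n)]


x = [0.0, 1.0, 3.0, 2.0, 0.5, -1.0]
z = [0.3, 0.9, 2.5, 2.2, 0.1, -0.8]
fast = gaussian_kernel_correlation(x, z, sigma=2.0)
slow = gaussian_kernel_direct(x, z, sigma=2.0)
print(max(abs(a - b) for a, b in zip(fast, slow)))
```

The brute-force version needs one full distance computation per shift, while the Fourier version gets all shifts from a single cross-correlation.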

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)

Training and detection (Matlab); note the sum over the channel dimension in the kernel computation:

function alphaf = train(x, y, sigma, lambda)
  % x: training patch features, y: desired (Gaussian-shaped) response
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);
end

function y = detect(alphaf, x, z, sigma)
  % z: test patch features, x: learned model; y: response map
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));
end

function k = kernel_correlation(x1, x2, sigma)
  % Gaussian kernel correlation; sum(..., 3) adds up the feature channels
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end

From KCF to Discriminative CF trackers

Basic

• Henriques et al. – CSK
  – raw grayscale pixel values as features
• Henriques et al. – KCF
  – HoG multi-channel features

Further work

• Danelljan et al. – DSST
  – PCA-HoG + grayscale pixel features
  – filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
  – HoG, color-naming and grayscale pixel features
  – quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. – SRDCF
  – spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  – allows the use of a much larger search region
  – more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  – features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  – for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al. – MDNet, T-CNN
  – CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? “The zero-order tracker”

VOT community evolution

• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• plus VOT-TIR papers (69 and 70 coauthors)

[Figure: timeline marking the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines]

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

        | Perf. measures  | Dataset size   | Target box | Property  | Trackers tested
VOT2013 | ranks, A, R     | 16, s. manual  | manual     | per frame | 27
VOT2014 | ranks, A, R, EFO | 25, s. manual | manual     | per frame | 38
VOT2015 | EAO, A, R, EFO  | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016 | EAO, A, R, EFO  | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR

Class of trackers tested

• Single-object, single-camera
• Short-term
  – Trackers performing without re-detection
• Causality
  – Tracker is not allowed to use any future frames
• No prior knowledge about the target
  – Only a single training example – BBox in the first frame
• Object state encoded by a bounding box
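Since the object state is a bounding box, a natural per-frame quality measure is the overlap (intersection over union) between the predicted and the ground-truth box. The sketch below assumes axis-aligned boxes in (x, y, width, height) form, which is a simplification chosen for this example; the function names are hypothetical and this is not the VOT toolkit code.

```python
def overlap(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def average_overlap(predicted, ground_truth):
    """Mean per-frame overlap over a sequence of box pairs."""
    pairs = list(zip(predicted, ground_truth))
    return sum(overlap(p, g) for p, g in pairs) / len(pairs)


predicted = [(0, 0, 2, 2), (1, 1, 2, 2), (5, 5, 1, 1)]
ground_truth = [(0, 0, 2, 2), (0, 0, 2, 2), (0, 0, 2, 2)]
print([round(overlap(p, g), 3) for p, g in zip(predicted, ground_truth)])
```

An overlap of zero, as in the last pair, is the kind of event a benchmark would count as a tracking failure.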

Construction (1/3): Sequence candidates

VOT Automatic Dataset Construction Protocol: cluster + sample

ALOV (315 seq.) [Smeulders et al. 2013] + PTR (~50 seq.) [Vojir et al. 2013] + OTB (~50 seq.) [Wu et al. 2013] + >30 new sequences from the VOT2015 committee → 443 sequences

Filtered out:
• Grayscale sequences
• <400 pixels targets
• Poorly-defined targets
• Artificially created sequences

→ 356 sequences

[Examples shown: poorly defined target; artificially created sequence]

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)

Feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences

Construction (3/3): Sampling

• Requirement:
  – Diverse visual attributes
  – Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)


Tracking as segmentation

httpvisionucsdedu~kbransonresearchcvpr2005html

httpwww2immdtudk~aamtracking

bull heart

23150

A ldquostandardrdquo CV tracking method output

24150

Approximate motion estimation approximate segmentationNeither good optic flow neither precise segmentation required

Formulation (4) Tracking

Given an initial estimate of the pose and state of X

In all images in a sequence (in a causal manner)

1 estimate the pose and state of X

2 (optionally) update the model of X

bull Pose any geometric parameter (position scale hellip)

bull State appearance shapesegmentation visibility articulations

bull Model update essentially a semi-supervised learning problem

ndash a priori information (appearance shape dynamics hellip)

ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences

bull Causal for estimation at T use information from time t middot T

25150

Tracking for Black Mirror Blocking

26150

Tracking in 6D

27150

Tracking-Learning-Detection (TLD)

28150

A ldquomiraclerdquo Tracking a Transparent Object

video credit

Helmut

Grabner

H Grabner H Bischof On-line boosting and vision CVPR 2006

29150

Tracking the ldquoInvisiblerdquo

H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010

30150

video

Other Tracking Problems

helliphellip multiple object tracking hellip

32150

Multi-object Tracking

Other Tracking Problems

Cell division

httpwwwyoutubecomwatchv=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster

httpwwwyoutubecomwatchv=YFKA647w4Jg

splitting and merging events hellip

34150

So I want to track

35150

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

What is here in libraries OpenCV

bull KLT tracker (1981)

bull CAMshift and Meanshift (1998)

bull TLD (2011)

bull MedianFLow (2010)

bull Boosting (2006)

bull MIL (2009)

and KCF (2012) in opencv_contrib38150

What is here in libraries BoofCV

bull SparseFlow (KLT tracker) (1991)

bull MeanShift (1998)

bull TLD (2011)

bull KCF (2012)

39150

httpsgithubcomlessthanoptimalBoofCV

Computer vision librar

Bad news OpenCV

40150httpwwwvotchallengenet

Reference implementation

41150

But authors often publish their own implementation on githubhellip

git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit

Good news

42150

Good news Not so

43150

Good news Not so

44150

Good newsEasy to contribute to open source

Just implement some modern tracking algorithm

45150

So we need to understand how tracking works

Classic

KLT tracker

46150

The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

slide creditTomas Svoboda 50150

Original Lucas-Kanade algorithm II

slide creditTomas Svoboda 51150

Original Lucas-Kanade algorithm III

slide creditTomas Svoboda 52150

Original Lucas-Kanade algorithm IV

slide creditTomas Svoboda 53150

Original Lucas-Kanade summary

slide credit Tomas Svoboda 54150

KLT Tracker

For good conditioning patch must be

texturedstructured enough

bull Uniform patch no information

bull Contour element aperture problem (one dimensional

information)

bull Corners blobs and texture best estimate

[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]

slide credit

Patrick Perez

55150

Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

ndash Translation is usually sufficient for small fragments but

bull Perspective transforms and occlusions cause drift and loss

ndash Two complementary options

bull Kill tracklets when minimum SSD too large

bull Compare as well with initial patch under affine transform (warp) assumption

slide credit

Patrick Perez

58150

Characteristics of KLT

bull cost function sum of squared intensity differences

between template and window

bull optimization technique gradient descent

bull model learning no update last frame convex

combination

bull attractive properties

ndashfast

ndasheasily extended to image-to-image transformations with

multiple parameters

59150

What about deep

learning or

CNN for tracking

60150

61150

AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo

Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016

Recap CaffeNet architecture

63

Recent trackers development

74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic

bull Henriques et al ndash CSK

ndash raw grayscale pixel values as features

bull Henriques et al ndash KCF

ndash HoG multi-channel features

Further work

bull Danelljan et al ndash DSST

ndash PCA-HoG + grayscale pixels features

ndash filters for translation and for scale (in the scale-space pyramid)

bull Li et al ndash SAMF

ndash HoG color-naming and grayscale pixels features

ndash quantize scale space and normalize each scale to one size by bilinear

inter rarr only one filter on normalized size

80

Discriminative Correlation Filters Trackers

bull Danelljan et al ndashSRDCF

ndash spatial regularization in the learning process

rarr limits boundary effect

rarr penalize filter coefficients depending on their spatial location

ndash allows to use much larger search region

ndash more discriminative to background (more training data)

CNN-based Correlation Trackers

bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT

2016)

bull Ma et al

ndash features VGG-Net pretrained on ImageNet dataset extracted from

third fourth and fifth convolution layer

ndash for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)

bull Nam et al ndash MDNet T-CNN

ndash CNN classification (3 convolution layers and 2 fully connected layers)

81

Evaluation of Trackers

Tracking Which methods work

83150

Tracking Which methods work

84150

What works ldquoThe zero-order trackerrdquo

85150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

3000

1500

51 Coauthors 14pgs

ICCV2013

57 Coauthors 27pgs

ECCV2014

128 Coauthors 24pgs

ICCV2015

141 Coauthors 44pgs

ECCV2016

+ VOT-TIRpaper

(69 coauth)

+ VOT-TIRpaper

(70 coauth)

VOT2014submission deadline

VOT2015submission deadline

VOT2016submission deadline

VOT2013submission deadline

8639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT challenge evolution

bull Gradual increase of dataset size

bull Gradual refinement of dataset construction

bull Gradual refinement of performance measures

bull Gradual increase of tested trackers

Perf Measures Dataset size Target box Property Trackers tested

VOT2013 ranks A R 16 s manual manual per frame 27

VOT2014 ranks A R EFO 25 s manual manual per frame 38

VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR

VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR

8739

Class of trackers tested

bull Single-object single-camera

bull Short-term

ndashTrackers performing without re-detection

bull Causality

ndashTracker is not allowed to use any future frames

bull No prior knowledge about the target

ndashOnly a single training example ndash BBox in the first frame

bull Object state encoded by a bounding box

88150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (13) Sequence candidates

ALOV (315 seq)[Smeulders et al2013]

Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences

Example Poorly defined target Example Artificially created

356 sequences

PTR (~50 seq)[Vojir et al2013]

+OTB (~50 seq)

[Wu et al2013]

+

gt30 new sequencesfrom VOT2015

committee

+

443sequences


A "standard" CV tracking method output

Approximate motion estimation, approximate segmentation. Neither good optical flow nor precise segmentation is required.

Formulation (4): Tracking

Given an initial estimate of the pose and state of X:

In all images in a sequence (in a causal manner)

1. estimate the pose and state of X
2. (optionally) update the model of X

• Pose: any geometric parameter (position, scale, …)
• State: appearance, shape/segmentation, visibility, articulations
• Model update: essentially a semi-supervised learning problem
  - a priori information (appearance, shape, dynamics, …)
  - labeled data ("track this") + unlabeled data = the sequences
• Causal: for estimation at T, use information from time t ≤ T
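The formulation above can be read as a simple loop. The following is a minimal sketch, not a method from the lecture: `track`, `estimate` and `update` are hypothetical names, and the toy estimator (brightest pixel near the previous position in a 1-D frame) is an assumption chosen only to make the loop runnable.

```python
def track(frames, initial_pose, estimate, update=None):
    """Causal tracking loop: the pose at time T uses only frames with t <= T."""
    pose, model = initial_pose, None
    trajectory = []
    for frame in frames:                      # frames arrive in temporal order
        pose = estimate(frame, pose, model)   # 1. estimate the pose/state of X
        if update is not None:                # 2. (optionally) update the model of X
            model = update(frame, pose, model)
        trajectory.append(pose)
    return trajectory


def brightest_near(frame, previous_pose, model, radius=2):
    """Toy estimator (assumption): brightest pixel within `radius` of the last pose."""
    lo = max(0, previous_pose - radius)
    hi = min(len(frame), previous_pose + radius + 1)
    return max(range(lo, hi), key=lambda i: frame[i])


# Four 1-D "images" in which a single bright target moves 2 -> 3 -> 5 -> 6.
frames = [[1 if i == p else 0 for i in range(8)] for p in (2, 3, 5, 6)]
trajectory = track(frames, initial_pose=2, estimate=brightest_near)
```

The loop never looks at a later frame, which is the causality requirement; a real tracker would replace `brightest_near` with an appearance model and pass an `update` function.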

Tracking for Black Mirror Blocking

Tracking in 6D

Tracking-Learning-Detection (TLD)

A "miracle": Tracking a Transparent Object

Video credit: Helmut Grabner

H. Grabner, H. Bischof: On-line Boosting and Vision. CVPR 2006

Tracking the "Invisible"

H. Grabner, J. Matas, L. Van Gool, P. Cattin: Tracking the Invisible: Learning Where the Object Might Be. CVPR 2010

Other Tracking Problems

… multiple object tracking …

Multi-object Tracking

Other Tracking Problems

Cell division

http://www.youtube.com/watch?v=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster

http://www.youtube.com/watch?v=YFKA647w4Jg

splitting and merging events …

So I want to track

Just link to some computer vision lib

import cv2

or

#include <opencv2/core.hpp>

or

import org.opencv.core.Core;
import org.opencv.core.Mat;


What is available in libraries: OpenCV

• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)

and KCF (2012) in opencv_contrib

What is available in libraries: BoofCV

• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)

https://github.com/lessthanoptimal/BoofCV

Computer vision librar

Bad news: OpenCV

http://www.votchallenge.net

Reference implementation

But authors often publish their own implementation on GitHub…

git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news

Good news? Not so

Good news: easy to contribute to open source

Just implement some modern tracking algorithm

So we need to understand how tracking works

Classic

KLT tracker

The KLT Tracker

Slide credit: Kris Kitani

KLT Tracker

Slide credit: Tomas Svoboda

Importance in Computer Vision

• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU was dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms

(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment

Slide credit: Tomas Svoboda

The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.

Original Lucas-Kanade algorithm I

Original Lucas-Kanade algorithm II

Original Lucas-Kanade algorithm III

Original Lucas-Kanade algorithm IV

Original Lucas-Kanade summary

Slide credit (Lucas-Kanade slides): Tomas Svoboda

KLT Tracker

For good conditioning, the patch must be textured/structured enough:

• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
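The three cases above can be checked numerically with the eigenvalues of the 2x2 gradient (structure) matrix of a patch, which is the matrix a Lucas-Kanade step has to invert. This is a small sketch under assumptions: the helper names, the 7x7 synthetic patches and the central-difference gradients are illustrative choices, not taken from the slides.

```python
import math


def structure_tensor_eigenvalues(patch):
    """Return (min, max) eigenvalues of the summed 2x2 gradient matrix of a patch."""
    h, w = len(patch), len(patch[0])
    gxx = gxy = gyy = 0.0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = 0.5 * (patch[r][c + 1] - patch[r][c - 1])  # horizontal gradient
            gy = 0.5 * (patch[r + 1][c] - patch[r - 1][c])  # vertical gradient
            gxx += gx * gx
            gxy += gx * gy
            gyy += gy * gy
    # Closed-form eigenvalues of the symmetric matrix [[gxx, gxy], [gxy, gyy]].
    mean = 0.5 * (gxx + gyy)
    spread = math.sqrt(0.25 * (gxx - gyy) ** 2 + gxy * gxy)
    return mean - spread, mean + spread


def make_patch(kind, size=7):
    """Synthetic test patches (assumption): 'uniform', 'edge' or 'corner'."""
    half = size // 2
    if kind == "uniform":
        return [[1.0] * size for _ in range(size)]
    if kind == "edge":
        return [[1.0 if c >= half else 0.0 for c in range(size)] for _ in range(size)]
    return [[1.0 if (r >= half and c >= half) else 0.0 for c in range(size)]
            for r in range(size)]


uniform = structure_tensor_eigenvalues(make_patch("uniform"))  # both eigenvalues zero
edge = structure_tensor_eigenvalues(make_patch("edge"))        # one zero, one positive
corner = structure_tensor_eigenvalues(make_patch("corner"))    # both positive
```

Only the corner patch has two clearly positive eigenvalues, so only there is the matrix well conditioned and the displacement fully determined; the edge patch shows the aperture problem as a vanishing smaller eigenvalue.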

Slide credit: Patrick Perez

Aperture Problem Example

Image from Gary Bradski's slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

- Translation is usually sufficient for small fragments, but
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters
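To make the cost function and its minimization concrete, here is a minimal 1-D, translation-only sketch. It is an illustration under assumptions rather than the lecture's implementation: the function names, the linear interpolation, the central-difference gradient and the synthetic Gaussian-bump signal are all hypothetical choices, and the update is the Gauss-Newton step usually associated with Lucas-Kanade.

```python
import math


def sample(signal, x):
    """Linearly interpolate `signal` at real-valued position x (clamped at the borders)."""
    x = min(max(x, 0.0), len(signal) - 1.0)
    i = min(int(x), len(signal) - 2)
    frac = x - i
    return (1.0 - frac) * signal[i] + frac * signal[i + 1]


def lucas_kanade_1d(template, image, p=0.0, iterations=20, tolerance=1e-6):
    """Find the shift p minimizing sum_x (image(x + p) - template(x))^2."""
    for _ in range(iterations):
        numerator = denominator = 0.0
        for x in range(len(template)):
            warped = sample(image, x + p)
            gradient = 0.5 * (sample(image, x + p + 1) - sample(image, x + p - 1))
            numerator += gradient * (template[x] - warped)
            denominator += gradient * gradient
        if denominator == 0.0:      # uniform signal: no information, cannot update
            break
        step = numerator / denominator   # Gauss-Newton update of the shift
        p += step
        if abs(step) < tolerance:
            break
    return p


# Synthetic data (assumption): a Gaussian bump and a copy displaced by 2.5 samples.
true_shift = 2.5
image = [math.exp(-((x - 30.0) ** 2) / 50.0) for x in range(64)]
template = [math.exp(-((x + true_shift - 30.0) ** 2) / 50.0) for x in range(64)]
estimated_shift = lucas_kanade_1d(template, image)
```

The `denominator == 0.0` guard is the 1-D analogue of the conditioning requirement: without gradient in the window there is nothing to align.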

What about deep learning or CNN for tracking?

AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"

Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016

Recap: CaffeNet architecture

Recent trackers development

https://github.com/foolwood/benchmark_results

Figure labels: CNN, KCF

Discriminative Tracking (Tracking by Detection)

• t = 0: samples with labels +1, +1, +1, -1, -1, -1 are used to train a classifier
• t > 0: classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

"Cross-correlation is equivalent to an element-wise product in Fourier domain"

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$

• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
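The identity above can be verified numerically on a short signal. The sketch below uses a naive pure-Python DFT so that it runs without extra packages; the helper names and the convention y[k] = sum over t of x[t] w[(t + k) mod n] are assumptions made for this example (a real tracker would use an FFT library).

```python
import cmath


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (inverse includes the 1/n factor)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [sum(values[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [c / n for c in out] if inverse else out


def circular_cross_correlation(x, w):
    """Direct cyclic cross-correlation: y[k] = sum_t x[t] * w[(t + k) mod n]."""
    n = len(x)
    return [sum(x[t] * w[(t + k) % n] for t in range(n)) for k in range(n)]


def correlation_via_dft(x, w):
    """Same result through the Fourier domain: y_hat = conj(x_hat) * w_hat."""
    x_hat, w_hat = dft(x), dft(w)
    y_hat = [a.conjugate() * b for a, b in zip(x_hat, w_hat)]
    return [c.real for c in dft(y_hat, inverse=True)]


x = [1.0, 2.0, 3.0, 4.0]
w = [0.0, 1.0, 0.5, 0.0]
direct = circular_cross_correlation(x, w)
fourier = correlation_via_dft(x, w)
```

Both lists agree up to floating-point error, and the indices wrap around modulo the signal length, which is the cyclic behaviour noted in the last bullet.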

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows replacing standard operations with fast Fourier operations
• The same idea can be applied, e.g., to Kernel Ridge Regression, with kernel matrix $K$, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and dual-space representation:

$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$

• For many kernels, circulant data $\Rightarrow$ circulant $K$ matrix:

$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields

$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$

$\Rightarrow$ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:

$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$

• For the Gaussian kernel it yields

$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\dfrac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$

  - multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\mathcal{F}^{-1}$ term

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:

1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$

• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open-source (Matlab/Python/Java/C)

Training and detection (Matlab)

function alphaf = train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);
end

function y = detect(alphaf, x, z, sigma)
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));
end

function k = kernel_correlation(x1, x2, sigma)
  % sum(..., 3): sum over the channel dimension in the kernel computation
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end
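For readers without Matlab, the same training and detection steps can be sketched in plain Python on a 1-D, single-channel signal. This is a toy port under assumptions, not the authors' code: it uses a naive DFT instead of an FFT, the signal, `sigma` and `lam` values are arbitrary, and the argument order inside `detect` is chosen so that, with this sketch's correlation convention, the response peaks at the index equal to the cyclic shift of the target.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) DFT; stands in for fft/ifft to keep the sketch dependency-free."""
    n = len(values)
    sign = 1 if inverse else -1
    out = [sum(values[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [c / n for c in out] if inverse else out


def kernel_correlation(a, b, sigma):
    """Gaussian kernel between `a` and every cyclic shift of `b`."""
    n = len(a)
    product = [p.conjugate() * q for p, q in zip(dft(a), dft(b))]
    c = [v.real for v in dft(product, inverse=True)]      # c[k] = sum_t a[t] * b[t + k]
    energy = sum(v * v for v in a) + sum(v * v for v in b)
    return [math.exp(-abs(energy - 2.0 * ck) / (sigma ** 2 * n)) for ck in c]


def train(x, y, sigma, lam):
    """Ridge regression in the Fourier domain: alpha_hat = y_hat / (k_hat + lambda)."""
    k_hat = dft(kernel_correlation(x, x, sigma))
    return [yf / (kf + lam) for yf, kf in zip(dft(y), k_hat)]


def detect(alphaf, x, z, sigma):
    """Response of the learned filter for all cyclic shifts of the new patch z."""
    k_hat = dft(kernel_correlation(x, z, sigma))
    return [v.real for v in dft([a * k for a, k in zip(alphaf, k_hat)], inverse=True)]


n, shift = 16, 5
x = [math.sin(0.7 * t) + 0.3 * math.cos(2.1 * t) for t in range(n)]   # training patch
y = [math.exp(-0.5 * min(t, n - t) ** 2) for t in range(n)]           # labels peaked at 0
alphaf = train(x, y, sigma=0.5, lam=1e-4)
z = x[-shift:] + x[:-shift]                  # target moved by `shift` samples (cyclically)
response = detect(alphaf, x, z, sigma=0.5)
peak = response.index(max(response))
```

The position of the response maximum recovers the displacement of the target, which is how the tracker localizes the object in each new frame.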

From KCF to Discriminative CF trackers

Basic

• Henriques et al.: CSK
  - raw grayscale pixel values as features
• Henriques et al.: KCF
  - HoG multi-channel features

Further work

• Danelljan et al.: DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al.: SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize scale space and normalize each scale to one size by bilinear interpolation → only one filter on normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al.: SRDCF
  - spatial regularization in the learning process
    → limits boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows using a much larger search region
  - more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al.: Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al.: MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? "The zero-order tracker"

Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution

[Figure: growth of the VOT community over the VOT2013 to VOT2016 submission deadlines]

• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• + VOT-TIR papers (69 and 70 coauthors)

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

          Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013   ranks, A, R        16, s. manual    manual       per frame   27
VOT2014   ranks, A, R, EFO   25, s. manual    manual       per frame   38
VOT2015   EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016   EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR
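The A (accuracy) and R (robustness) measures in the table are built on the per-frame overlap between the predicted and the ground-truth bounding box. The sketch below shows that building block under simplifying assumptions: boxes are axis-aligned `(x, y, width, height)` tuples, a failure is counted whenever the overlap is zero, and the re-initialization after a failure that the VOT toolkit performs is left out. The function names are hypothetical.

```python
def overlap(a, b):
    """Intersection-over-union of two axis-aligned boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))   # width of the intersection
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))   # height of the intersection
    intersection = iw * ih
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Average overlap over tracked frames, and the number of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


ground_truth = [(0, 0, 2, 2), (0, 0, 2, 2), (0, 0, 2, 2)]
predicted = [(0, 0, 2, 2), (1, 1, 2, 2), (10, 10, 2, 2)]   # perfect, partial, lost
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
```

Reporting the two numbers separately keeps a tracker that is precise but often lost distinguishable from one that is imprecise but rarely fails.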

Class of trackers tested

• Single-object, single-camera
• Short-term
  - Trackers performing without re-detection
• Causality
  - Tracker is not allowed to use any future frames
• No prior knowledge about the target
  - Only a single training example: BBox in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

• Sources: ALOV (315 seq.) [Smeulders et al. 2013] + PTR (~50 seq.) [Vojir et al. 2013] + OTB (~50 seq.) [Wu et al. 2013] + >30 new sequences from the VOT2015 committee: 443 sequences
• Filtered out:
  - Grayscale sequences
  - <400 pixels targets
  - Poorly-defined targets
  - Artificially created sequences
• Remaining: 356 sequences, passed to the VOT Automatic Dataset Construction Protocol (cluster + sample)

[Figure examples: poorly defined target; artificially created sequence]

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)

Feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences

Construction (3/3): Sampling

• Requirement:
  - Diverse visual attributes
  - Challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
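The slide does not spell out the sampling strategy, so the following is only a hypothetical, deterministic stand-in that shows how the two requirements can be combined: visiting the attribute clusters in round-robin order keeps the selection diverse, and taking the hardest remaining sequence from each cluster keeps it challenging. The function name, the difficulty scores and the sequence names are made up for the illustration.

```python
def sample_sequences(clusters, budget):
    """Pick `budget` sequence names, cycling over clusters and taking the hardest first.

    `clusters` is a list of clusters; each cluster is a list of (name, difficulty) pairs.
    """
    # Sort every cluster from most to least difficult.
    queues = [sorted(members, key=lambda m: m[1], reverse=True) for members in clusters]
    chosen = []
    while len(chosen) < budget and any(queues):
        for queue in queues:                      # one pick per cluster per round
            if queue and len(chosen) < budget:
                chosen.append(queue.pop(0)[0])
    return chosen


clusters = [
    [("seq_a", 0.9), ("seq_b", 0.1)],
    [("seq_c", 0.5)],
    [("seq_d", 0.2), ("seq_e", 0.7)],
]
selected = sample_sequences(clusters, budget=4)
```

Every cluster contributes at least one sequence before any cluster contributes a second one, so no single type of scene dominates the benchmark.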

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty: better ground truth

• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF

Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (accuracy axis), with C-COT, TCNN, SSAT and MLDF marked]

VOT 2016 Tracking speed

• Top performers are the slowest
  - Plausible cause: CNN
• Real-time bound: Staple+
  - Decent accuracy
  - Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure labels: C-COT, TCNN, SSAT, MLDF, Staple+]

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University, Prague (established 1707)

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

Formulation (4) Tracking

Given an initial estimate of the pose and state of X

In all images in a sequence (in a causal manner)

1 estimate the pose and state of X

2 (optionally) update the model of X

bull Pose any geometric parameter (position scale hellip)

bull State appearance shapesegmentation visibility articulations

bull Model update essentially a semi-supervised learning problem

ndash a priori information (appearance shape dynamics hellip)

ndash labeled data (ldquotrack thisrdquo) + unlabeled data = the sequences

bull Causal for estimation at T use information from time t middot T

25150

Tracking for Black Mirror Blocking

26150

Tracking in 6D

27150

Tracking-Learning-Detection (TLD)

28150

A ldquomiraclerdquo Tracking a Transparent Object

video credit

Helmut

Grabner

H Grabner H Bischof On-line boosting and vision CVPR 2006

29150

Tracking the ldquoInvisiblerdquo

H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010

30150

video

Other Tracking Problems

helliphellip multiple object tracking hellip

32150

Multi-object Tracking

Other Tracking Problems

Cell division

httpwwwyoutubecomwatchv=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster

httpwwwyoutubecomwatchv=YFKA647w4Jg

splitting and merging events hellip

34150

So I want to track

35150

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

What is here in libraries OpenCV

bull KLT tracker (1981)

bull CAMshift and Meanshift (1998)

bull TLD (2011)

bull MedianFLow (2010)

bull Boosting (2006)

bull MIL (2009)

and KCF (2012) in opencv_contrib38150

What is here in libraries BoofCV

bull SparseFlow (KLT tracker) (1991)

bull MeanShift (1998)

bull TLD (2011)

bull KCF (2012)

39150

httpsgithubcomlessthanoptimalBoofCV

Computer vision librar

Bad news OpenCV

40150httpwwwvotchallengenet

Reference implementation

41150

But authors often publish their own implementation on githubhellip

git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit

Good news

42150

Good news Not so

43150

Good news Not so

44150

Good newsEasy to contribute to open source

Just implement some modern tracking algorithm

45150

So we need to understand how tracking works

Classic

KLT tracker

46150

The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

slide creditTomas Svoboda 50150

Original Lucas-Kanade algorithm II

slide creditTomas Svoboda 51150

Original Lucas-Kanade algorithm III

slide creditTomas Svoboda 52150

Original Lucas-Kanade algorithm IV

slide creditTomas Svoboda 53150

Original Lucas-Kanade summary

slide credit Tomas Svoboda 54150

KLT Tracker

For good conditioning patch must be

texturedstructured enough

bull Uniform patch no information

bull Contour element aperture problem (one dimensional

information)

bull Corners blobs and texture best estimate

[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]

slide credit

Patrick Perez

55150

Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

ndash Translation is usually sufficient for small fragments but

bull Perspective transforms and occlusions cause drift and loss

ndash Two complementary options

bull Kill tracklets when minimum SSD too large

bull Compare as well with initial patch under affine transform (warp) assumption

slide credit

Patrick Perez

58150

Characteristics of KLT

bull cost function sum of squared intensity differences

between template and window

bull optimization technique gradient descent

bull model learning no update last frame convex

combination

bull attractive properties

ndashfast

ndasheasily extended to image-to-image transformations with

multiple parameters

59150

What about deep

learning or

CNN for tracking

60150

61150

AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo

Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016

Recap CaffeNet architecture

63

Recent trackers development

74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic

bull Henriques et al ndash CSK

ndash raw grayscale pixel values as features

bull Henriques et al ndash KCF

ndash HoG multi-channel features

Further work

bull Danelljan et al ndash DSST

ndash PCA-HoG + grayscale pixels features

ndash filters for translation and for scale (in the scale-space pyramid)

bull Li et al ndash SAMF

ndash HoG color-naming and grayscale pixels features

ndash quantize scale space and normalize each scale to one size by bilinear

inter rarr only one filter on normalized size

80

Discriminative Correlation Filters Trackers

bull Danelljan et al ndashSRDCF

ndash spatial regularization in the learning process

rarr limits boundary effect

rarr penalize filter coefficients depending on their spatial location

ndash allows to use much larger search region

ndash more discriminative to background (more training data)

CNN-based Correlation Trackers

bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT

2016)

bull Ma et al

ndash features VGG-Net pretrained on ImageNet dataset extracted from

third fourth and fifth convolution layer

ndash for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)

bull Nam et al ndash MDNet T-CNN

ndash CNN classification (3 convolution layers and 2 fully connected layers)

81

Evaluation of Trackers

Tracking Which methods work

83150

Tracking Which methods work

84150

What works ldquoThe zero-order trackerrdquo

85150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

3000

1500

51 Coauthors 14pgs

ICCV2013

57 Coauthors 27pgs

ECCV2014

128 Coauthors 24pgs

ICCV2015

141 Coauthors 44pgs

ECCV2016

+ VOT-TIRpaper

(69 coauth)

+ VOT-TIRpaper

(70 coauth)

VOT2014submission deadline

VOT2015submission deadline

VOT2016submission deadline

VOT2013submission deadline

8639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT challenge evolution

bull Gradual increase of dataset size

bull Gradual refinement of dataset construction

bull Gradual refinement of performance measures

bull Gradual increase of tested trackers

Perf Measures Dataset size Target box Property Trackers tested

VOT2013 ranks A R 16 s manual manual per frame 27

VOT2014 ranks A R EFO 25 s manual manual per frame 38

VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR

VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR

8739

Class of trackers tested

bull Single-object single-camera

bull Short-term

ndashTrackers performing without re-detection

bull Causality

ndashTracker is not allowed to use any future frames

bull No prior knowledge about the target

ndashOnly a single training example ndash BBox in the first frame

bull Object state encoded by a bounding box

88150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (13) Sequence candidates

ALOV (315 seq)[Smeulders et al2013]

Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences

Example Poorly defined target Example Artificially created

356 sequences

PTR (~50 seq)[Vojir et al2013]

+OTB (~50 seq)

[Wu et al2013]

+

gt30 new sequencesfrom VOT2015

committee

+

443sequences

VOT Automatic Dataset Construction Protocol

cluster + sample

8939

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (23) Clustering

bull Approximately annotate targets

bull 11 global attributes estimated

automatically for 356 sequences (eg blur camera motion object motion)

bull Cluster into K = 28 groups (automatic selection of K)

Feature encoding

11 dim

Affinity Propagation [Frey Dueck 2007]

Cluster similar sequences

9039

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (3/3): Sampling

• Requirement:
  – Diverse visual attributes
  – Challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that samples challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
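The slide does not spell out the cost function used for the fit. As an illustration only, the sketch below assumes a cost that penalises object pixels left outside the box (weighted by a hypothetical `alpha`) plus background pixels inside the box, and searches axis-aligned boxes by brute force on a small binary mask. The function name, the weight and the restriction to axis-aligned boxes are all assumptions.

```python
def fit_bbox(mask, alpha=4.0):
    """Brute-force search for the inclusive box (x0, y0, x1, y1) minimising
    alpha * (#object pixels outside the box) + (#background pixels inside the box).
    `mask` is a list of rows containing 0 (background) or 1 (object)."""
    h, w = len(mask), len(mask[0])
    total_fg = sum(sum(row) for row in mask)
    best_cost, best_box = None, None
    for y0 in range(h):
        for y1 in range(y0, h):
            for x0 in range(w):
                for x1 in range(x0, w):
                    fg_in = sum(mask[y][x]
                                for y in range(y0, y1 + 1)
                                for x in range(x0, x1 + 1))
                    area = (y1 - y0 + 1) * (x1 - x0 + 1)
                    cost = alpha * (total_fg - fg_in) + (area - fg_in)
                    if best_cost is None or cost < best_cost:
                        best_cost, best_box = cost, (x0, y0, x1, y1)
    return best_box, best_cost
```

With a weight above 1, a stray segmentation pixel far from the object is cheaper to leave outside than to enclose, so the box stays tight around the main body of the target.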

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix (A_f = 0.33, M_f = 57), Rabbit (A_f = 0.31, M_f = 43), Butterfly (A_f = 0.22, M_f = 45)
• Among the easiest sequences: Singer1 (A_f = 0.02, M_f = 4), Octopus (A_f = 0.01, M_f = 5), Sheep (A_f = 0.02, M_f = 15)

Main novelty: better ground truth
• Each frame manually per-pixel segmented
• Bounding boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF

Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

(Figures: overlap curves; AR-raw plot with accuracy on the vertical axis)

VOT 2016: Tracking speed

• Top performers are the slowest
  – Plausible cause: CNN
• Real-time bound: Staple+
  – Decent accuracy
  – Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it also measured the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

(Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked)

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  – target learning (modelling, "template update")
  – integration of detection and temporal smoothness assumptions
  – representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: a significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

Tracking for Black Mirror Blocking

Tracking in 6D

Tracking-Learning-Detection (TLD)

A "miracle": Tracking a Transparent Object

Video credit: Helmut Grabner
H. Grabner, H. Bischof: On-line boosting and vision. CVPR 2006

Tracking the "Invisible"

H. Grabner, J. Matas, L. Van Gool, P. Cattin: Tracking the invisible: learning where the object might be. CVPR 2010

Other Tracking Problems

… multiple object tracking …

Multi-object Tracking

Other Tracking Problems

Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg

… splitting and merging events …

So I want to track

Just link to some computer vision lib

    import cv2

or

    #include <opencv2/core.hpp>

or

    import org.opencv.core.Core;
    import org.opencv.core.Mat;

What is here in libraries: OpenCV

• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)
and KCF (2012) in opencv_contrib

What is here in libraries: BoofCV

• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)

https://github.com/lessthanoptimal/BoofCV

Computer vision librar

Bad news: OpenCV
http://www.votchallenge.net

Reference implementation

But authors often publish their own implementation on GitHub…

    git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news

Good news? Not so

Good news: Easy to contribute to open source
Just implement some modern tracking algorithm

So we need to understand how tracking works

Classic: KLT tracker

The KLT Tracker
Slide credit: Kris Kitani

KLT Tracker
Slide credit: Tomas Svoboda

Importance in Computer Vision

• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU was dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms

(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment
Slide credit: Tomas Svoboda

The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) can also be a small subwindow within an image.

Original Lucas-Kanade algorithm I
Original Lucas-Kanade algorithm II
Original Lucas-Kanade algorithm III
Original Lucas-Kanade algorithm IV
Original Lucas-Kanade summary
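The alignment idea can be illustrated on a 1-D signal with a translation-only warp. The sketch below is a simplified analogue written for this text, not the formulation on the original slides: it repeatedly linearises the image around the current shift `p` and solves the resulting least-squares problem for an update (a Gauss-Newton step). The helper names, the linear interpolation and the central-difference gradient are assumptions.

```python
def sample(signal, x):
    """Linearly interpolate a 1-D signal at real-valued position x (clamped to the ends)."""
    x = min(max(x, 0.0), len(signal) - 1.0)
    i = min(int(x), len(signal) - 2)
    t = x - i
    return (1.0 - t) * signal[i] + t * signal[i + 1]


def lucas_kanade_1d(template, image, start, p=0.0, iters=50, eps=1e-6):
    """Estimate the shift p such that image[start + i + p] ~= template[i]."""
    for _ in range(iters):
        num = 0.0
        den = 0.0
        for i, t_val in enumerate(template):
            x = start + i + p
            residual = t_val - sample(image, x)
            # Central-difference estimate of the image gradient at the warped position.
            grad = 0.5 * (sample(image, x + 1.0) - sample(image, x - 1.0))
            num += grad * residual
            den += grad * grad
        if den < 1e-12:
            break  # flat patch: the gradient carries no information
        dp = num / den  # least-squares update for the linearised problem
        p += dp
        if abs(dp) < eps:
            break
    return p
```

In 2-D the scalar `den` becomes a 2x2 matrix of gradient products, which is why the texture of the patch decides whether the update is well conditioned.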

KLT Tracker

For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
Slide credit: Patrick Perez
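One common way to quantify "textured enough" is the smaller eigenvalue of the 2x2 matrix of summed gradient products over the patch: it is zero for a uniform patch and for a straight edge, and positive for a corner. The sketch below is a minimal illustration of that test; the central-difference gradients and the function name are assumptions.

```python
import math


def min_eigenvalue(patch):
    """Smaller eigenvalue of the 2x2 matrix [[sum gx*gx, sum gx*gy], [sum gx*gy, sum gy*gy]]
    accumulated over the interior pixels of a patch given as a list of rows."""
    h, w = len(patch), len(patch[0])
    sxx = sxy = syy = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = 0.5 * (patch[y][x + 1] - patch[y][x - 1])
            gy = 0.5 * (patch[y + 1][x] - patch[y - 1][x])
            sxx += gx * gx
            sxy += gx * gy
            syy += gy * gy
    trace = sxx + syy
    det = sxx * syy - sxy * sxy
    # Closed-form eigenvalues of a symmetric 2x2 matrix; clamp to guard against rounding.
    return 0.5 * (trace - math.sqrt(max(trace * trace - 4.0 * det, 0.0)))
```

A feature selector would keep only patches whose value exceeds some threshold, which has to be chosen for the image scale at hand.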

Aperture Problem Example
Image from Gary Bradski's slides

If we look through a small hole…
Video by Olha Mishkina

Multi-resolution Lucas-Kanade

– Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

– Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
– Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez
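The first option can be sketched as a simple threshold on the sum of squared differences (SSD). The version below compares against a reference patch without any affine warp, and the threshold `max_ssd` is a hypothetical parameter that would need tuning per application.

```python
def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches (lists of rows)."""
    return sum((a - b) ** 2
               for row_a, row_b in zip(patch_a, patch_b)
               for a, b in zip(row_a, row_b))


def keep_tracklet(reference_patch, current_patch, max_ssd):
    """Return False when the tracked patch no longer matches the reference well enough."""
    return ssd(reference_patch, current_patch) <= max_ssd
```

Using the previous frame as the reference checks frame-to-frame consistency; using the initial patch guards against slow drift.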

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  – fast
  – easily extended to image-to-image transformations with multiple parameters
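The three model-learning options differ only in how much of the newest patch enters the template. A minimal sketch, assuming flattened patches and a hypothetical mixing parameter `rate`:

```python
def update_template(template, patch, rate):
    """Convex combination of the old template and the newest patch.
    rate = 0 means no update, rate = 1 means 'use the last frame'."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must lie in [0, 1]")
    return [(1.0 - rate) * t + rate * p for t, p in zip(template, patch)]
```

Small rates adapt slowly but resist drift; a rate of 1 adapts instantly and accumulates alignment errors from frame to frame.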

What about deep learning, or CNNs, for tracking?

AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012
CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016

Recap: CaffeNet architecture

Recent trackers development
https://github.com/foolwood/benchmark_results
(Figure labels: CNN, KCF)

Discriminative Tracking (Tracking by Detection)

t = 0: samples with labels +1 +1 +1 −1 −1 −1 → Classifier
t > 0: … classify subwindows to find the target

Connection to Correlation

The Convolution Theorem: "Cross-correlation is equivalent to an element-wise product in the Fourier domain"

$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$

• where
  – $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  – $\times$ is the element-wise product
  – $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
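The statement can be checked numerically on a short 1-D signal. The sketch below uses a naive O(n^2) DFT from the standard library in place of an FFT, and assumes the convention y[s] = sum_k x[k] w[(k + s) mod n] for the cyclic cross-correlation; other texts shift or flip the index differently.

```python
import cmath


def dft(v, inverse=False):
    """Naive O(n^2) discrete Fourier transform; the inverse divides by n."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [c / n for c in out] if inverse else out


def cross_correlation_direct(x, w):
    """Cyclic cross-correlation computed from its definition."""
    n = len(x)
    return [sum(x[k] * w[(k + s) % n] for k in range(n)) for s in range(n)]


def cross_correlation_fourier(x, w):
    """Same result via conj(DFT(x)) * DFT(w), element-wise, then the inverse DFT."""
    y_hat = [a.conjugate() * b for a, b in zip(dft(x), dft(w))]
    return [c.real for c in dft(y_hat, inverse=True)]
```

Both functions wrap around the ends of the signal, which is the cyclic behaviour the note on the slide warns about.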

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied, e.g., to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual-space representation
  $$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$$
• For many kernels: circulant data $\Rightarrow$ circulant $K$ matrix,
  $$K = C(\mathbf{k}^{\mathbf{xx}})$$
  where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields
  $$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$$
  a fast solution in $\mathcal{O}(n \log n)$; typical kernel algorithms are $\mathcal{O}(n^2)$ or higher
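A small numerical check can make the diagonalisation step concrete. The sketch below, for a 1-D signal and a Gaussian kernel with an assumed form exp(-|a - b|^2 / sigma^2), solves for alpha in the Fourier domain and then verifies that (K + lambda I) alpha reproduces y when K is built explicitly as a circulant matrix. The naive DFT is O(n^2) and only stands in for an FFT; all function names are illustrative.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive O(n^2) discrete Fourier transform; the inverse divides by n."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [c / n for c in out] if inverse else out


def gaussian_autocorrelation(x, sigma):
    """k[s] = kappa(x, x shifted cyclically by s) for the Gaussian kernel."""
    n = len(x)
    return [math.exp(-sum((x[i] - x[(i + s) % n]) ** 2 for i in range(n)) / sigma ** 2)
            for s in range(n)]


def solve_fourier(k, y, lam):
    """alpha_hat = y_hat / (k_hat + lambda), transformed back to the signal domain."""
    alpha_hat = [yf / (kf + lam) for yf, kf in zip(dft(y), dft(k))]
    return [v.real for v in dft(alpha_hat, inverse=True)]


def apply_circulant(k, alpha, lam):
    """Compute (K + lambda I) alpha with K[i][j] = k[(j - i) % n], i.e. first row k."""
    n = len(k)
    return [sum(k[(j - i) % n] * alpha[j] for j in range(n)) + lam * alpha[i]
            for i in range(n)]
```

The explicit matrix product costs O(n^2) per application and a direct solve costs more, which is the gap the Fourier-domain division closes.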

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$, with elements
  $$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$$
• For the Gaussian kernel it yields
  $$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$$
  Multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\mathcal{F}^{-1}(\cdot)$ term
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}}\right)$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features

Kernelized Correlation Filters

KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)

Training and detection (Matlab):

    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);  % ridge regression in the Fourier domain
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));      % response over all cyclic shifts
    end

    function k = kernel_correlation(x1, x2, sigma)
      % sum over the channel dimension (3) in the kernel computation
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
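For readers without Matlab, the three functions can be mirrored in plain Python on a 1-D, single-channel signal. This is a sketch for experimentation rather than the authors' implementation: a naive O(n^2) DFT replaces `fft2`, and the test signal, `sigma` and `lam` used with it are arbitrary choices.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive O(n^2) discrete Fourier transform; the inverse divides by n."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [c / n for c in out] if inverse else out


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x1 and every cyclic shift of x2 (1-D, one channel)."""
    n = len(x1)
    prod = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    c = [v.real for v in dft(prod, inverse=True)]
    energy = sum(a * a for a in x1) + sum(b * b for b in x2)
    return [math.exp(-abs(energy - 2.0 * ci) / (sigma ** 2 * n)) for ci in c]


def train(x, y, sigma, lam):
    """Return the classifier coefficients in the Fourier domain."""
    k_hat = dft(kernel_correlation(x, x, sigma))
    return [yf / (kf + lam) for yf, kf in zip(dft(y), k_hat)]


def detect(alphaf, x, z, sigma):
    """Return the real-valued response for every cyclic shift of the search window z."""
    k_hat = dft(kernel_correlation(z, x, sigma))
    return [v.real for v in dft([a * k for a, k in zip(alphaf, k_hat)], inverse=True)]
```

Training with a label vector that peaks at index 0 and detecting on the training signal itself gives a response that peaks at index 0. Detecting on a cyclically shifted copy moves the peak by the same number of positions, in a direction that depends on the argument order inside `kernel_correlation`.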

From KCF to Discriminative CF trackers

Basic
• Henriques et al. – CSK
  – raw grayscale pixel values as features
• Henriques et al. – KCF
  – HoG multi-channel features

Further work
• Danelljan et al. – DSST
  – PCA-HoG + grayscale pixel features
  – filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
  – HoG, color-naming and grayscale pixel features
  – quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. – SRDCF
  – spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  – allows the use of a much larger search region
  – more discriminative to background (more training data)

CNN-based Correlation Trackers
• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  – features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  – for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation-based)
• Nam et al. – MDNet, T-CNN
  – CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? "The zero-order tracker"

VOT community evolution

(Figure: timeline marking the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines)

ICCV 2013: 51 coauthors, 14 pages
ECCV 2014: 57 coauthors, 27 pages
ICCV 2015: 128 coauthors, 24 pages
ECCV 2016: 141 coauthors, 44 pages
Plus two VOT-TIR papers (69 and 70 coauthors)

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures


Tracking in 6D

27150

Tracking-Learning-Detection (TLD)

28150

A ldquomiraclerdquo Tracking a Transparent Object

video credit

Helmut

Grabner

H Grabner H Bischof On-line boosting and vision CVPR 2006

29150

Tracking the ldquoInvisiblerdquo

H Grabner J Matas L Gool P CattinTracking the invisible learning where the object might be CVPR 2010

30150

video

Other Tracking Problems

helliphellip multiple object tracking hellip

32150

Multi-object Tracking

Other Tracking Problems

Cell division

httpwwwyoutubecomwatchv=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster

httpwwwyoutubecomwatchv=YFKA647w4Jg

splitting and merging events hellip

34150

So I want to track

35150

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

What is here in libraries OpenCV

bull KLT tracker (1981)

bull CAMshift and Meanshift (1998)

bull TLD (2011)

bull MedianFLow (2010)

bull Boosting (2006)

bull MIL (2009)

and KCF (2012) in opencv_contrib38150

What is here in libraries BoofCV

bull SparseFlow (KLT tracker) (1991)

bull MeanShift (1998)

bull TLD (2011)

bull KCF (2012)

39150

httpsgithubcomlessthanoptimalBoofCV

Computer vision librar

Bad news OpenCV

40150httpwwwvotchallengenet

Reference implementation

41150

But authors often publish their own implementation on githubhellip

git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit

Good news

42150

Good news Not so

43150

Good news Not so

44150

Good newsEasy to contribute to open source

Just implement some modern tracking algorithm

45150

So we need to understand how tracking works

Classic

KLT tracker

46150

The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

slide creditTomas Svoboda 50150

Original Lucas-Kanade algorithm II

slide creditTomas Svoboda 51150

Original Lucas-Kanade algorithm III

slide creditTomas Svoboda 52150

Original Lucas-Kanade algorithm IV

slide creditTomas Svoboda 53150

Original Lucas-Kanade summary

slide credit Tomas Svoboda 54150

KLT Tracker

For good conditioning patch must be

texturedstructured enough

bull Uniform patch no information

bull Contour element aperture problem (one dimensional

information)

bull Corners blobs and texture best estimate

[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]

slide credit

Patrick Perez

55150

Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

ndash Translation is usually sufficient for small fragments but

bull Perspective transforms and occlusions cause drift and loss

ndash Two complementary options

bull Kill tracklets when minimum SSD too large

bull Compare as well with initial patch under affine transform (warp) assumption

slide credit

Patrick Perez

58150

Characteristics of KLT

bull cost function sum of squared intensity differences

between template and window

bull optimization technique gradient descent

bull model learning no update last frame convex

combination

bull attractive properties

ndashfast

ndasheasily extended to image-to-image transformations with

multiple parameters

59150

What about deep

learning or

CNN for tracking

60150

61150

AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo

Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016

Recap CaffeNet architecture

63

Recent trackers development

74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic

bull Henriques et al ndash CSK

ndash raw grayscale pixel values as features

bull Henriques et al ndash KCF

ndash HoG multi-channel features

Further work

bull Danelljan et al ndash DSST

ndash PCA-HoG + grayscale pixels features

ndash filters for translation and for scale (in the scale-space pyramid)

bull Li et al ndash SAMF

ndash HoG color-naming and grayscale pixels features

ndash quantize scale space and normalize each scale to one size by bilinear

inter rarr only one filter on normalized size

80

Discriminative Correlation Filters Trackers

bull Danelljan et al ndashSRDCF

ndash spatial regularization in the learning process

rarr limits boundary effect

rarr penalize filter coefficients depending on their spatial location

ndash allows to use much larger search region

ndash more discriminative to background (more training data)

CNN-based Correlation Trackers

bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT

2016)

bull Ma et al

ndash features VGG-Net pretrained on ImageNet dataset extracted from

third fourth and fifth convolution layer

ndash for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)

bull Nam et al ndash MDNet T-CNN

ndash CNN classification (3 convolution layers and 2 fully connected layers)

81

Evaluation of Trackers

Tracking Which methods work

83150

Tracking Which methods work

84150

What works ldquoThe zero-order trackerrdquo

85150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

3000

1500

51 Coauthors 14pgs

ICCV2013

57 Coauthors 27pgs

ECCV2014

128 Coauthors 24pgs

ICCV2015

141 Coauthors 44pgs

ECCV2016

+ VOT-TIRpaper

(69 coauth)

+ VOT-TIRpaper

(70 coauth)

VOT2014submission deadline

VOT2015submission deadline

VOT2016submission deadline

VOT2013submission deadline

8639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT challenge evolution

bull Gradual increase of dataset size

bull Gradual refinement of dataset construction

bull Gradual refinement of performance measures

bull Gradual increase of tested trackers

Perf Measures Dataset size Target box Property Trackers tested

VOT2013 ranks A R 16 s manual manual per frame 27

VOT2014 ranks A R EFO 25 s manual manual per frame 38

VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR

VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR

8739

Class of trackers tested

bull Single-object single-camera

bull Short-term

ndashTrackers performing without re-detection

bull Causality

ndashTracker is not allowed to use any future frames

bull No prior knowledge about the target

ndashOnly a single training example ndash BBox in the first frame

bull Object state encoded by a bounding box

88150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (13) Sequence candidates

ALOV (315 seq)[Smeulders et al2013]

Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences

Example Poorly defined target Example Artificially created

356 sequences

PTR (~50 seq)[Vojir et al2013]

+OTB (~50 seq)

[Wu et al2013]

+

gt30 new sequencesfrom VOT2015

committee

+

443sequences

VOT Automatic Dataset Construction Protocol

cluster + sample

8939

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (23) Clustering

bull Approximately annotate targets

bull 11 global attributes estimated

automatically for 356 sequences (eg blur camera motion object motion)

bull Cluster into K = 28 groups (automatic selection of K)

Feature encoding

11 dim

Affinity Propagation [Frey Dueck 2007]

Cluster similar sequences

9039

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (33) Sampling

bull Requirement

bull Diverse visual attributes

bull Challenging subset

bull Global visual attributes computed

bull Tracking difficulty attribute Applied FoT ASMS KCF trackers

bull Developed a sampling strategy that sampled

challenging sequences while keeping the global

attributes diverse

9139

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT201516 dataset 60 sequences

9239

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Object annotation

• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix (A_f = 0.33, M_f = 57), Rabbit (A_f = 0.31, M_f = 43), Butterfly (A_f = 0.22, M_f = 45)
• Among the easiest sequences: Singer1 (A_f = 0.02, M_f = 4), Octopus (A_f = 0.01, M_f = 5), Sheep (A_f = 0.02, M_f = 15)

Main novelty: better ground truth
• Each frame manually per-pixel segmented
• Bounding boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF

Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (vertical axis: accuracy) with C-COT, TCNN, SSAT and MLDF marked]

VOT 2016: Tracking speed

• Top performers are the slowest
  – Plausible cause: CNN
• Real-time bound: Staple+
  – Decent accuracy
  – Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked]

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers
  – target learning (modelling, “template update”)
  – integration of detection and temporal smoothness assumptions
  – representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

Tracking-Learning-Detection (TLD)


A “miracle”: Tracking a Transparent Object

Video credit: Helmut Grabner

H. Grabner, H. Bischof. On-line boosting and vision. CVPR 2006

Tracking the “Invisible”

H. Grabner, J. Matas, L. Van Gool, P. Cattin. Tracking the invisible: learning where the object might be. CVPR 2010

[video]

Other Tracking Problems

… multiple object tracking …

Multi-object Tracking

Other Tracking Problems

Cell division

http://www.youtube.com/watch?v=rgLJrvoX_qo

Three rounds of cell division in Drosophila melanogaster

http://www.youtube.com/watch?v=YFKA647w4Jg

splitting and merging events …

So I want to track


Just link to some computer vision lib

import cv2

or

#include <opencv2/core.hpp>

or

import org.opencv.core.Core;

import org.opencv.core.Mat;

What is here in libraries: OpenCV

• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)

and KCF (2012) in opencv_contrib

What is here in libraries: BoofCV

• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)

https://github.com/lessthanoptimal/BoofCV

Computer vision librar

Bad news: OpenCV

http://www.votchallenge.net

Reference implementation

Good news

But authors often publish their own implementation on GitHub…

git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news? Not so

Good news: Easy to contribute to open source

Just implement some modern tracking algorithm

So we need to understand how tracking works

Classic: KLT tracker

The KLT Tracker

Slide credit: Kris Kitani

KLT Tracker

Slide credit: Tomas Svoboda

Importance in Computer Vision

• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU was dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms

(1) http://www.ces.clemson.edu/~stb/klt/
(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment

Slide credit: Tomas Svoboda

The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.

Original Lucas-Kanade algorithm I

Original Lucas-Kanade algorithm II

Original Lucas-Kanade algorithm III

Original Lucas-Kanade algorithm IV

Original Lucas-Kanade summary

Slide credit (all five slides): Tomas Svoboda

KLT Tracker

For good conditioning, the patch must be textured/structured enough:

• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]

Slide credit: Patrick Perez
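The conditioning statement on this slide can be made concrete with the 2×2 gradient (structure) matrix that appears in the Lucas-Kanade normal equations: its smaller eigenvalue is zero for a uniform patch or a straight edge and clearly positive for a corner. The following is a minimal pure-Python sketch, assuming central-difference gradients and tiny hand-made patches; the function names are illustrative and do not come from any library.

```python
import math

def structure_matrix(patch):
    """Sum gradient outer products over the patch interior.

    Returns (gxx, gxy, gyy), the entries of the symmetric 2x2 matrix
    G = sum([Ix, Iy]^T [Ix, Iy]), using central differences.
    """
    h, w = len(patch), len(patch[0])
    gxx = gxy = gyy = 0.0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            ix = (patch[r][c + 1] - patch[r][c - 1]) / 2.0
            iy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
    return gxx, gxy, gyy

def min_eigenvalue(gxx, gxy, gyy):
    """Smaller eigenvalue of [[gxx, gxy], [gxy, gyy]] in closed form."""
    mean = (gxx + gyy) / 2.0
    radius = math.sqrt(((gxx - gyy) / 2.0) ** 2 + gxy ** 2)
    return mean - radius

n = 7
# Uniform patch: no gradient at all.
uniform = [[5.0] * n for _ in range(n)]
# Vertical step edge: intensity changes only along x (aperture problem).
edge = [[0.0 if c < n // 2 else 1.0 for c in range(n)] for r in range(n)]
# Corner: one bright quadrant, so intensity changes along both x and y.
corner = [[1.0 if (r >= n // 2 and c >= n // 2) else 0.0 for c in range(n)]
          for r in range(n)]

scores = {name: min_eigenvalue(*structure_matrix(p))
          for name, p in [("uniform", uniform), ("edge", edge), ("corner", corner)]}
```

Only the corner patch gets a clearly positive score, which is why corner-like patches are the ones worth tracking with KLT.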

Aperture Problem Example

Image from Gary Bradski's slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

– Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

– Translation is usually sufficient for small fragments, but
  • Perspective transforms and occlusions cause drift and loss
– Two complementary options
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez

Characteristics of KLT

• Cost function: sum of squared intensity differences between template and window
• Optimization technique: gradient descent
• Model learning: no update / last frame / convex combination
• Attractive properties:
  – fast
  – easily extended to image-to-image transformations with multiple parameters
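To illustrate the cost function and the iterative optimization listed above, here is a minimal sketch of translation-only Lucas-Kanade alignment on a one-dimensional signal. It is an assumption-laden toy: it uses a Gauss-Newton update (the linearized least-squares step) rather than plain gradient descent, linear interpolation, and a synthetic Gaussian bump as the "image"; all function names are hypothetical.

```python
import math

def sample(signal, x):
    """Linearly interpolate a 1-D signal at a real-valued position x."""
    x = min(max(x, 0.0), len(signal) - 1.0)
    i = min(int(x), len(signal) - 2)
    t = x - i
    return (1.0 - t) * signal[i] + t * signal[i + 1]

def lucas_kanade_shift(template, image, start, p=0.0, iters=30, tol=1e-6):
    """Estimate p such that image[start + x + p] matches template[x].

    Each iteration linearizes the image around the current p and solves
    the 1-D least-squares problem on the SSD cost for the update.
    """
    for _ in range(iters):
        num = den = 0.0
        for x, t_val in enumerate(template):
            pos = start + x + p
            warped = sample(image, pos)
            # Central difference over one sample approximates dI/dx.
            grad = sample(image, pos + 0.5) - sample(image, pos - 0.5)
            num += grad * (t_val - warped)
            den += grad * grad
        if den < 1e-12:   # uniform patch: no information, stop
            break
        dp = num / den
        p += dp
        if abs(dp) < tol:
            break
    return p

def bump(center, n=60, sigma=5.0):
    """Synthetic 1-D 'image': a Gaussian bump at the given position."""
    return [math.exp(-0.5 * ((i - center) / sigma) ** 2) for i in range(n)]

frame0 = bump(20.0)
frame1 = bump(22.3)                      # same bump moved by 2.3 samples
start, size = 12, 17
template = frame0[start:start + size]    # patch cut from the first frame
estimated = lucas_kanade_shift(template, frame1, start)
flat = lucas_kanade_shift([1.0] * 5, [1.0] * 20, 3)  # uniform patch case
```

With these inputs `estimated` should land close to the true shift of 2.3 samples, while the uniform patch returns the initial guess because the denominator (the 1-D counterpart of the gradient matrix) vanishes.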

What about deep learning or CNN for tracking?

AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”

Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016

Recap: CaffeNet architecture

Recent trackers development

https://github.com/foolwood/benchmark_results

[Figure labels: CNN, KCF]

Discriminative Tracking (T. by Detection)

t = 0: samples with labels +1, +1, +1, −1, −1, −1 are used to train the classifier

t > 0: classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

“Cross-correlation is equivalent to an element-wise product in Fourier domain”

$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$

• where
  – $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  – $\times$ is the element-wise product
  – $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
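The identity on this slide can be checked numerically. The sketch below is a dependency-free illustration, assuming real-valued 1-D signals and a naive O(n²) DFT written out for clarity (a real tracker would use an FFT library); the helper names are made up for this example.

```python
import cmath

def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (fine for a small demo)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def cyclic_cross_correlation(x, w):
    """y[i] = sum_t x[t] * w[(t + i) mod n], computed directly."""
    n = len(x)
    return [sum(x[t] * w[(t + i) % n] for t in range(n)) for i in range(n)]

def cross_correlation_fourier(x, w):
    """Same result via the Fourier domain: y_hat = conj(x_hat) * w_hat."""
    xh, wh = dft(x), dft(w)
    yh = [a.conjugate() * b for a, b in zip(xh, wh)]
    return [v.real for v in dft(yh, inverse=True)]

x = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0, 0.0]
n = len(x)
w = [x[(j - 3) % n] for j in range(n)]   # x cyclically shifted by 3 samples

direct = cyclic_cross_correlation(x, w)
fourier = cross_correlation_fourier(x, w)
peak = max(range(n), key=direct.__getitem__)
```

Both routes give the same vector up to rounding, and the correlation peak sits at the cyclic shift between the two signals, which is the property correlation-filter trackers exploit.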

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression:

$$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$$

  with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual space representation
• For many kernels, circulant data $\Rightarrow$ circulant $K$ matrix:

$$K = C(\mathbf{k}^{\mathbf{xx}})$$

  where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields:

$$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$$

Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:

$$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$$

• For the Gaussian kernel it yields:

$$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$$

  Multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term.
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)

Training and detection (Matlab):

    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      % sum over the channel dimension in the kernel computation
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
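For readers without Matlab, the listing can be mirrored in a few lines of Python. This is a hedged, single-channel, one-dimensional sketch rather than the authors' implementation: it swaps `fft2` for a naive DFT to stay dependency-free, and the test signal, label width, `sigma` and `lam` values are arbitrary choices made for the demo.

```python
import cmath
import math

def dft(x, inverse=False):
    """Naive O(n^2) DFT; stands in for fft/ifft in this small demo."""
    n = len(x)
    sign = 1 if inverse else -1
    out = [sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
               for t in range(n))
           for k in range(n)]
    return [v / n for v in out] if inverse else out

def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x1 and every cyclic shift of x2."""
    n = len(x1)
    prod = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    cross = dft(prod, inverse=True)
    energy = sum(v * v for v in x1) + sum(v * v for v in x2)
    return [math.exp(-abs(energy - 2.0 * c.real) / (sigma ** 2 * n))
            for c in cross]

def train(x, y, sigma, lam):
    """Kernel ridge regression solved element-wise in the Fourier domain."""
    kf = dft(kernel_correlation(x, x, sigma))
    return [yf / (k + lam) for yf, k in zip(dft(y), kf)]

def detect(alphaf, x, z, sigma):
    """Response of the learned classifier for all cyclic shifts of z."""
    kf = dft(kernel_correlation(z, x, sigma))
    resp = dft([a * k for a, k in zip(alphaf, kf)], inverse=True)
    return [v.real for v in resp]

n, true_shift = 32, 5
# Arbitrary textured 1-D "patch" used as the target model.
x = [math.sin(0.9 * t) + 0.5 * math.sin(0.37 * t + 1.0)
     + (1.0 if 10 <= t < 14 else 0.0) for t in range(n)]
# Gaussian regression target peaked at zero shift (wrapped around).
y = [math.exp(-0.5 * (min(t, n - t) / 2.0) ** 2) for t in range(n)]
# New frame: the same patch cyclically moved by true_shift samples.
z = [x[(t - true_shift) % n] for t in range(n)]

alphaf = train(x, y, sigma=0.5, lam=1e-4)
response = detect(alphaf, x, z, sigma=0.5)
peak = max(range(n), key=response.__getitem__)
# With the slide's argument order the peak appears at -shift (mod n).
estimated_shift = (-peak) % n
self_peak = max(range(n), key=detect(alphaf, x, x, 0.5).__getitem__)
```

The sign flip in the last step follows from the argument order `kernel_correlation(z, x, ...)` used on the slide; full tracker implementations hide this bookkeeping when converting the response peak into a displacement.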

From KCF to Discriminative CF trackers

Basic
• Henriques et al. – CSK
  – raw grayscale pixel values as features
• Henriques et al. – KCF
  – HoG multi-channel features

Further work
• Danelljan et al. – DSST
  – PCA-HoG + grayscale pixel features
  – filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
  – HoG, color-naming and grayscale pixel features
  – quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. – SRDCF
  – spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  – allows the use of a much larger search region
  – more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  – features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  – for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al. – MDNet, T-CNN
  – CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? “The zero-order tracker”

VOT community evolution

[Figure: timeline with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines; vertical axis ticks at 1500 and 3000]

ICCV2013: 51 coauthors, 14 pages
ECCV2014: 57 coauthors, 27 pages
ICCV2015: 128 coauthors, 24 pages, plus VOT-TIR paper (69 coauthors)
ECCV2016: 141 coauthors, 44 pages, plus VOT-TIR paper (70 coauthors)

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

           Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013    ranks, A, R        16, s. manual    manual       per frame   27
VOT2014    ranks, A, R, EFO   25, s. manual    manual       per frame   38
VOT2015    EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016    EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR
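The A (accuracy) and R (robustness) columns are usually understood as the average bounding-box overlap while the tracker is on target and the number of tracking failures. A minimal sketch of both follows, under stated assumptions: boxes are axis-aligned (x, y, width, height) tuples, a failure is a frame with zero overlap, and the re-initialization and burn-in handling of the real VOT toolkit is left out. The function names are hypothetical.

```python
def overlap(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = min(ax + aw, bx + bw) - max(ax, bx)
    ih = min(ay + ah, by + bh) - max(ay, by)
    if iw <= 0 or ih <= 0:
        return 0.0                      # boxes do not intersect
    inter = iw * ih
    return inter / (aw * ah + bw * bh - inter)

def accuracy_and_failures(predicted, ground_truth):
    """Average overlap on non-failed frames, and the failure count."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures

ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10),    # perfect match
             (5, 0, 10, 10),    # half the box drifted away
             (20, 20, 10, 10)]  # target lost: zero overlap
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
```

Reporting the two numbers separately keeps a tracker that is precise but often lost from looking the same as one that is sloppy but never fails.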

Class of trackers tested

bull Single-object single-camera

bull Short-term

ndashTrackers performing without re-detection

bull Causality

ndashTracker is not allowed to use any future frames

bull No prior knowledge about the target

ndashOnly a single training example ndash BBox in the first frame

bull Object state encoded by a bounding box

88150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (13) Sequence candidates

ALOV (315 seq)[Smeulders et al2013]

Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences

Example Poorly defined target Example Artificially created

356 sequences

PTR (~50 seq)[Vojir et al2013]

+OTB (~50 seq)

[Wu et al2013]

+

gt30 new sequencesfrom VOT2015

committee

+

443sequences

VOT Automatic Dataset Construction Protocol

cluster + sample

8939

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (23) Clustering

bull Approximately annotate targets

bull 11 global attributes estimated

automatically for 356 sequences (eg blur camera motion object motion)

bull Cluster into K = 28 groups (automatic selection of K)

Feature encoding

11 dim

Affinity Propagation [Frey Dueck 2007]

Cluster similar sequences

9039

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (33) Sampling

bull Requirement

bull Diverse visual attributes

bull Challenging subset

bull Global visual attributes computed

bull Tracking difficulty attribute Applied FoT ASMS KCF trackers

bull Developed a sampling strategy that sampled

challenging sequences while keeping the global

attributes diverse

9139

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT201516 dataset 60 sequences

9239

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Object annotation

bull Automatic bounding box placement

1 Segment the target (semi-automatic)

2 Automatically fit a bounding box by optimizing a cost function

9339

Kristan et al VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Here $A_f$ is the average number of trackers that failed per frame and $M_f$ is the maximum number of trackers that failed at a single frame.

Main novelty: better ground truth
• Each frame manually per-pixel segmented
• Bounding boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF

Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (accuracy axis) with C-COT, TCNN, SSAT and MLDF marked]
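The accuracy and robustness behind an AR plot can be sketched in a few lines. This is a simplified illustration under stated assumptions: boxes are axis-aligned `(x, y, w, h)` tuples, a failure is a frame with zero overlap, and accuracy is the mean overlap over the remaining frames. The real VOT toolkit also re-initializes the tracker after each failure and skips some frames after re-initialization, which is not modelled here. The function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_robustness(predicted, ground_truth):
    """Return (mean overlap on tracked frames, number of failures)."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
accuracy, failures = accuracy_robustness(predicted, ground_truth)
```

In the example the second frame overlaps by one third and the third frame misses the target entirely, so it counts as a failure rather than lowering the accuracy.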


VOT 2016: Tracking speed

• Top performers are the slowest
  • Plausible cause: CNN
• Real-time bound: Staple+
  • Decent accuracy
  • Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it also measured the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked]

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  – target learning (modelling, “template update”)
  – integration of detection and temporal smoothness assumptions
  – representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)

Area of expertise: computer vision, image processing, pattern recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & excellence: a significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

Tracking the “Invisible”

H. Grabner, J. Matas, L. Van Gool, P. Cattin: Tracking the invisible: learning where the object might be. CVPR 2010.

[video]

Other Tracking Problems

… multiple object tracking …

Multi-object Tracking

Other Tracking Problems

Cell division
http://www.youtube.com/watch?v=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster
http://www.youtube.com/watch?v=YFKA647w4Jg

… splitting and merging events …

So I want to track


Just link to some computer vision lib

import cv2

or

#include <opencv2/core.hpp>

or

import org.opencv.core.Core;
import org.opencv.core.Mat;


What is here in libraries: OpenCV

• KLT tracker (1981)
• CAMshift and Meanshift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)

and KCF (2012) in opencv_contrib

What is here in libraries: BoofCV

• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)

https://github.com/lessthanoptimal/BoofCV

Computer vision librar

Bad news: OpenCV

http://www.votchallenge.net

Reference implementation

But authors often publish their own implementation on GitHub…

git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news

Good news? Not so

Good news: easy to contribute to open source

Just implement some modern tracking algorithm

So we need to understand how tracking works

Classic: KLT tracker

The KLT Tracker

Slide credit: Kris Kitani

KLT Tracker

Slide credit: Tomas Svoboda

Importance in Computer Vision

• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms

(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment

Slide credit: Tomas Svoboda

The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.

Original Lucas-Kanade algorithm I
Original Lucas-Kanade algorithm II
Original Lucas-Kanade algorithm III
Original Lucas-Kanade algorithm IV
Original Lucas-Kanade summary

Slide credit: Tomas Svoboda
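The alignment idea can be tried in one dimension. The sketch below is an illustration under assumptions, not the slides' derivation: the warp is a pure translation p, the image is a smooth synthetic function so no interpolation is needed, and each step solves the linearized least-squares problem for the sum of squared differences between I(x + p) and T(x) (a Gauss-Newton update). The names `lucas_kanade_1d`, `image` and `template` are hypothetical.

```python
import math


def lucas_kanade_1d(image, template, xs, p=0.0, iters=20, eps=1e-3):
    """Estimate the shift p minimizing sum over xs of (image(x + p) - template(x))**2."""
    for _ in range(iters):
        num = 0.0
        den = 0.0
        for x in xs:
            # Image gradient at the warped position (central difference).
            grad = (image(x + p + eps) - image(x + p - eps)) / (2.0 * eps)
            err = template(x) - image(x + p)
            num += grad * err
            den += grad * grad
        if den == 0.0:
            break  # uniform signal: the gradient carries no information
        dp = num / den  # Gauss-Newton step for a single parameter
        p += dp
        if abs(dp) < 1e-9:
            break
    return p


def image(x):
    """Smooth synthetic 1-D 'image': a Gaussian bump centred at x = 10."""
    return math.exp(-0.5 * ((x - 10.0) / 3.0) ** 2)


true_shift = 1.5


def template(x):
    """Template cut from the image at an offset the tracker has to recover."""
    return image(x + true_shift)


xs = [0.5 * i for i in range(41)]  # sample positions 0, 0.5, ..., 20
estimate = lucas_kanade_1d(image, template, xs)
```

The `den == 0.0` branch is the one-dimensional version of the conditioning problem: without image gradient inside the window, the update is undefined.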

KLT Tracker

For good conditioning, the patch must be textured/structured enough:
• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]

Slide credit: Patrick Perez
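The three cases can be checked numerically with the 2x2 matrix of summed gradient products, whose smaller eigenvalue measures how well conditioned the patch is. This is a minimal sketch assuming central-difference gradients and tiny synthetic patches; `min_eigenvalue` is a hypothetical helper, not a library call.

```python
import math


def min_eigenvalue(patch):
    """Smaller eigenvalue of the 2x2 gradient (structure) matrix of a patch."""
    h, w = len(patch), len(patch[0])
    sxx = sxy = syy = 0.0
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (patch[y][x + 1] - patch[y][x - 1]) / 2.0
            gy = (patch[y + 1][x] - patch[y - 1][x]) / 2.0
            sxx += gx * gx
            sxy += gx * gy
            syy += gy * gy
    trace = sxx + syy
    det = sxx * syy - sxy * sxy
    disc = math.sqrt(max(0.0, trace * trace / 4.0 - det))
    return trace / 2.0 - disc


n = 8
half = n // 2
uniform = [[1.0] * n for _ in range(n)]
edge = [[1.0 if x >= half else 0.0 for x in range(n)] for _ in range(n)]
corner = [[1.0 if (x >= half and y >= half) else 0.0 for x in range(n)]
          for y in range(n)]

scores = {name: min_eigenvalue(p)
          for name, p in (("uniform", uniform), ("edge", edge), ("corner", corner))}
```

The uniform patch and the straight edge both score zero (no information, or information in one direction only), while the corner gives a clearly positive value, so only the corner constrains a 2-D translation.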

Aperture Problem Example

Image from Gary Bradski slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

– Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

– Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
– Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  – fast
  – easily extended to image-to-image transformations with multiple parameters

What about deep learning or CNN for tracking?

Recap: CaffeNet architecture

AlexNet (original): Krizhevsky et al., ImageNet Classification with Deep Convolutional Neural Networks, 2012
CaffeNet: Jia et al., Caffe: Convolutional Architecture for Fast Feature Embedding, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016

Recent trackers development

https://github.com/foolwood/benchmark_results

[Figure labels: CNN, KCF]

Discriminative Tracking (Tracking by Detection)

t = 0: samples with labels +1 +1 +1 −1 −1 −1 → Classifier

t > 0: … classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

“Cross-correlation is equivalent to an element-wise product in the Fourier domain”

$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$

• where
  – $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  – $\times$ is the element-wise product
  – $*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
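The statement is easy to verify numerically. The sketch below uses a naive O(n^2) DFT written with the standard library as a stand-in for an FFT, and assumes real-valued 1-D signals with the convention y[i] = sum_k x[k] w[(k + i) mod n]; the function names are illustrative.

```python
import cmath


def dft(v, inverse=False):
    """Naive discrete Fourier transform (stand-in for an FFT)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [c / n for c in out] if inverse else out


def circular_cross_correlation(x, w):
    """Direct cyclic cross-correlation: y[i] = sum_k x[k] * w[(k + i) % n]."""
    n = len(x)
    return [sum(x[k] * w[(k + i) % n] for k in range(n)) for i in range(n)]


def correlation_via_dft(x, w):
    """Same result through the Fourier domain: y_hat = conj(x_hat) * w_hat."""
    y_hat = [a.conjugate() * b for a, b in zip(dft(x), dft(w))]
    return [c.real for c in dft(y_hat, inverse=True)]


x = [1.0, 2.0, 3.0, 4.0]
w = [0.0, 1.0, 0.5, 0.0]
direct = circular_cross_correlation(x, w)
fourier = correlation_via_dft(x, w)
```

Both routes give the same vector up to rounding, including the wrapped terms at the ends, which is the cyclic behaviour noted above.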

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression:

$$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$$

  with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual space representation
• For many kernels, circulant data $\Rightarrow$ circulant $K$ matrix:

$$K = C(\mathbf{k}^{\mathbf{xx}})$$

  where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields

$$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$$

$\Rightarrow$ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx'}}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x'}$, with elements

$$k_i^{\mathbf{xx'}} = \kappa(\mathbf{x'}, P^{i-1}\mathbf{x})$$

• For the Gaussian kernel it yields

$$\mathbf{k}^{\mathbf{xx'}} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x'}\|^2 - 2\,\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$$

  Multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\mathcal{F}^{-1}$ term.
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}}\right)$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features
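The training and evaluation formulas can be exercised end to end on a 1-D, single-channel toy signal. The sketch below is an assumption-laden illustration, not the authors' implementation: it uses a naive DFT from the standard library instead of an FFT, a Gaussian regression target peaked at index 0, and hypothetical parameter values; `update` and its `rate` argument are illustrative names for the linear interpolation step.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [c / n for c in out] if inverse else out


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of x1 with every cyclic shift of x2."""
    n = len(x1)
    prod = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    c = [v.real for v in dft(prod, inverse=True)]
    energy = sum(v * v for v in x1) + sum(v * v for v in x2)
    return [math.exp(-abs(energy - 2.0 * ci) / (sigma ** 2 * n)) for ci in c]


def train(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat^{xx} + lambda), element-wise."""
    k_hat = dft(kernel_correlation(x, x, sigma))
    return [yf / (kf + lam) for yf, kf in zip(dft(y), k_hat)]


def detect(alpha_hat, x, z, sigma):
    """Response for all cyclic shifts: F^-1(k_hat^{xz} * alpha_hat)."""
    k_hat = dft(kernel_correlation(z, x, sigma))
    return [v.real for v in dft([a * k for a, k in zip(alpha_hat, k_hat)], inverse=True)]


def update(old, new, rate):
    """Linear interpolation between the previous and the new model."""
    return [(1.0 - rate) * o + rate * v for o, v in zip(old, new)]


n = 16
x = [math.exp(-0.5 * ((k - 8) / 1.5) ** 2) for k in range(n)]  # model patch
y = [math.exp(-0.5 * min(k, n - k) ** 2) for k in range(n)]    # target, peak at 0
alpha_hat = train(x, y, sigma=0.2, lam=1e-4)

shift = 3
z = x[shift:] + x[:shift]  # the same patch, cyclically rotated by `shift` samples
response = detect(alpha_hat, x, z, sigma=0.2)
peak = max(range(n), key=response.__getitem__)
```

With the conjugation convention used here, the response peak lands at index `shift`, so the displacement is read directly from the position of the maximum; a tracker would then retrain at that location and blend the old and new models with `update`.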

Kernelized Correlation Filters

KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)

Training and detection (Matlab):

    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);   % element-wise division
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));       % response for all cyclic shifts
    end

    function k = kernel_correlation(x1, x2, sigma)
      % sum over the channel dimension (3) in the kernel computation
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end

From KCF to Discriminative CF trackers

Basic
• Henriques et al. – CSK
  – raw grayscale pixel values as features
• Henriques et al. – KCF
  – HoG multi-channel features

Further work
• Danelljan et al. – DSST
  – PCA-HoG + grayscale pixel features
  – filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
  – HoG, color-naming and grayscale pixel features
  – quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. – SRDCF
  – spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  – allows the use of a much larger search region
  – more discriminative to background (more training data)

CNN-based Correlation Trackers
• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  – features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  – for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)
• Nam et al. – MDNet, T-CNN
  – CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? “The zero-order tracker”

VOT community evolution

• VOT2013 (ICCV 2013): 51 coauthors, 14 pages
• VOT2014 (ECCV 2014): 57 coauthors, 27 pages
• VOT2015 (ICCV 2015): 128 coauthors, 24 pages, + VOT-TIR paper (69 coauthors)
• VOT2016 (ECCV 2016): 141 coauthors, 44 pages, + VOT-TIR paper (70 coauthors)

[Figure: timeline with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked; vertical axis ticks at 1500 and 3000]

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

          Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013   ranks, A, R        16, s. manual    manual       per frame   27
VOT2014   ranks, A, R, EFO   25, s. manual    manual       per frame   38
VOT2015   EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016   EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR

Class of trackers tested

• Single-object, single-camera
• Short-term
  – Trackers performing without re-detection
• Causality
  – Tracker is not allowed to use any future frames
• No prior knowledge about the target
  – Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

ALOV (315 seq) [Smeulders et al. 2013]


video

Other Tracking Problems

helliphellip multiple object tracking hellip

32150

Multi-object Tracking

Other Tracking Problems

Cell division

httpwwwyoutubecomwatchv=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster

httpwwwyoutubecomwatchv=YFKA647w4Jg

splitting and merging events hellip

34150

So I want to track

35150

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

What is here in libraries OpenCV

bull KLT tracker (1981)

bull CAMshift and Meanshift (1998)

bull TLD (2011)

bull MedianFLow (2010)

bull Boosting (2006)

bull MIL (2009)

and KCF (2012) in opencv_contrib38150

What is here in libraries BoofCV

bull SparseFlow (KLT tracker) (1991)

bull MeanShift (1998)

bull TLD (2011)

bull KCF (2012)

39150

httpsgithubcomlessthanoptimalBoofCV

Computer vision librar

Bad news OpenCV

40150httpwwwvotchallengenet

Reference implementation

41150

But authors often publish their own implementation on githubhellip

git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit

Good news

42150

Good news Not so

43150

Good news Not so

44150

Good newsEasy to contribute to open source

Just implement some modern tracking algorithm

45150

So we need to understand how tracking works

Classic

KLT tracker

46150

The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

slide creditTomas Svoboda 50150

Original Lucas-Kanade algorithm II

slide creditTomas Svoboda 51150

Original Lucas-Kanade algorithm III

slide creditTomas Svoboda 52150

Original Lucas-Kanade algorithm IV

slide creditTomas Svoboda 53150

Original Lucas-Kanade summary

slide credit Tomas Svoboda 54150

KLT Tracker

For good conditioning patch must be

texturedstructured enough

bull Uniform patch no information

bull Contour element aperture problem (one dimensional

information)

bull Corners blobs and texture best estimate

[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]

slide credit

Patrick Perez

55150

Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

ndash Translation is usually sufficient for small fragments but

bull Perspective transforms and occlusions cause drift and loss

ndash Two complementary options

bull Kill tracklets when minimum SSD too large

bull Compare as well with initial patch under affine transform (warp) assumption

slide credit

Patrick Perez

58150

Characteristics of KLT

bull cost function sum of squared intensity differences

between template and window

bull optimization technique gradient descent

bull model learning no update last frame convex

combination

bull attractive properties

ndashfast

ndasheasily extended to image-to-image transformations with

multiple parameters

59150

What about deep

learning or

CNN for tracking

60150

61150

AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo

Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016

Recap CaffeNet architecture

63

Recent trackers development

74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic

bull Henriques et al ndash CSK

ndash raw grayscale pixel values as features

bull Henriques et al ndash KCF

ndash HoG multi-channel features

Further work

bull Danelljan et al ndash DSST

ndash PCA-HoG + grayscale pixels features

ndash filters for translation and for scale (in the scale-space pyramid)

bull Li et al ndash SAMF

ndash HoG color-naming and grayscale pixels features

ndash quantize scale space and normalize each scale to one size by bilinear

inter rarr only one filter on normalized size

80

Discriminative Correlation Filters Trackers

bull Danelljan et al ndashSRDCF

ndash spatial regularization in the learning process

rarr limits boundary effect

rarr penalize filter coefficients depending on their spatial location

ndash allows to use much larger search region

ndash more discriminative to background (more training data)

CNN-based Correlation Trackers

bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT

2016)

bull Ma et al

ndash features VGG-Net pretrained on ImageNet dataset extracted from

third fourth and fifth convolution layer

ndash for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)

bull Nam et al ndash MDNet T-CNN

ndash CNN classification (3 convolution layers and 2 fully connected layers)

81

Evaluation of Trackers

Tracking Which methods work

83150

Tracking Which methods work

84150

What works ldquoThe zero-order trackerrdquo

85150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

3000

1500

51 Coauthors 14pgs

ICCV2013

57 Coauthors 27pgs

ECCV2014

128 Coauthors 24pgs

ICCV2015

141 Coauthors 44pgs

ECCV2016

+ VOT-TIRpaper

(69 coauth)

+ VOT-TIRpaper

(70 coauth)

VOT2014submission deadline

VOT2015submission deadline

VOT2016submission deadline

VOT2013submission deadline

8639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT challenge evolution

bull Gradual increase of dataset size

bull Gradual refinement of dataset construction

bull Gradual refinement of performance measures

bull Gradual increase of tested trackers

Perf Measures Dataset size Target box Property Trackers tested

VOT2013 ranks A R 16 s manual manual per frame 27

VOT2014 ranks A R EFO 25 s manual manual per frame 38

VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR

VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR

8739

Class of trackers tested

bull Single-object single-camera

bull Short-term

ndashTrackers performing without re-detection

bull Causality

ndashTracker is not allowed to use any future frames

bull No prior knowledge about the target

ndashOnly a single training example ndash BBox in the first frame

bull Object state encoded by a bounding box

88150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (13) Sequence candidates

ALOV (315 seq)[Smeulders et al2013]

Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences

Example Poorly defined target Example Artificially created

356 sequences

PTR (~50 seq)[Vojir et al2013]

+OTB (~50 seq)

[Wu et al2013]

+

gt30 new sequencesfrom VOT2015

committee

+

443sequences

VOT Automatic Dataset Construction Protocol

cluster + sample

8939

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (23) Clustering

bull Approximately annotate targets

bull 11 global attributes estimated

automatically for 356 sequences (eg blur camera motion object motion)

bull Cluster into K = 28 groups (automatic selection of K)

Feature encoding

11 dim

Affinity Propagation [Frey Dueck 2007]

Cluster similar sequences

9039

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (33) Sampling

bull Requirement

bull Diverse visual attributes

bull Challenging subset

bull Global visual attributes computed

bull Tracking difficulty attribute Applied FoT ASMS KCF trackers

bull Developed a sampling strategy that sampled

challenging sequences while keeping the global

attributes diverse

9139

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT201516 dataset 60 sequences

9239

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Object annotation

bull Automatic bounding box placement

1 Segment the target (semi-automatic)

2 Automatically fit a bounding box by optimizing a cost function

9339

Kristan et al VOT2016 results

Sequence ranking

bull Among the most challenging sequences

bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)

9442

Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)

Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT Results Realtime

bull Flow-based Mean Shift-based Correlation filters

bull Engineering use of basic features

2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)

2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)

ASMSBDF

FoT

2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)

PLTFoT

CCMS

9639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Results

bull C-COT slightly ahead of TCNN

bull Most accurate SSAT

bull Most robust C-COT and MLDF

Overlap curves

9742

(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple

C-COT

TCNNSSAT

MLDF

AR-raw plot

Acc

ura

cy

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Tracking speed

• Top performers are the slowest

• Plausible cause: CNN

• Real-time bound: Staple+

– Decent accuracy

– Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it also measured the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked]

VOT public resources

• Resources publicly available: VOT page

– Raw results of all tested trackers

– Relevant methodology papers

– 2016: submitted trackers code/binaries

– All fully annotated datasets (2013-2016)

– Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems

• The area is just starting to be affected by CNNs

• Robustness at all levels is the road to reliable performance

• Key components of trackers

– target learning (modelling, "template update")

– integration of detection and temporal smoothness assumptions

– representation of the image and target

• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)

• http://cs.brown.edu/courses/cs143/

• https://www.udacity.com/course/introduction-to-computer-vision--ud810

• http://cs231n.stanford.edu/

• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception

Department of Cybernetics

Faculty of Electrical Engineering

Czech Technical University Prague

established 1707

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

Other Tracking Problems

… multiple object tracking …

Multi-object Tracking

Other Tracking Problems

Cell division

http://www.youtube.com/watch?v=rgLJrvoX_qo

Three rounds of cell division in Drosophila melanogaster

http://www.youtube.com/watch?v=YFKA647w4Jg

splitting and merging events …

So I want to track

Just link to some computer vision lib

import cv2

or

#include <opencv2/core.hpp>

or

import org.opencv.core.Core;

import org.opencv.core.Mat;

What is there in libraries: OpenCV

• KLT tracker (1981)

• CAMshift and Meanshift (1998)

• TLD (2011)

• MedianFlow (2010)

• Boosting (2006)

• MIL (2009)

and KCF (2012) in opencv_contrib

What is there in libraries: BoofCV

• SparseFlow (KLT tracker) (1991)

• MeanShift (1998)

• TLD (2011)

• KCF (2012)

https://github.com/lessthanoptimal/BoofCV

Bad news: OpenCV

http://www.votchallenge.net

Reference implementation

But authors often publish their own implementation on GitHub…

git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news

Good news? Not so

Good news: easy to contribute to open source

Just implement some modern tracking algorithm

So we need to understand how tracking works

Classic

KLT tracker

The KLT Tracker

Slide credit: Kris Kitani

KLT Tracker

Slide credit: Tomas Svoboda

Importance in Computer Vision

• First published in 1981 as an image registration method [3]

• Improved many times, most importantly by Carlo Tomasi [5, 4]

• Free implementation(s) available (1)

• After more than two decades, a project (2) at CMU dedicated to this single algorithm, with results published in a premium journal [1]

• Part of a plethora of computer vision algorithms

(1) http://www.ces.clemson.edu/~stb/klt

(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment

Slide credit: Tomas Svoboda

The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.

Original Lucas-Kanade algorithm I

Original Lucas-Kanade algorithm II

Original Lucas-Kanade algorithm III

Original Lucas-Kanade algorithm IV

Original Lucas-Kanade summary

Slide credit (all five slides): Tomas Svoboda

KLT Tracker

For good conditioning, the patch must be textured/structured enough

• Uniform patch: no information

• Contour element: aperture problem (one-dimensional information)

• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]

Slide credit: Patrick Perez
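The three cases above can be illustrated with the eigenvalues of the 2x2 gradient matrix summed over a patch, which is the matrix that has to be inverted in a translational Lucas-Kanade step. The following is a minimal sketch, assuming NumPy and tiny synthetic 9x9 patches; the function name `structure_tensor_eigenvalues` and the patches are illustrative, not taken from the slides.

```python
import numpy as np

def structure_tensor_eigenvalues(patch):
    """Eigenvalues (ascending) of the summed 2x2 gradient matrix of a patch."""
    grad_y, grad_x = np.gradient(patch.astype(float))  # axis 0 = rows (y), axis 1 = columns (x)
    tensor = np.array([
        [np.sum(grad_x * grad_x), np.sum(grad_x * grad_y)],
        [np.sum(grad_x * grad_y), np.sum(grad_y * grad_y)],
    ])
    return np.linalg.eigvalsh(tensor)

uniform = np.ones((9, 9))        # no texture: no information
edge = np.zeros((9, 9))
edge[:, 5:] = 1.0                # vertical contour: aperture problem
corner = np.zeros((9, 9))
corner[5:, 5:] = 1.0             # corner: gradients in both directions

for name, patch in [("uniform", uniform), ("edge", edge), ("corner", corner)]:
    print(name, structure_tensor_eigenvalues(patch))
```

A patch is well conditioned for tracking only when the smaller eigenvalue is clearly above zero, which in this toy setup holds for the corner but not for the uniform patch or the edge.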

Aperture Problem Example

Image from Gary Bradski's slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

– Arbitrary displacement

• Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

– Translation is usually sufficient for small fragments, but:

• Perspective transforms and occlusions cause drift and loss

– Two complementary options:

• Kill tracklets when the minimum SSD is too large

• Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window

• optimization technique: gradient descent

• model learning: no update / last frame / convex combination

• attractive properties:

– fast

– easily extended to image-to-image transformations with multiple parameters
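The cost function and optimization listed above can be sketched for the simplest case, a pure translation. This is a hedged illustration rather than the implementation from the slides: it assumes NumPy, uses a smooth synthetic brightness pattern (`scene`, a made-up function) that can be sampled at sub-pixel positions, and iterates Gauss-Newton updates on the sum of squared differences.

```python
import numpy as np

def scene(xs, ys):
    """Smooth synthetic brightness pattern (illustrative), defined at any sub-pixel position."""
    return np.sin(0.3 * xs) * np.cos(0.2 * ys) + 0.5 * np.sin(0.1 * xs + 0.25 * ys)

def lucas_kanade_translation(template, sample_image, xs, ys, n_iters=20):
    """Estimate p minimizing sum((template(x) - image(x + p))**2) over a translation p."""
    p = np.zeros(2)
    for _ in range(n_iters):
        warped = sample_image(xs + p[0], ys + p[1])
        grad_y, grad_x = np.gradient(warped)             # image gradients of the warped window
        jacobian = np.stack([grad_x.ravel(), grad_y.ravel()], axis=1)
        residual = (template - warped).ravel()
        hessian = jacobian.T @ jacobian                  # 2x2, must be well conditioned
        p += np.linalg.solve(hessian, jacobian.T @ residual)
    return p

ys, xs = np.mgrid[0:21, 0:21].astype(float)
true_shift = np.array([0.8, -0.5])                       # assumed ground-truth displacement (x, y)
template = scene(xs, ys)

def moved(qx, qy):
    """The same pattern displaced by true_shift."""
    return scene(qx - true_shift[0], qy - true_shift[1])

estimate = lucas_kanade_translation(template, moved, xs, ys)
print(estimate)
```

The 2x2 `hessian` is the same gradient matrix whose conditioning decides whether a patch is trackable; for larger displacements the same update would be run coarse to fine on an image pyramid.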

What about deep learning or CNN for tracking?

AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012

CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014

Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"

Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016

Recap: CaffeNet architecture

Recent trackers development

https://github.com/foolwood/benchmark_results

[Figure labels: CNN, KCF]

Discriminative Tracking (T. by Detection)

t = 0: samples, labels (+1 +1 +1 −1 −1 −1), classifier

t > 0: … classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

"Cross-correlation is equivalent to an element-wise product in the Fourier domain"

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$

• where

– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)

– $\times$ is the element-wise product

– $*$ is the complex conjugate (i.e. negate the imaginary part)

• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
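The identity can be checked numerically on a short 1-D signal. The sketch below assumes NumPy and the convention y[k] = sum over n of x[n] w[n + k] with cyclic indexing; the function names are illustrative.

```python
import numpy as np

def circular_cross_correlation_direct(x, w):
    """y[k] = sum_n x[n] * w[(n + k) mod N], computed in O(N^2)."""
    n = len(x)
    return np.array([sum(x[i] * w[(i + k) % n] for i in range(n)) for k in range(n)])

def circular_cross_correlation_fft(x, w):
    """Same result via the DFT: y_hat = conj(x_hat) * w_hat, in O(N log N)."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w = rng.standard_normal(8)
direct = circular_cross_correlation_direct(x, w)
fast = circular_cross_correlation_fft(x, w)

# Correlating a signal with a cyclically shifted copy of itself peaks at the shift.
shifted = np.roll(x, 3)
peak = int(np.argmax(circular_cross_correlation_fft(x, shifted)))
print(np.allclose(direct, fast), peak)
```

Because the correlation is cyclic, a target near the window border wraps around to the opposite side, which is the boundary effect that later correlation-filter trackers try to limit.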

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows replacing standard operations with fast Fourier operations

• The same idea can be applied, e.g., to Kernel Ridge Regression, with kernel matrix $K$, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and dual-space representation

$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$

• For many kernels, circulant data $\Rightarrow$ circulant $K$ matrix:

$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields

$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$

$\Rightarrow$ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
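The equivalence between the dense solve and the Fourier-domain division can be verified on a small 1-D example. This is a sketch under stated assumptions: NumPy, a linear kernel (so the kernel auto-correlation is the plain cyclic auto-correlation of x), and training samples formed by all cyclic shifts of x; the function names are illustrative.

```python
import numpy as np

def train_direct(x, y, lam):
    """alpha = (K + lam*I)^-1 y with K built explicitly from all cyclic shifts of x."""
    n = len(x)
    shifts = np.stack([np.roll(x, i) for i in range(n)])  # one training sample per row
    kernel_matrix = shifts @ shifts.T                      # linear kernel, circulant
    return np.linalg.solve(kernel_matrix + lam * np.eye(n), y)

def train_fft(x, y, lam):
    """alpha_hat = y_hat / (k_hat + lam), where k is the first row of K."""
    x_hat = np.fft.fft(x)
    k_hat = np.conj(x_hat) * x_hat                         # DFT of the linear-kernel auto-correlation
    return np.real(np.fft.ifft(np.fft.fft(y) / (k_hat + lam)))

rng = np.random.default_rng(1)
x = rng.standard_normal(16)
y = rng.standard_normal(16)
alpha_direct = train_direct(x, y, lam=0.1)
alpha_fft = train_fft(x, y, lam=0.1)
print(np.allclose(alpha_direct, alpha_fft))
```

The direct version builds and solves an n x n system, while the Fourier version only needs the first row of K and a few FFTs of length n.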

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$

• For a Gaussian kernel it yields

$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\dfrac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:

1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$

2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}}\right)$

• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$

• Kernel allows integration of more complex and multi-channel features

$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$

Multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in this term.

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters

• code fits on one slide of the presentation

• uses HoG features (32 channels)

• ~300 FPS

• Open-Source (Matlab/Python/Java/C)

Training and detection (Matlab); the sum over the channel dimension is in the kernel computation:

function alphaf = train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);  % element-wise division in the Fourier domain
end

function y = detect(alphaf, x, z, sigma)
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));      % response map over all cyclic shifts
end

function k = kernel_correlation(x1, x2, sigma)
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));  % sum over the channel dimension
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end
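For readers without Matlab, the same three functions can be sketched in Python. This is a hedged port, assuming NumPy, feature arrays of shape height x width x channels, and random features standing in for HoG; `gaussian_labels` and the test setup are illustrative additions, and the model update by linear interpolation is left out.

```python
import numpy as np

def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation between x1 and all cyclic shifts of x2 (H x W x C arrays)."""
    x1f = np.fft.fft2(x1, axes=(0, 1))
    x2f = np.fft.fft2(x2, axes=(0, 1))
    c = np.fft.ifft2(np.sum(np.conj(x1f) * x2f, axis=2))   # sum over the channel dimension
    d = np.vdot(x1, x1) + np.vdot(x2, x2) - 2 * c
    return np.exp(-np.abs(d) / (sigma ** 2 * d.size))

def train(x, y, sigma, lam):
    k = kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alphaf, x, z, sigma):
    k = kernel_correlation(z, x, sigma)
    return np.real(np.fft.ifft2(alphaf * np.fft.fft2(k)))

def gaussian_labels(height, width, bandwidth):
    """Regression target peaked at the (cyclic) origin; an illustrative choice."""
    rows = np.minimum(np.arange(height), height - np.arange(height))
    cols = np.minimum(np.arange(width), width - np.arange(width))
    return np.exp(-(rows[:, None] ** 2 + cols[None, :] ** 2) / (2 * bandwidth ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 2))                       # stand-in for multi-channel features
y = gaussian_labels(32, 32, bandwidth=2.0)
alphaf = train(x, y, sigma=0.5, lam=1e-4)

true_shift = (3, 5)
z = np.roll(x, true_shift, axis=(0, 1))                    # same patch, cyclically displaced
response = detect(alphaf, x, z, sigma=0.5)
peak = np.unravel_index(np.argmax(response), response.shape)
# With this conj(fft2(z)) * fft2(x) convention the peak sits at minus the displacement.
estimated_shift = tuple(int((-p) % n) for p, n in zip(peak, response.shape))
print(estimated_shift)
```

In a tracker, `z` would be the features of the search window in the next frame, and the peak of `response` gives the target displacement up to the sign convention noted in the comment.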

From KCF to Discriminative CF trackers

Basic

• Henriques et al.: CSK

– raw grayscale pixel values as features

• Henriques et al.: KCF

– HoG multi-channel features

Further work

• Danelljan et al.: DSST

– PCA-HoG + grayscale pixel features

– filters for translation and for scale (in the scale-space pyramid)

• Li et al.: SAMF

– HoG, color-naming and grayscale pixel features

– quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al.: SRDCF

– spatial regularization in the learning process

→ limits the boundary effect

→ penalizes filter coefficients depending on their spatial location

– allows the use of a much larger search region

– more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al.: Deep SRDCF, C-COT (best performance in VOT 2016)

• Ma et al.

– features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers

– for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al.: MDNet, T-CNN

– CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? "The zero-order tracker"

VOT community evolution

VOT2013 (ICCV 2013): 51 coauthors, 14 pages

VOT2014 (ECCV 2014): 57 coauthors, 27 pages

VOT2015 (ICCV 2015): 128 coauthors, 24 pages, + VOT-TIR paper (69 coauthors)

VOT2016 (ECCV 2016): 141 coauthors, 44 pages, + VOT-TIR paper (70 coauthors)

[Figure: activity plot (axis ticks 1500, 3000) with the VOT2013 to VOT2016 submission deadlines marked]

VOT challenge evolution

• Gradual increase of dataset size

• Gradual refinement of dataset construction

• Gradual refinement of performance measures

• Gradual increase of tested trackers

Challenge | Perf. measures | Dataset size | Target box | Property | Trackers tested

VOT2013 | ranks, A, R | 16, s. manual | manual | per frame | 27

VOT2014 | ranks, A, R, EFO | 25, s. manual | manual | per frame | 38

VOT2015 | EAO, A, R, EFO | 60, fully auto | manual | per frame | 62 VOT, 24 VOT-TIR

VOT2016 | EAO, A, R, EFO | 60, fully auto | auto | per frame | 70 VOT, 24 VOT-TIR
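The A (accuracy) and R (robustness) columns both build on the overlap between the predicted and the ground-truth box. The sketch below is a simplified illustration, not the VOT toolkit protocol: it assumes axis-aligned (x, y, width, height) boxes, averages the overlap over frames where the boxes still intersect, and counts zero-overlap frames as failures, without the re-initialization the real protocol performs after a failure. All names are illustrative.

```python
def overlap(box_a, box_b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over frames with non-zero overlap, and the number of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    failures = sum(1 for o in overlaps if o == 0)
    return accuracy, failures

ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
print(accuracy, failures)
```

Here the second prediction overlaps the ground truth by one third and the third one misses it entirely, so the toy sequence yields one failure.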

Class of trackers tested

• Single-object, single-camera

• Short-term

– Trackers performing without re-detection

• Causality

– Tracker is not allowed to use any future frames

• No prior knowledge about the target

– Only a single training example: BBox in the first frame

• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

ALOV (315 seq.) [Smeulders et al., 2013] + PTR (~50 seq.) [Vojir et al., 2013] + OTB (~50 seq.) [Wu et al., 2013] + >30 new sequences from the VOT2015 committee = 443 sequences

Filtered out:

• Grayscale sequences

• <400 pixel targets

• Poorly defined targets

• Artificially created sequences

[Figure: example of a poorly defined target; example of an artificially created sequence]

Result: 356 sequences

VOT Automatic Dataset Construction Protocol: cluster + sample

Construction (2/3): Clustering

• Approximately annotate targets

• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)

• Cluster into K = 28 groups (automatic selection of K)

Feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences

Construction (3/3): Sampling

• Requirement:

– Diverse visual attributes

– Challenging subset

• Global visual attributes computed

• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers

• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse


Multi-object Tracking

Other Tracking Problems

Cell division

httpwwwyoutubecomwatchv=rgLJrvoX_qo

Three rounds of cell division in Drosophila Melanogaster

httpwwwyoutubecomwatchv=YFKA647w4Jg

splitting and merging events hellip

34150

So I want to track

35150

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

What is here in libraries OpenCV

bull KLT tracker (1981)

bull CAMshift and Meanshift (1998)

bull TLD (2011)

bull MedianFLow (2010)

bull Boosting (2006)

bull MIL (2009)

and KCF (2012) in opencv_contrib38150

What is here in libraries BoofCV

bull SparseFlow (KLT tracker) (1991)

bull MeanShift (1998)

bull TLD (2011)

bull KCF (2012)

39150

httpsgithubcomlessthanoptimalBoofCV

Computer vision librar

Bad news OpenCV

40150httpwwwvotchallengenet

Reference implementation

41150

But authors often publish their own implementation on githubhellip

git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit

Good news

42150

Good news Not so

43150

Good news Not so

44150

Good newsEasy to contribute to open source

Just implement some modern tracking algorithm

45150

So we need to understand how tracking works

Classic

KLT tracker

46150

The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

slide creditTomas Svoboda 50150

Original Lucas-Kanade algorithm II

slide creditTomas Svoboda 51150

Original Lucas-Kanade algorithm III

slide creditTomas Svoboda 52150

Original Lucas-Kanade algorithm IV

slide creditTomas Svoboda 53150

Original Lucas-Kanade summary

slide credit Tomas Svoboda 54150

KLT Tracker

For good conditioning patch must be

texturedstructured enough

bull Uniform patch no information

bull Contour element aperture problem (one dimensional

information)

bull Corners blobs and texture best estimate

[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]

slide credit

Patrick Perez

55150

Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

ndash Translation is usually sufficient for small fragments but

bull Perspective transforms and occlusions cause drift and loss

ndash Two complementary options

bull Kill tracklets when minimum SSD too large

bull Compare as well with initial patch under affine transform (warp) assumption

slide credit

Patrick Perez

58150

Characteristics of KLT

bull cost function sum of squared intensity differences

between template and window

bull optimization technique gradient descent

bull model learning no update last frame convex

combination

bull attractive properties

ndashfast

ndasheasily extended to image-to-image transformations with

multiple parameters

59150

What about deep

learning or

CNN for tracking

60150

61150

AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo

Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016

Recap CaffeNet architecture

63

Recent trackers development

74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic

bull Henriques et al ndash CSK

ndash raw grayscale pixel values as features

bull Henriques et al ndash KCF

ndash HoG multi-channel features

Further work

bull Danelljan et al ndash DSST

ndash PCA-HoG + grayscale pixels features

ndash filters for translation and for scale (in the scale-space pyramid)

bull Li et al ndash SAMF

ndash HoG color-naming and grayscale pixels features

ndash quantize scale space and normalize each scale to one size by bilinear

inter rarr only one filter on normalized size

80

Discriminative Correlation Filters Trackers

bull Danelljan et al ndashSRDCF

ndash spatial regularization in the learning process

rarr limits boundary effect

rarr penalize filter coefficients depending on their spatial location

ndash allows to use much larger search region

ndash more discriminative to background (more training data)

CNN-based Correlation Trackers

bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT

2016)

bull Ma et al

ndash features VGG-Net pretrained on ImageNet dataset extracted from

third fourth and fifth convolution layer

ndash for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)

bull Nam et al ndash MDNet T-CNN

ndash CNN classification (3 convolution layers and 2 fully connected layers)

81

Evaluation of Trackers

Tracking Which methods work

83150

Tracking Which methods work

84150

What works ldquoThe zero-order trackerrdquo

85150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

3000

1500

51 Coauthors 14pgs

ICCV2013

57 Coauthors 27pgs

ECCV2014

128 Coauthors 24pgs

ICCV2015

141 Coauthors 44pgs

ECCV2016

+ VOT-TIRpaper

(69 coauth)

+ VOT-TIRpaper

(70 coauth)

VOT2014submission deadline

VOT2015submission deadline

VOT2016submission deadline

VOT2013submission deadline

8639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT challenge evolution

bull Gradual increase of dataset size

bull Gradual refinement of dataset construction

bull Gradual refinement of performance measures

bull Gradual increase of tested trackers

Perf Measures Dataset size Target box Property Trackers tested

VOT2013 ranks A R 16 s manual manual per frame 27

VOT2014 ranks A R EFO 25 s manual manual per frame 38

VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR

VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR

8739

Class of trackers tested

bull Single-object single-camera

bull Short-term

ndashTrackers performing without re-detection

bull Causality

ndashTracker is not allowed to use any future frames

bull No prior knowledge about the target

ndashOnly a single training example ndash BBox in the first frame

bull Object state encoded by a bounding box

88150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (13) Sequence candidates

ALOV (315 seq)[Smeulders et al2013]

Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences

Example Poorly defined target Example Artificially created

356 sequences

PTR (~50 seq)[Vojir et al2013]

+OTB (~50 seq)

[Wu et al2013]

+

gt30 new sequencesfrom VOT2015

committee

+

443sequences

VOT Automatic Dataset Construction Protocol

cluster + sample

8939

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (23) Clustering

bull Approximately annotate targets

bull 11 global attributes estimated

automatically for 356 sequences (eg blur camera motion object motion)

bull Cluster into K = 28 groups (automatic selection of K)

Feature encoding

11 dim

Affinity Propagation [Frey Dueck 2007]

Cluster similar sequences

9039

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (33) Sampling

bull Requirement

bull Diverse visual attributes

bull Challenging subset

bull Global visual attributes computed

bull Tracking difficulty attribute Applied FoT ASMS KCF trackers

bull Developed a sampling strategy that sampled

challenging sequences while keeping the global

attributes diverse

9139

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT201516 dataset 60 sequences

9239

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Object annotation

bull Automatic bounding box placement

1 Segment the target (semi-automatic)

2 Automatically fit a bounding box by optimizing a cost function

9339

Kristan et al VOT2016 results

Sequence ranking

bull Among the most challenging sequences

bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)

9442

Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)

Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT Results Realtime

bull Flow-based Mean Shift-based Correlation filters

bull Engineering use of basic features

2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)

2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)

ASMSBDF

FoT

2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)

PLTFoT

CCMS

9639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Results

bull C-COT slightly ahead of TCNN

bull Most accurate SSAT

bull Most robust C-COT and MLDF

Overlap curves

9742

(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple

C-COT

TCNNSSAT

MLDF

AR-raw plot

Acc

ura

cy

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Tracking speed

bull Top-performers slowest bull Plausible cause CNN

bull Real-time bound Staple+

bull Decent accuracy

bull Decent robustness

Note the speed in some Matlab trackers has been significantly underestimated by the toolkit since it was measuring also the Matlabrestart time The EFOs of Matlab trackers are in fact higher than stated in this figure

9842

C-COT TCNNSSAT MLDF

Staple+

Staple+

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT public resources

bull Resources publicly available VOT page

bull Raw results of all tested trackers

bull Relevant methodology papers

bull 2016 Submitted trackers codebinaries

bull All fully annotated datasets (2013-2016)

bull Documentation tutorials forum

9939

Summary

bull ldquoVisual Trackingrdquo may refer to quite different problems

bull The area is just starting to be affected by CNNs

bull Robustness at all levels is the road to reliable performance

bull Key components of trackers

ndash target learning (modelling ldquotemplate updaterdquo)

ndash integration of detection and temporal smoothness assumptions

ndash representation of the image and target

bull Be careful when evaluating tracking results

100150

Computer vision courses online

bull httpscwfelcvutczwikicoursesucuws17start UCU Winter

school course

bull httpcsbrowneducoursescs143

bull httpswwwudacitycomcourseintroduction-to-computer-vision-

-ud810

bull httpcs231nstanfordedu

bull httpvisionstanfordeduteachingcs131_fall1415indexhtml

101150

A bit of self-PR

Center for Machine Perception

Department of Cybernetics

Faculty of Electrical Engineering

Czech Technical University Prague

established 1707

Area of expertise: computer vision, image processing, pattern recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & excellence: a significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please?

Other Tracking Problems

Cell division

http://www.youtube.com/watch?v=rgLJrvoX_qo

Three rounds of cell division in Drosophila melanogaster

http://www.youtube.com/watch?v=YFKA647w4Jg

Splitting and merging events ...

So I want to track

Just link to some computer vision lib

Python:
import cv2

or C++:
#include <opencv2/core.hpp>

or Java:
import org.opencv.core.Core;
import org.opencv.core.Mat;

What is available in libraries: OpenCV

• KLT tracker (1981)
• CAMShift and MeanShift (1998)
• TLD (2011)
• MedianFlow (2010)
• Boosting (2006)
• MIL (2009)

and KCF (2012) in opencv_contrib

What is available in libraries: BoofCV

• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)

https://github.com/lessthanoptimal/BoofCV

Computer vision librar

Bad news: OpenCV

http://www.votchallenge.net

Reference implementation

Good news

But authors often publish their own implementation on GitHub...

git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news? Not so

Good news: easy to contribute to open source

Just implement some modern tracking algorithm

So we need to understand how tracking works

Classic

KLT tracker

The KLT Tracker

Slide credit: Kris Kitani

KLT Tracker

Slide credit: Tomas Svoboda

Importance in Computer Vision

• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (footnote 1)
• After more than two decades, a project (footnote 2) at CMU was dedicated to this single algorithm, and the results were published in a premium journal [1]
• Part of a plethora of computer vision algorithms

Footnote 1: http://www.ces.clemson.edu/~stb/klt
Footnote 2: http://www.ri.cmu.edu/projects/project_515.html

Image alignment

The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.

Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm I

Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm II

Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm III

Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm IV

Slide credit: Tomas Svoboda

Original Lucas-Kanade summary

Slide credit: Tomas Svoboda
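The equations of the Lucas-Kanade slides are not present in this transcript. As a hedged reminder, the textbook form of the algorithm can be written as below. The symbols W(x; p) for the warp with parameters p, Δp for the update and H for the Gauss-Newton Hessian approximation are assumptions taken from the usual literature, not from the slides.

```latex
\begin{align}
  \mathbf{p}^{*} &= \arg\min_{\mathbf{p}} \sum_{\mathbf{x}}
      \bigl[ I(W(\mathbf{x};\mathbf{p})) - T(\mathbf{x}) \bigr]^{2} \\
  \Delta\mathbf{p} &= H^{-1} \sum_{\mathbf{x}}
      \Bigl[ \nabla I \tfrac{\partial W}{\partial \mathbf{p}} \Bigr]^{\top}
      \bigl[ T(\mathbf{x}) - I(W(\mathbf{x};\mathbf{p})) \bigr],
  \qquad
  H = \sum_{\mathbf{x}}
      \Bigl[ \nabla I \tfrac{\partial W}{\partial \mathbf{p}} \Bigr]^{\top}
      \Bigl[ \nabla I \tfrac{\partial W}{\partial \mathbf{p}} \Bigr] \\
  \mathbf{p} &\leftarrow \mathbf{p} + \Delta\mathbf{p}
\end{align}
```

For a pure translation, W(x; p) = x + p, the Jacobian ∂W/∂p is the identity and H reduces to the 2x2 matrix of summed gradient products.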

KLT Tracker

For good conditioning, the patch must be textured/structured enough:

• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]

Slide credit: Patrick Perez
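The three cases above can be checked numerically. The sketch below assumes the usual reading of "good conditioning": the smaller eigenvalue of the 2x2 matrix of summed gradient products (the structure tensor) must be well above zero. The function name `min_eigenvalue` and the synthetic 9x9 patches are illustrative, not taken from the slides.

```python
import numpy as np

def min_eigenvalue(patch):
    """Smaller eigenvalue of the 2x2 structure tensor of a grayscale patch."""
    gy, gx = np.gradient(patch.astype(float))
    # Structure tensor: sums of gradient products over the whole patch.
    G = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    return float(np.linalg.eigvalsh(G)[0])  # eigvalsh returns ascending order

uniform = np.ones((9, 9))                        # no gradient at all
edge = np.tile(np.arange(9.0), (9, 1))           # intensity changes along x only
yy, xx = np.mgrid[0:9, 0:9]
corner = ((xx >= 4) & (yy >= 4)).astype(float)   # bright quadrant: gradients in x and y

for name, patch in [("uniform", uniform), ("edge", edge), ("corner", corner)]:
    print(name, min_eigenvalue(patch))
```

Only the corner patch gives a clearly positive smaller eigenvalue; the uniform patch and the one-directional edge leave the 2x2 system singular, which is the aperture problem in numerical form.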

Aperture Problem Example

Image from Gary Bradski's slides

If we look through a small hole...

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays
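A minimal sketch of the coarse-to-fine idea, under simplifying assumptions that are not from the slides: the motion is an integer cyclic shift, the pyramid is built by 2x2 averaging, and an exhaustive search over displacements of -1, 0, +1 pixels stands in for the Gauss-Newton step at each level. The names `downsample`, `ssd` and `coarse_to_fine_shift` are hypothetical.

```python
import numpy as np

def downsample(img):
    """Halve the resolution by averaging 2x2 blocks (sides must be even)."""
    h, w = img.shape
    return img.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

def ssd(a, b):
    """Sum of squared differences between two equally sized images."""
    return float(np.sum((a - b) ** 2))

def coarse_to_fine_shift(template, image, levels=3):
    """Estimate integer (dy, dx) such that image ~ np.roll(template, (dy, dx))."""
    pyr_t, pyr_i = [template], [image]
    for _ in range(levels - 1):
        pyr_t.append(downsample(pyr_t[-1]))
        pyr_i.append(downsample(pyr_i[-1]))
    dy, dx = 0, 0
    for t, i in zip(reversed(pyr_t), reversed(pyr_i)):   # coarsest level first
        dy, dx = 2 * dy, 2 * dx          # propagate the estimate to the finer level
        candidates = [(dy + a, dx + b) for a in (-1, 0, 1) for b in (-1, 0, 1)]
        dy, dx = min(candidates,
                     key=lambda d: ssd(np.roll(t, d, axis=(0, 1)), i))
    return dy, dx

yy, xx = np.mgrid[0:64, 0:64]
template = np.exp(-((xx - 24) ** 2 + (yy - 36) ** 2) / (2 * 8.0 ** 2))
image = np.roll(template, (-4, 6), axis=(0, 1))    # true displacement
estimate = coarse_to_fine_shift(template, image)
print(estimate)
```

Each level only refines by one pixel, yet three levels are enough to recover a displacement of several pixels, because one pixel at the coarsest level corresponds to four pixels at full resolution.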

Monitoring quality

- Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters
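To make the cost function and its optimization concrete, here is a hedged sketch of a single Gauss-Newton step for a pure translation. It assumes a smooth synthetic image (a Gaussian blob) and a sub-pixel shift, so one step is already close to the answer. A real tracker would iterate, re-warp the window and use a pyramid. The names `lk_translation_step` and `blob` are illustrative.

```python
import numpy as np

def lk_translation_step(template, image):
    """One Gauss-Newton step for d = (dx, dy) that minimises
    sum over x of (image(x + d) - template(x))^2, linearised around d = 0."""
    gy, gx = np.gradient(image)
    # 2x2 normal-equation matrix built from image gradients.
    G = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    err = template - image
    b = np.array([np.sum(gx * err), np.sum(gy * err)])
    return np.linalg.solve(G, b)

yy, xx = np.mgrid[0:41, 0:41].astype(float)

def blob(cx, cy, sigma=6.0):
    """Smooth test image: an isotropic Gaussian centred at (cx, cy)."""
    return np.exp(-((xx - cx) ** 2 + (yy - cy) ** 2) / (2 * sigma ** 2))

template = blob(20.0, 20.0)
image = blob(20.3, 19.8)      # same content moved by (+0.3, -0.2) pixels
d = lk_translation_step(template, image)
print(d)                      # close to (0.3, -0.2)
```

The matrix G is the same 2x2 matrix whose conditioning decides whether a patch is trackable, which is why uniform patches and straight edges fail.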

What about deep learning, or CNNs for tracking?

Recap: CaffeNet architecture

AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012
CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016

Recent trackers development

https://github.com/foolwood/benchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t = 0: samples, labels (+1 +1 +1 -1 -1 -1) → Classifier

t > 0: ... classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

"Cross-correlation is equivalent to an element-wise product in the Fourier domain":

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$

• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $^{*}$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
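The statement can be verified numerically on a short 1-D signal. The sketch below assumes real-valued signals and the cyclic definition y[k] = sum over n of x[n] w[(n + k) mod N]. The helper name `circular_cross_correlation` is illustrative.

```python
import numpy as np

def circular_cross_correlation(x, w):
    """Direct O(n^2) cyclic cross-correlation: y[k] = sum_n x[n] * w[(n + k) % N]."""
    n = len(x)
    return np.array([sum(x[i] * w[(i + k) % n] for i in range(n))
                     for k in range(n)])

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
w = rng.standard_normal(16)

direct = circular_cross_correlation(x, w)
# Fourier route: conjugate one spectrum, multiply element-wise, transform back.
via_fft = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))

print(np.allclose(direct, via_fft))
```

The direct sum costs O(n^2) operations, the Fourier route O(n log n). That gap is what correlation-filter trackers exploit.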

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied, e.g., to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and the dual-space representation

$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$

• For many kernels: circulant data $\Rightarrow$ circulant $K$ matrix

$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields

$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$

Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$
• For the Gaussian kernel it yields

$\mathbf{k}^{\mathbf{xx}'} = \exp\left( -\dfrac{1}{\sigma^2} \left( \|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}\left( \hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}' \right) \right) \right)$

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left( \hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}} \right)$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features:

$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$

Multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in this term.

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)

Training and detection (Matlab):

function alphaf = train(x, y, sigma, lambda)
    k = kernel_correlation(x, x, sigma);
    alphaf = fft2(y) ./ (fft2(k) + lambda);   % ridge regression solved in the Fourier domain
end

function y = detect(alphaf, x, z, sigma)
    k = kernel_correlation(z, x, sigma);
    y = real(ifft2(alphaf .* fft2(k)));       % response map over all cyclic shifts
end

function k = kernel_correlation(x1, x2, sigma)
    c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));   % sum over the channel dimension
    d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
    k = exp(-1 / sigma^2 * abs(d) / numel(d));       % Gaussian kernel
end
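For readers without Matlab, the three functions can be ported to NumPy almost line by line. This is a hedged sketch, not the authors' released code: the H x W x C array layout, the random test patch and the parameter values (`sigma=0.5`, `lam=1e-4`, label bandwidth 2) are assumptions chosen for illustration.

```python
import numpy as np

def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of two H x W x C patches over all cyclic shifts."""
    # Cross-correlation in the Fourier domain, summed over the channel axis.
    c = np.fft.ifft2(np.sum(np.conj(np.fft.fft2(x1, axes=(0, 1)))
                            * np.fft.fft2(x2, axes=(0, 1)), axis=2))
    d = np.sum(x1 ** 2) + np.sum(x2 ** 2) - 2.0 * c
    return np.exp(-np.abs(d) / (sigma ** 2 * d.size))

def train(x, y, sigma, lam):
    """Kernel ridge regression solved element-wise in the Fourier domain."""
    k = kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alphaf, x, z, sigma):
    """Response map of the trained filter on a new patch z."""
    k = kernel_correlation(z, x, sigma)
    return np.real(np.fft.ifft2(alphaf * np.fft.fft2(k)))

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))                     # training patch
dist = np.minimum(np.arange(32), 32 - np.arange(32))     # cyclic distance to index 0
y = np.exp(-(dist[:, None] ** 2 + dist[None, :] ** 2) / (2 * 2.0 ** 2))  # label peak at (0, 0)

alphaf = train(x, y, sigma=0.5, lam=1e-4)
z = np.roll(x, (3, 5), axis=(0, 1))                      # same content, cyclically shifted
response = detect(alphaf, x, z, sigma=0.5)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(response), response.shape))
print(peak)
```

With the argument order used here, which conjugates the spectrum of the first argument, the response peak appears at the cyclic negative of the shift applied to z: the shift (3, 5) moves the peak from (0, 0) to (29, 27). The sign convention is worth checking before converting a peak location into a motion estimate.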

From KCF to Discriminative CF trackers

Basic

• Henriques et al.: CSK
  - raw grayscale pixel values as features
• Henriques et al.: KCF
  - HoG multi-channel features

Further work

• Danelljan et al.: DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al.: SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter, on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al.: SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows the use of a much larger search region
  - more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al.: Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al.: MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? "The zero-order tracker"

VOT community evolution

• VOT2013 (ICCV 2013): 51 coauthors, 14 pages
• VOT2014 (ECCV 2014): 57 coauthors, 27 pages
• VOT2015 (ICCV 2015): 128 coauthors, 24 pages, + VOT-TIR paper (69 coauthors)
• VOT2016 (ECCV 2016): 141 coauthors, 44 pages, + VOT-TIR paper (70 coauthors)

[Figure: timeline with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked; vertical axis ticks at 1500 and 3000]

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

          Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013   ranks, A, R        16, s. manual    manual       per frame   27
VOT2014   ranks, A, R, EFO   25, s. manual    manual       per frame   38
VOT2015   EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016   EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR
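The measures A (accuracy) and R (robustness) in the table are built on the overlap between predicted and ground-truth boxes. The sketch below is a simplified illustration, not the VOT toolkit: it assumes axis-aligned boxes given as (x, y, width, height), counts a failure whenever the overlap is zero, and omits the re-initialization that the VOT protocol performs after a failure. The function names are hypothetical.

```python
def overlap(a, b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Average overlap over frames where the target is still hit, plus failure count."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures

pred = [(0, 0, 10, 10), (0, 0, 10, 10), (50, 50, 5, 5)]
truth = [(0, 0, 10, 10), (5, 0, 10, 10), (0, 0, 10, 10)]
acc, fails = accuracy_and_failures(pred, truth)
print(acc, fails)
```

Keeping the two numbers separate matters: a tracker can be very accurate while it tracks and still fail often, which is why the challenge reports both.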

Class of trackers tested

• Single-object, single-camera
• Short-term
  - Trackers performing without re-detection
• Causality
  - Tracker is not allowed to use any future frames
• No prior knowledge about the target
  - Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

ALOV (315 seq.) [Smeulders et al., 2013] + PTR (~50 seq.) [Vojir et al., 2013] + OTB (~50 seq.) [Wu et al., 2013] + >30 new sequences from the VOT2015 committee = 443 sequences

Filtered out:
• Grayscale sequences
• Targets of <400 pixels
• Poorly defined targets
• Artificially created sequences

[Figure examples: a poorly defined target; an artificially created sequence]

→ 356 sequences

VOT Automatic Dataset Construction Protocol: cluster + sample

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)

Feature encoding (11-dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences

Construction (3/3): Sampling

• Requirement:
  - Diverse visual attributes
  - Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
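The slide does not state which cost function is optimized, so the sketch below uses a hypothetical one purely to illustrate step 2: foreground pixels left outside the box are penalized with weight `alpha`, and background pixels inside the box with weight 1. The box is axis-aligned and found by brute force, which is only practical for small masks. The name `fit_box` and the value `alpha=4.0` are assumptions.

```python
import numpy as np

def fit_box(mask, alpha=4.0):
    """Axis-aligned box (top, left, bottom, right), inclusive, minimising
    alpha * (#foreground outside the box) + (#background inside the box)."""
    h, w = mask.shape
    fg = mask.astype(int)
    # Integral image so that any box sum costs O(1).
    ii = np.zeros((h + 1, w + 1), dtype=int)
    ii[1:, 1:] = fg.cumsum(0).cumsum(1)
    total = int(fg.sum())
    best, best_cost = None, None
    for t in range(h):
        for b in range(t, h):
            for l in range(w):
                for r in range(l, w):
                    inside_fg = ii[b + 1, r + 1] - ii[t, r + 1] - ii[b + 1, l] + ii[t, l]
                    area = (b - t + 1) * (r - l + 1)
                    cost = alpha * (total - inside_fg) + (area - inside_fg)
                    if best_cost is None or cost < best_cost:
                        best, best_cost = (t, l, b, r), cost
    return best

mask = np.zeros((8, 8), dtype=bool)
mask[2:5, 3:7] = True     # the segmented target
mask[7, 0] = True         # one stray segmentation pixel
box = fit_box(mask)
print(box)
```

The stray pixel is left outside: enclosing it would pull in far more background than the penalty for missing a single foreground pixel. A box fitted this way is less sensitive to segmentation noise than the tightest enclosing rectangle.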

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)



Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

Just link to some computer vision lib

from cv2 import tracker

or

include ltopencv2corehppgt

Or

import orgopencvcoreCore

import orgopencvcoreMat

What is here in libraries OpenCV

bull KLT tracker (1981)

bull CAMshift and Meanshift (1998)

bull TLD (2011)

bull MedianFLow (2010)

bull Boosting (2006)

bull MIL (2009)

and KCF (2012) in opencv_contrib38150

What is here in libraries BoofCV

bull SparseFlow (KLT tracker) (1991)

bull MeanShift (1998)

bull TLD (2011)

bull KCF (2012)

39150

httpsgithubcomlessthanoptimalBoofCV

Computer vision librar

Bad news: OpenCV

http://www.votchallenge.net

Reference implementation

But authors often publish their own implementation on GitHub…

git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news

Good news? Not so

Good news: Easy to contribute to open source

Just implement some modern tracking algorithm

So we need to understand how tracking works

Classic

KLT tracker

The KLT Tracker

Slide credit: Kris Kitani

KLT Tracker

Slide credit: Tomas Svoboda

Importance in Computer Vision

• First published in 1981 as an image registration method [3]

• Improved many times, most importantly by Carlo Tomasi [5, 4]

• Free implementation(s) available (1)

• After more than two decades, a project (2) at CMU was dedicated to this single algorithm, and the results were published in a premium journal [1]

• Part of a plethora of computer vision algorithms

(1) http://www.ces.clemson.edu/~stb/klt/

(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment

Slide credit: Tomas Svoboda

The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) may also be a small subwindow within an image.

Original Lucas-Kanade algorithm I

Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm II

Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm III

Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm IV

Slide credit: Tomas Svoboda

Original Lucas-Kanade summary

Slide credit: Tomas Svoboda
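The Lucas-Kanade slides are image-only in this transcript, so the following is only a rough sketch of the alignment idea stated above, not the slides' derivation. It assumes a pure translation, is reduced to a 1-D signal for brevity, and uses the usual Gauss-Newton update for the sum-of-squared-differences cost. The function name `lucas_kanade_shift_1d`, the Gaussian test signal and the iteration limits are illustrative assumptions.

```python
import numpy as np

def lucas_kanade_shift_1d(template, signal, p0=0.0, n_iters=30, tol=1e-6):
    """Estimate p such that signal(x + p) ~= template(x), by Gauss-Newton on the SSD."""
    grid = np.arange(len(signal), dtype=float)
    grad = np.gradient(signal)                         # dI/dx on the sample grid
    p = p0
    for _ in range(n_iters):
        warped = np.interp(grid + p, grid, signal)     # I(x + p), linear interpolation
        warped_grad = np.interp(grid + p, grid, grad)  # dI/dx evaluated at x + p
        error = template - warped                      # T(x) - I(x + p)
        dp = np.sum(warped_grad * error) / np.sum(warped_grad ** 2)
        p += dp
        if abs(dp) < tol:
            break
    return p

# Hypothetical test data: a Gaussian bump that is displaced by 2.3 samples in `signal`.
xs = np.arange(100, dtype=float)
template = np.exp(-(xs - 50.0) ** 2 / (2 * 5.0 ** 2))
signal = np.exp(-(xs - 52.3) ** 2 / (2 * 5.0 ** 2))
estimated_shift = lucas_kanade_shift_1d(template, signal)
print(f"estimated shift: {estimated_shift:.3f}")
```

The linearisation only holds for small displacements relative to the signal structure, which is the motivation for the multi-resolution variant discussed later.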

KLT Tracker

For good conditioning, the patch must be textured/structured enough

• Uniform patch: no information

• Contour element: aperture problem (one-dimensional information)

• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]

Slide credit: Patrick Perez
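The three cases above can be illustrated with the 2x2 matrix of summed gradient products over a patch, whose eigenvalues indicate how well the translation is constrained. This is a minimal sketch under the assumption that "good conditioning" is judged by the smaller eigenvalue, as in the Tomasi and Shi criterion; the helper name `structure_tensor_eigenvalues` and the synthetic patches are made up for the example.

```python
import numpy as np

def structure_tensor_eigenvalues(patch):
    """Return (lambda_min, lambda_max) of the 2x2 gradient matrix summed over the patch."""
    gy, gx = np.gradient(patch.astype(float))  # derivatives along rows (y) and columns (x)
    G = np.array([[np.sum(gx * gx), np.sum(gx * gy)],
                  [np.sum(gx * gy), np.sum(gy * gy)]])
    lam_min, lam_max = np.linalg.eigvalsh(G)   # eigvalsh returns eigenvalues in ascending order
    return lam_min, lam_max

size = 16
uniform = np.ones((size, size))                # no gradient at all
edge = np.zeros((size, size))
edge[:, size // 2:] = 1.0                      # vertical step edge: gradient in x only
corner = np.zeros((size, size))
corner[size // 2:, size // 2:] = 1.0           # corner: gradient in both x and y

for name, patch in [("uniform", uniform), ("edge", edge), ("corner", corner)]:
    lam_min, lam_max = structure_tensor_eigenvalues(patch)
    print(f"{name:8s} lambda_min={lam_min:.2f} lambda_max={lam_max:.2f}")
```

Both eigenvalues vanish for the uniform patch, only one is non-zero for the edge (the aperture problem), and both are clearly positive for the corner.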

Aperture Problem Example

Image from Gary Bradski slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

– Arbitrary displacement

• Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

– Translation is usually sufficient for small fragments, but

• Perspective transforms and occlusions cause drift and loss

– Two complementary options

• Kill tracklets when the minimum SSD is too large

• Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez
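The two checks above could be combined roughly as follows. This is a simplified sketch: it compares patches directly and leaves out the affine warp of the initial patch that the slide mentions, and the names `ssd` and `keep_tracklet` as well as the threshold value are assumptions for illustration.

```python
import numpy as np

def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches."""
    diff = patch_a.astype(float) - patch_b.astype(float)
    return float(np.sum(diff ** 2))

def keep_tracklet(initial_patch, previous_patch, current_patch, max_ssd):
    """Keep a tracklet only if the current patch still matches both references.

    The frame-to-frame test catches sudden loss (e.g. occlusion); the test against
    the initial patch catches slow drift. A fuller version would first warp the
    initial patch with an estimated affine transform, which is omitted here.
    """
    matches_previous = ssd(previous_patch, current_patch) <= max_ssd
    matches_initial = ssd(initial_patch, current_patch) <= max_ssd
    return matches_previous and matches_initial

# Hypothetical usage with synthetic 8x8 patches.
rng = np.random.default_rng(0)
initial = rng.random((8, 8))
slightly_changed = initial + 0.01
occluded = np.zeros((8, 8))
print(keep_tracklet(initial, initial, slightly_changed, max_ssd=0.5))
print(keep_tracklet(initial, slightly_changed, occluded, max_ssd=0.5))
```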

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window

• optimization technique: gradient descent

• model learning: no update / last frame / convex combination

• attractive properties:

– fast

– easily extended to image-to-image transformations with multiple parameters

What about deep learning or CNNs for tracking?

AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012

CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014

Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"

Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016

Recap: CaffeNet architecture

Recent trackers development

https://github.com/foolwood/benchmark_results

[Figure labels: CNN, KCF]

Discriminative Tracking (T by Detection)

t = 0: samples with labels +1, +1, +1, -1, -1, -1 → Classifier

t > 0: … classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

"Cross-correlation is equivalent to an element-wise product in the Fourier domain"

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$

• where

– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)

– $\times$ is the element-wise product

– $*$ is the complex conjugate (i.e. negate the imaginary part)

• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
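The identity above can be checked numerically. The sketch below compares a direct cyclic cross-correlation with the Fourier-domain product for a 1-D real signal; the function names are illustrative, and the indexing convention y[i] = sum_j x[j] w[(i + j) mod n] is an assumption chosen to match the conjugate on x.

```python
import numpy as np

def circular_cross_correlation(x, w):
    """Direct O(n^2) cyclic cross-correlation: y[i] = sum_j x[j] * w[(i + j) % n]."""
    n = len(x)
    return np.array([sum(x[j] * w[(i + j) % n] for j in range(n)) for i in range(n)])

def circular_cross_correlation_fft(x, w):
    """The same quantity in O(n log n): y_hat = conj(x_hat) * w_hat, element-wise."""
    return np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(w)))

rng = np.random.default_rng(0)
x = rng.standard_normal(8)
w = rng.standard_normal(8)
print(np.allclose(circular_cross_correlation(x, w),
                  circular_cross_correlation_fft(x, w)))
```

Because the indices wrap modulo n, the result includes contributions from the "wrapped" window positions, which is the cyclic behaviour noted on the slide.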

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows replacing standard operations with fast Fourier operations

• The same idea can be applied, e.g., to Kernel Ridge Regression, with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and the dual-space representation

$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$

• For many kernels: circulant data ⇒ circulant K matrix

$K = C(\mathbf{k}^{\mathbf{x}\mathbf{x}})$, where $\mathbf{k}^{\mathbf{x}\mathbf{x}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields

$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{x}\mathbf{x}} + \lambda}$

Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
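As a sanity check of the diagonalization, the sketch below builds the full kernel matrix for all cyclic shifts of a 1-D signal, solves the ridge system directly, and compares the result with the Fourier-domain division. A Gaussian kernel is assumed; the names `solve_direct` and `solve_fourier`, the signal length and the values of sigma and lambda are illustrative.

```python
import numpy as np

def gaussian_kernel(a, b, sigma):
    return np.exp(-np.sum((a - b) ** 2) / sigma ** 2)

def solve_direct(x, y, sigma, lam):
    """Kernel ridge regression over all cyclic shifts of x, solved as a dense system."""
    n = len(x)
    shifts = [np.roll(x, i) for i in range(n)]
    K = np.array([[gaussian_kernel(a, b, sigma) for b in shifts] for a in shifts])
    alpha = np.linalg.solve(K + lam * np.eye(n), y)
    return alpha, K

def solve_fourier(k_xx, y, lam):
    """Same solution from the first row of K only: alpha_hat = y_hat / (k_hat + lambda)."""
    return np.real(np.fft.ifft(np.fft.fft(y) / (np.fft.fft(k_xx) + lam)))

rng = np.random.default_rng(0)
x = rng.standard_normal(16)
y = rng.standard_normal(16)
alpha_direct, K = solve_direct(x, y, sigma=2.0, lam=1e-2)
alpha_fourier = solve_fourier(K[0], y, lam=1e-2)
max_abs_difference = float(np.max(np.abs(alpha_direct - alpha_fourier)))
print(max_abs_difference)
```

The dense route needs the full n x n matrix, while the Fourier route touches only its first row, which is where the complexity gap quoted on the slide comes from.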

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{x}\mathbf{x}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$

• For the Gaussian kernel it yields

$\mathbf{k}^{\mathbf{x}\mathbf{x}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:

1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{x}\mathbf{z}})$

2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{x}\mathbf{z}} \odot \hat{\boldsymbol{\alpha}}\right)$

• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$

• Kernel allows integration of more complex and multi-channel features

$k_i^{\mathbf{x}\mathbf{x}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$

Multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in this term
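To make the Gaussian kernel correlation concrete, here is a small 1-D sketch that evaluates it with the FFT and checks it against a brute-force loop over cyclic shifts. It also includes the linear-interpolation update as a one-line helper. All function names and the interpolation rate are assumptions for illustration, not values from the slides.

```python
import numpy as np

def gaussian_kernel_correlation(x, xp, sigma):
    """k[i] = exp(-(||x||^2 + ||xp||^2 - 2 * c[i]) / sigma^2), with c computed by FFT."""
    c = np.real(np.fft.ifft(np.conj(np.fft.fft(x)) * np.fft.fft(xp)))
    d = np.dot(x, x) + np.dot(xp, xp) - 2.0 * c
    return np.exp(-np.maximum(d, 0.0) / sigma ** 2)  # clamp tiny negative round-off

def gaussian_kernel_correlation_naive(x, xp, sigma):
    """Brute force: Gaussian kernel between x and every cyclic shift of xp."""
    n = len(x)
    return np.array([np.exp(-np.sum((x - np.roll(xp, -i)) ** 2) / sigma ** 2)
                     for i in range(n)])

def interpolate_model(old, new, rate):
    """Linear-interpolation update of a model quantity (rate is a hypothetical setting)."""
    return (1.0 - rate) * old + rate * new

rng = np.random.default_rng(1)
x = rng.standard_normal(12)
xp = rng.standard_normal(12)
fast = gaussian_kernel_correlation(x, xp, sigma=3.0)
slow = gaussian_kernel_correlation_naive(x, xp, sigma=3.0)
print(np.allclose(fast, slow))
```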

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters

• code fits on one slide of the presentation

• uses HoG features (32 channels)

• ~300 FPS

• open source (Matlab/Python/Java/C)

Training and detection (Matlab)

    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      % sum over the channel dimension in the kernel computation
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
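For readers who prefer Python, the Matlab listing above translates almost line by line to NumPy. The sketch below is such a port under stated assumptions: patches are H x W x C arrays, the regression target is a Gaussian peaked at index (0, 0) with cyclic wrap-around, and the helper `gaussian_labels` plus all parameter values are illustrative rather than taken from the slides.

```python
import numpy as np

def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of two H x W x C patches (sum over channels)."""
    f1 = np.fft.fft2(x1, axes=(0, 1))
    f2 = np.fft.fft2(x2, axes=(0, 1))
    c = np.fft.ifft2(np.sum(np.conj(f1) * f2, axis=2))
    d = np.vdot(x1, x1) + np.vdot(x2, x2) - 2 * c
    return np.exp(-np.abs(d) / (sigma ** 2 * d.size))

def train(x, y, sigma, lam):
    k = kernel_correlation(x, x, sigma)
    return np.fft.fft2(y) / (np.fft.fft2(k) + lam)

def detect(alphaf, x, z, sigma):
    k = kernel_correlation(z, x, sigma)
    return np.real(np.fft.ifft2(alphaf * np.fft.fft2(k)))

def gaussian_labels(h, w, s):
    """Hypothetical regression target: Gaussian peak at (0, 0), wrapping cyclically."""
    dy = np.minimum(np.arange(h), h - np.arange(h))
    dx = np.minimum(np.arange(w), w - np.arange(w))
    return np.exp(-(dy[:, None] ** 2 + dx[None, :] ** 2) / (2.0 * s ** 2))

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 32, 3))         # stand-in for a multi-channel feature patch
y = gaussian_labels(32, 32, 2.0)
alphaf = train(x, y, sigma=0.5, lam=1e-4)

z = np.roll(x, (3, 5), axis=(0, 1))          # same content, cyclically shifted
response = detect(alphaf, x, z, sigma=0.5)
peak = tuple(int(v) for v in np.unravel_index(np.argmax(response), response.shape))
print(peak)
```

In this sketch the response peak appears at the negated shift modulo the patch size. Which sign you observe depends on which factor is conjugated in `kernel_correlation`, so the convention should be checked for any given implementation.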

From KCF to Discriminative CF trackers

Basic

• Henriques et al. – CSK

– raw grayscale pixel values as features

• Henriques et al. – KCF

– HoG multi-channel features

Further work

• Danelljan et al. – DSST

– PCA-HoG + grayscale pixel features

– filters for translation and for scale (in the scale-space pyramid)

• Li et al. – SAMF

– HoG, color-naming and grayscale pixel features

– quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. – SRDCF

– spatial regularization in the learning process

→ limits the boundary effect

→ penalizes filter coefficients depending on their spatial location

– allows the use of a much larger search region

– more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)

• Ma et al.

– features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers

– for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al. – MDNet, T-CNN

– CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? "The zero-order tracker"

Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution

• VOT2013 (ICCV 2013): 51 coauthors, 14 pages

• VOT2014 (ECCV 2014): 57 coauthors, 27 pages

• VOT2015 (ICCV 2015): 128 coauthors, 24 pages, + VOT-TIR paper (69 coauthors)

• VOT2016 (ECCV 2016): 141 coauthors, 44 pages, + VOT-TIR paper (70 coauthors)

[Figure: timeline marking the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines; vertical axis ticks at 1500 and 3000]

VOT challenge evolution

• Gradual increase of dataset size

• Gradual refinement of dataset construction

• Gradual refinement of performance measures

• Gradual increase of tested trackers

          Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013   ranks, A, R        16, s. manual    manual       per frame   27
VOT2014   ranks, A, R, EFO   25, s. manual    manual       per frame   38
VOT2015   EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016   EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR

Class of trackers tested

• Single-object, single-camera

• Short-term

– Trackers performing without re-detection

• Causality

– Tracker is not allowed to use any future frames

• No prior knowledge about the target

– Only a single training example: BBox in the first frame

• Object state encoded by a bounding box
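Since the object state is a bounding box, per-frame agreement between a tracker and the ground truth is commonly measured by region overlap (intersection over union). The slides do not spell out the formula, so the sketch below assumes axis-aligned boxes given as (x, y, width, height); the function name `overlap` is illustrative.

```python
def overlap(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

# Hypothetical boxes: identical, half-shifted, and disjoint.
print(overlap((0, 0, 10, 10), (0, 0, 10, 10)))
print(overlap((0, 0, 10, 10), (5, 0, 10, 10)))
print(overlap((0, 0, 10, 10), (20, 20, 5, 5)))
```

An overlap of zero is a natural trigger for counting a tracking failure, which is how an accuracy-style measure and a robustness-style measure can be derived from the same per-frame quantity.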

Construction (1/3): Sequence candidates

ALOV (315 seq) [Smeulders et al., 2013] + PTR (~50 seq) [Vojir et al., 2013] + OTB (~50 seq) [Wu et al., 2013] + >30 new sequences from the VOT2015 committee = 443 sequences

Filtered out:

• Grayscale sequences

• Targets smaller than 400 pixels

• Poorly-defined targets

• Artificially created sequences

[Figure: example of a poorly defined target; example of an artificially created sequence]

Remaining: 356 sequences → VOT Automatic Dataset Construction Protocol: cluster + sample

Construction (2/3): Clustering

• Approximately annotate targets

• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)

• Cluster into K = 28 groups (automatic selection of K)

Feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences

Construction (3/3): Sampling

• Requirement:

– Diverse visual attributes

– Challenging subset

• Global visual attributes computed

• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers

• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement

1. Segment the target (semi-automatic)

2. Automatically fit a bounding box by optimizing a cost function

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)

• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty: better ground truth

• Each frame manually per-pixel segmented

• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters

• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)

2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)

2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN

• Most accurate: SSAT

• Most robust: C-COT and MLDF

(1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (axis: accuracy), with C-COT, TCNN, SSAT and MLDF labelled]

VOT 2016 Tracking speed

• Top-performers slowest

• Plausible cause: CNN

• Real-time bound: Staple+

– Decent accuracy

– Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ labelled]

VOT public resources

• Resources publicly available: VOT page

• Raw results of all tested trackers

• Relevant methodology papers

• 2016: Submitted trackers code/binaries

• All fully annotated datasets (2013-2016)

• Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems

• The area is just starting to be affected by CNNs

• Robustness at all levels is the road to reliable performance

• Key components of trackers:

– target learning (modelling, "template update")

– integration of detection and temporal smoothness assumptions

– representation of the image and target

• Be careful when evaluating tracking results

What is here in libraries OpenCV

bull KLT tracker (1981)

bull CAMshift and Meanshift (1998)

bull TLD (2011)

bull MedianFLow (2010)

bull Boosting (2006)

bull MIL (2009)

and KCF (2012) in opencv_contrib38150

What is here in libraries BoofCV

bull SparseFlow (KLT tracker) (1991)

bull MeanShift (1998)

bull TLD (2011)

bull KCF (2012)

39150

httpsgithubcomlessthanoptimalBoofCV

Computer vision librar

Bad news OpenCV

40150httpwwwvotchallengenet

Reference implementation

41150

But authors often publish their own implementation on githubhellip

git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit

Good news

42150

Good news Not so

43150

Good news Not so

44150

Good newsEasy to contribute to open source

Just implement some modern tracking algorithm

45150

So we need to understand how tracking works

Classic

KLT tracker

46150

The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

slide creditTomas Svoboda 50150

Original Lucas-Kanade algorithm II

slide creditTomas Svoboda 51150

Original Lucas-Kanade algorithm III

slide creditTomas Svoboda 52150

Original Lucas-Kanade algorithm IV

slide creditTomas Svoboda 53150

Original Lucas-Kanade summary

slide credit Tomas Svoboda 54150

KLT Tracker

For good conditioning patch must be

texturedstructured enough

bull Uniform patch no information

bull Contour element aperture problem (one dimensional

information)

bull Corners blobs and texture best estimate

[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]

slide credit

Patrick Perez

55150

Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

- Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Also compare with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez
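The two monitoring options above can be sketched as a pair of residual checks. This is a simplified illustration: the thresholds are hypothetical values, patches are plain lists of rows with intensities in [0, 1], and the affine warp of the initial patch mentioned on the slide is left out (the patches are assumed to be already aligned).

```python
def mean_ssd(patch_a, patch_b):
    """Mean squared intensity difference between two equally sized patches."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(patch_a, patch_b):
        for a, b in zip(row_a, row_b):
            total += (a - b) ** 2
            count += 1
    return total / count


def track_is_alive(initial_patch, previous_patch, current_patch,
                   max_step_ssd=0.01, max_drift_ssd=0.04):
    """Apply both checks: frame-to-frame residual and drift against the initial patch."""
    step_ok = mean_ssd(previous_patch, current_patch) <= max_step_ssd
    drift_ok = mean_ssd(initial_patch, current_patch) <= max_drift_ssd
    return step_ok and drift_ok


initial = [[0.5, 0.5], [0.5, 0.5]]
previous = [[0.75, 0.75], [0.75, 0.75]]
current = [[0.8, 0.8], [0.8, 0.8]]
# The last step is small, but the patch has drifted away from the initial appearance.
print(track_is_alive(initial, previous, current))
```

Checking against the initial patch catches slow drift that a purely frame-to-frame test would accept.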

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters

What about deep learning, or CNNs, for tracking?

AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012
CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"

Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016

Recap: CaffeNet architecture

Recent trackers development

https://github.com/foolwood/benchmark_results

Figure labels: CNN, KCF

Discriminative Tracking (Tracking by Detection)

t = 0: samples with labels (+1 +1 +1 -1 -1 -1) → Classifier
t > 0: ... classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

"Cross-correlation is equivalent to an element-wise product in the Fourier domain":

$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$

• where:
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
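The identity above can be checked numerically on a short 1-D signal. The sketch below uses a naive O(n^2) DFT from the standard library only so that it stays self-contained; a real tracker would use an FFT routine. The function names are illustrative, and the indexing convention y[k] = sum_i x[i] w[(i + k) mod n] is an assumption chosen to match the conjugate on the x term.

```python
import cmath


def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * f * k / n) for k in range(n))
           for f in range(n)]
    return [z / n for z in out] if inverse else out


def cyclic_cross_correlation(x, w):
    """Direct definition: y[k] = sum_i x[i] * w[(i + k) mod n]."""
    n = len(x)
    return [sum(x[i] * w[(i + k) % n] for i in range(n)) for k in range(n)]


def cross_correlation_via_dft(x, w):
    """Same result through the Fourier domain: y_hat = conj(x_hat) * w_hat."""
    xf, wf = dft(x), dft(w)
    yf = [a.conjugate() * b for a, b in zip(xf, wf)]
    return [z.real for z in dft(yf, inverse=True)]


x = [1.0, 2.0, 0.0, -1.0]
w = [0.5, 0.0, 3.0, 1.0]
print(cyclic_cross_correlation(x, w))
print(cross_correlation_via_dft(x, w))
```

Both printed lists should agree up to floating-point rounding, including the wrap-around terms.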

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations
• The same idea can be applied, e.g., to Kernel Ridge Regression:

$$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$$

  with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual-space representation $\boldsymbol{\alpha}$
• For many kernels, circulant data ⇒ circulant $K$ matrix:

$$K = C(\mathbf{k}^{\mathbf{xx}})$$

  where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)
• Diagonalizing with the DFT for learning the classifier yields

$$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$$

  a fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
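To see that the Fourier-domain division really solves the ridge system, the sketch below solves (C(k) + lambda I) alpha = y with DFTs only and then checks the residual with an explicit circulant matrix-vector product. It assumes a symmetric first row k (k[m] = k[-m]), which holds for a kernel auto-correlation; the function names and the small test vectors are made up for this example, and the naive DFT is O(n^2), so only an FFT would give the O(n log n) cost quoted above.

```python
import cmath


def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * f * k / n) for k in range(n))
           for f in range(n)]
    return [z / n for z in out] if inverse else out


def solve_circulant_ridge(k, y, lam):
    """Solve (C(k) + lam * I) alpha = y for a symmetric first row k, using only DFTs."""
    kf, yf = dft(k), dft(y)
    # For symmetric real k the spectrum kf is real, so only its real part is used.
    alphaf = [b / (a.real + lam) for a, b in zip(kf, yf)]
    return [z.real for z in dft(alphaf, inverse=True)]


def circulant_matvec(k, v):
    """Multiply the circulant matrix with first row k by the vector v."""
    n = len(k)
    return [sum(k[(j - i) % n] * v[j] for j in range(n)) for i in range(n)]


k = [2.0, 0.5, 0.1, 0.5]   # symmetric: k[1] == k[3]
y = [1.0, 0.0, 0.0, 0.0]
lam = 0.1
alpha = solve_circulant_ridge(k, y, lam)
residual = [kv + lam * a - t for kv, a, t in zip(circulant_matvec(k, alpha), alpha, y)]
print(max(abs(r) for r in residual))
```

The printed residual should be at the level of floating-point rounding, without ever forming or inverting the n x n matrix in the solver.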

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:

$$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$$

• For the Gaussian kernel it yields

$$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$$

  Multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\mathcal{F}^{-1}(\cdot)$ term.
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}}\right)$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)

Training and detection (Matlab):

    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      % sum over the channel dimension (3rd) in the kernel computation
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
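The training and detection steps can be exercised end to end on a 1-D, single-channel signal. The sketch below is a simplified standard-library port, not the slide's code: it uses a naive O(n^2) DFT, drops the channel sum, and places the conjugate on the model term so that the response peak lands at +shift (that indexing convention is a choice of this sketch). The signal, label shape, `sigma` and `lam` values are hypothetical.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * f * k / n) for k in range(n))
           for f in range(n)]
    return [z / n for z in out] if inverse else out


def gaussian_correlation(x1, x2, sigma):
    """Gaussian kernel between x2 and every cyclic shift of x1 (1-D, single channel)."""
    n = len(x1)
    f1, f2 = dft(x1), dft(x2)
    cross = dft([a * b.conjugate() for a, b in zip(f1, f2)], inverse=True)
    energy = sum(a * a for a in x1) + sum(b * b for b in x2)
    return [math.exp(-max(0.0, energy - 2.0 * c.real) / (sigma ** 2 * n)) for c in cross]


def train(x, y, sigma, lam):
    """Ridge regression in the Fourier domain: alpha_hat = y_hat / (k_hat + lam)."""
    kf = dft(gaussian_correlation(x, x, sigma))
    return [yf / (k + lam) for yf, k in zip(dft(y), kf)]


def detect(alphaf, x, z, sigma):
    """Response of the trained filter for every cyclic shift of the new window z."""
    kf = dft(gaussian_correlation(z, x, sigma))
    return [r.real for r in dft([a * k for a, k in zip(alphaf, kf)], inverse=True)]


n = 16
x = [math.exp(-((i - 5) / 2.0) ** 2) + 0.5 * math.exp(-((i - 11) / 1.5) ** 2)
     for i in range(n)]
y = [math.exp(-0.5 * min(i, n - i) ** 2) for i in range(n)]  # label peaked at shift 0
alphaf = train(x, y, sigma=0.5, lam=1e-4)
z = [x[(i - 3) % n] for i in range(n)]                        # target moved by 3 samples
response = detect(alphaf, x, z, sigma=0.5)
print(max(range(n), key=lambda i: response[i]))
```

The index of the maximum response is the estimated displacement of the target between the two windows, which is how the tracker localizes the object in each new frame.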

From KCF to Discriminative CF trackers

Basic
• Henriques et al. - CSK
  - raw grayscale pixel values as features
• Henriques et al. - KCF
  - HoG multi-channel features

Further work
• Danelljan et al. - DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al. - SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows a much larger search region to be used
  - more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? "The zero-order tracker"

Slide credit (VOT slides): Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution

[Figure: timeline with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines]

Results papers:
• ICCV2013: 51 coauthors, 14 pages
• ECCV2014: 57 coauthors, 27 pages
• ICCV2015: 128 coauthors, 24 pages
• ECCV2016: 141 coauthors, 44 pages
• plus two VOT-TIR papers (69 and 70 coauthors)

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

           Perf. measures     Dataset size      Target box   Property    Trackers tested
VOT2013    ranks, A, R        16, s. manual     manual       per frame   27
VOT2014    ranks, A, R, EFO   25, s. manual     manual       per frame   38
VOT2015    EAO, A, R, EFO     60, fully auto    manual       per frame   62 VOT, 24 VOT-TIR
VOT2016    EAO, A, R, EFO     60, fully auto    auto         per frame   70 VOT, 24 VOT-TIR
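In the table above, A and R stand for accuracy and robustness: accuracy is based on the overlap between the predicted and ground-truth regions, and robustness on the number of tracking failures. The sketch below is a deliberately simplified version of that idea, assuming axis-aligned (x, y, width, height) boxes and counting a failure whenever the overlap is zero; the actual VOT protocol also re-initializes the tracker after a failure and is not reproduced here. The function names are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Average overlap over frames with non-zero overlap, plus the count of zero-overlap frames."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


predicted = [(0, 0, 2, 2), (1, 1, 2, 2), (10, 10, 1, 1)]
ground_truth = [(0, 0, 2, 2), (0, 0, 2, 2), (0, 0, 1, 1)]
print(accuracy_and_failures(predicted, ground_truth))
```

Reporting the two numbers separately keeps a tracker that is precise but fails often distinguishable from one that is loose but rarely loses the target.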

Class of trackers tested

• Single-object, single-camera
• Short-term
  - Trackers performing without re-detection
• Causality
  - Tracker is not allowed to use any future frames
• No prior knowledge about the target
  - Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

VOT Automatic Dataset Construction Protocol: cluster + sample

Candidate pool (443 sequences):
• ALOV (315 seq.) [Smeulders et al., 2013]
• PTR (~50 seq.) [Vojir et al., 2013]
• OTB (~50 seq.) [Wu et al., 2013]
• >30 new sequences from the VOT2015 committee

Filtered out:
• Grayscale sequences
• <400 pixels targets
• Poorly-defined targets
• Artificially created sequences

Examples shown on the slide: a poorly defined target; an artificially created sequence

After filtering: 356 sequences

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)

Pipeline: feature encoding (11-dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences

Construction (3/3): Sampling

• Requirement:
  - Diverse visual attributes
  - Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that samples challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF

Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (axis: accuracy) with C-COT, TCNN, SSAT and MLDF labelled]

VOT 2016: Tracking speed

• Top performers are the slowest
  - Plausible cause: CNN
• Real-time bound: Staple+
  - Decent accuracy
  - Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ labelled]

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate and 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

What is here in libraries: BoofCV

• SparseFlow (KLT tracker) (1991)
• MeanShift (1998)
• TLD (2011)
• KCF (2012)

https://github.com/lessthanoptimal/BoofCV

Computer vision librar

Bad news: OpenCV

http://www.votchallenge.net

Reference implementation

But authors often publish their own implementation on GitHub...

git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news

Good news? Not so

Good news: Easy to contribute to open source

Just implement some modern tracking algorithm


Bad news OpenCV

40150httpwwwvotchallengenet

Reference implementation

41150

But authors often publish their own implementation on githubhellip

git clone httpsgithubcommartin-danelljanContinuous-ConvOpgit

Good news

42150

Good news Not so

43150

Good news Not so

44150

Good newsEasy to contribute to open source

Just implement some modern tracking algorithm

45150

So we need to understand how tracking works

Classic

KLT tracker

46150

The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

slide creditTomas Svoboda 50150

Original Lucas-Kanade algorithm II

slide creditTomas Svoboda 51150

Original Lucas-Kanade algorithm III

slide creditTomas Svoboda 52150

Original Lucas-Kanade algorithm IV

slide creditTomas Svoboda 53150

Original Lucas-Kanade summary

slide credit Tomas Svoboda 54150

KLT Tracker

For good conditioning patch must be

texturedstructured enough

bull Uniform patch no information

bull Contour element aperture problem (one dimensional

information)

bull Corners blobs and texture best estimate

[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]

slide credit

Patrick Perez

55150

Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

ndash Translation is usually sufficient for small fragments but

bull Perspective transforms and occlusions cause drift and loss

ndash Two complementary options

bull Kill tracklets when minimum SSD too large

bull Compare as well with initial patch under affine transform (warp) assumption

slide credit

Patrick Perez

58150

Characteristics of KLT

bull cost function sum of squared intensity differences

between template and window

bull optimization technique gradient descent

bull model learning no update last frame convex

combination

bull attractive properties

ndashfast

ndasheasily extended to image-to-image transformations with

multiple parameters

59150

What about deep

learning or

CNN for tracking

60150

61150

AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo

Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016

Recap CaffeNet architecture

63

Recent trackers development

74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$
• For the Gaussian kernel it yields

$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\dfrac{1}{\sigma^{2}}\left(\|\mathbf{x}\|^{2} + \|\mathbf{x}'\|^{2} - 2\,\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$
1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}}\right)$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$
• The kernel allows integration of more complex and multi-channel features

$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$

Multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in this term.
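The training, detection and update steps listed on this slide can be sketched end to end on a single-channel 1-D signal. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Gaussian label vector, the parameter values and the naive DFT are all chosen here for the example, and the kernel follows the slide formula without any extra normalization.

```python
import cmath
import math

def dft(values, inverse=False):
    """Naive O(n^2) DFT standing in for an FFT."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        total = sum(values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                    for t in range(n))
        out.append(total / n if inverse else total)
    return out

def gaussian_kernel_correlation(x1, x2, sigma):
    """k[i] = exp(-(|x1|^2 + |x2|^2 - 2 c[i]) / sigma^2), c = cyclic cross-correlation."""
    x1_hat, x2_hat = dft(x1), dft(x2)
    c = [v.real for v in dft([a.conjugate() * b for a, b in zip(x1_hat, x2_hat)],
                             inverse=True)]
    energy = sum(v * v for v in x1) + sum(v * v for v in x2)
    # max(..., 0) guards against tiny negative distances caused by rounding
    return [math.exp(-max(energy - 2.0 * ci, 0.0) / sigma ** 2) for ci in c]

def train(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat^{xx} + lambda)."""
    k_hat = dft(gaussian_kernel_correlation(x, x, sigma))
    return [yh / (kh + lam) for yh, kh in zip(dft(y), k_hat)]

def detect(alpha_hat, x, z, sigma):
    """f(z) = F^{-1}(k_hat^{xz} * alpha_hat): response for every cyclic shift of z."""
    k_hat = dft(gaussian_kernel_correlation(x, z, sigma))
    return [v.real for v in dft([kh * ah for kh, ah in zip(k_hat, alpha_hat)],
                                inverse=True)]

def interpolate(old, new, rate):
    """Linear interpolation used for the model / classifier update."""
    return [(1.0 - rate) * o + rate * n for o, n in zip(old, new)]

n = 16
x = [0.0, 0.0, 1.0, 3.0, 7.0, 3.0, 1.0, 0.0, 0.0, 2.0, 0.0, 0.0, 5.0, 0.0, 0.0, 0.0]
y = [math.exp(-0.5 * (min(i, n - i) / 1.5) ** 2) for i in range(n)]  # peak at shift 0
alpha_hat = train(x, y, sigma=5.0, lam=1e-3)

shift = 3
z = x[-shift:] + x[:-shift]                      # target moved 3 samples to the right
response = detect(alpha_hat, x, z, sigma=5.0)
peak = max(range(n), key=lambda i: response[i])  # estimated displacement
recentred = z[peak:] + z[:peak]                  # patch re-cropped at the new location
x_model = interpolate(x, recentred, rate=0.02)
alpha_hat = interpolate(alpha_hat, train(recentred, y, sigma=5.0, lam=1e-3), rate=0.02)
print(peak)
```

The location of the response maximum recovers the displacement of the target, and the last three lines show the interpolated update of both the model and the classifier.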

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)

Training and detection (Matlab)

function alphaf = train(x, y, sigma, lambda)
  % Kernel ridge regression solved element-wise in the Fourier domain
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);
end

function y = detect(alphaf, x, z, sigma)
  % Response map over all cyclic shifts of the test patch z
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));
end

function k = kernel_correlation(x1, x2, sigma)
  % Gaussian kernel correlation; sum over the channel dimension (3)
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end

From KCF to Discriminative CF trackers

Basic

• Henriques et al. – CSK
– raw grayscale pixel values as features
• Henriques et al. – KCF
– HoG multi-channel features

Further work

• Danelljan et al. – DSST
– PCA-HoG + grayscale pixel features
– filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
– HoG, color-naming and grayscale pixel features
– quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. – SRDCF
– spatial regularization in the learning process
→ limits the boundary effect
→ penalizes filter coefficients depending on their spatial location
– allows a much larger search region to be used
– more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
– features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
– for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al. – MDNet, T-CNN
– CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? “The zero-order tracker”

Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution

• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• + two VOT-TIR papers (69 and 70 coauthors)

[Figure: timeline marking the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines; vertical axis ticks at 1500 and 3000]

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

          Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013   ranks, A, R        16, s. manual    manual       per frame   27
VOT2014   ranks, A, R, EFO   25, s. manual    manual       per frame   38
VOT2015   EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016   EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR
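The A and R columns refer to accuracy and robustness. As a rough sketch of what they measure, assuming (as in the VOT methodology) that accuracy is the average bounding-box overlap while the target is tracked and robustness counts failures, the code below computes both for a short sequence. It is a simplification: a frame with zero overlap is simply counted as a failure, and the re-initialization protocol of the real toolkit is omitted. Function names and box values are illustrative.

```python
def overlap(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over tracked frames, and the number of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures

ground_truth = [(10, 10, 20, 20), (12, 10, 20, 20), (14, 10, 20, 20)]
predicted = [(10, 10, 20, 20), (17, 10, 20, 20), (60, 60, 20, 20)]
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
print(accuracy, failures)
```

Keeping the two numbers separate is the point of the A/R pair: a tracker can be precise while it tracks and still fail often, or the other way round.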

Class of trackers tested

• Single-object, single-camera
• Short-term
– Trackers performing without re-detection
• Causality
– Tracker is not allowed to use any future frames
• No prior knowledge about the target
– Only a single training example – BBox in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

• Candidate pool: ALOV (315 seq.) [Smeulders et al. 2013] + PTR (~50 seq.) [Vojir et al. 2013] + OTB (~50 seq.) [Wu et al. 2013] + >30 new sequences from the VOT2015 committee = 443 sequences
• Filtered out
– Grayscale sequences
– Targets of <400 pixels
– Poorly-defined targets
– Artificially created sequences
• 356 sequences remain
• VOT Automatic Dataset Construction Protocol: cluster + sample

[Figure: examples of a poorly defined target and of an artificially created sequence]

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Feature encoding: 11-dimensional vector per sequence
• Cluster similar sequences into K = 28 groups with Affinity Propagation [Frey, Dueck 2007] (automatic selection of K)

Construction (3/3): Sampling

• Requirements
– Diverse visual attributes
– Challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement
1. Segment the target (semi-automatic)
2. Automatically fit a bounding box by optimizing a cost function

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty – better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
• Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (vertical axis: accuracy)]

VOT 2016: Tracking speed

• Top performers are the slowest
• Plausible cause: CNN
• Real-time bound: Staple+
• Decent accuracy
• Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers
– target learning (modelling, “template update”)
– integration of detection and temporal smoothness assumptions
– representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: a significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

Reference implementation

But authors often publish their own implementation on GitHub…

git clone https://github.com/martin-danelljan/Continuous-ConvOp.git

Good news

Good news? Not so.

Good news: easy to contribute to open source

Just implement some modern tracking algorithm

So we need to understand how tracking works

Classic: KLT tracker

The KLT Tracker

Slide credits: Kris Kitani, Tomas Svoboda

Importance in Computer Vision

• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU was dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms

(1) http://www.ces.clemson.edu/~stb/klt
(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment

The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.

Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm I–IV

Original Lucas-Kanade summary

Slide credit: Tomas Svoboda

KLT Tracker

For good conditioning, the patch must be textured/structured enough

• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]

Slide credit: Patrick Perez
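The three cases on this slide can be told apart by the eigenvalues of the 2x2 matrix of summed gradient products, which is the matrix that must be inverted in the Lucas-Kanade update. The sketch below is an illustration under simple assumptions (central-difference gradients, tiny synthetic 6x6 patches, a hypothetical function name); the smaller eigenvalue is the quantity used as a "good feature" score in the Tomasi and Shi paper cited above.

```python
import math

def structure_tensor_eigenvalues(patch):
    """Eigenvalues (small, large) of the gradient matrix summed over the patch interior."""
    rows, cols = len(patch), len(patch[0])
    a = b = c = 0.0
    for r in range(1, rows - 1):
        for col in range(1, cols - 1):
            ix = (patch[r][col + 1] - patch[r][col - 1]) / 2.0  # horizontal gradient
            iy = (patch[r + 1][col] - patch[r - 1][col]) / 2.0  # vertical gradient
            a += ix * ix
            b += ix * iy
            c += iy * iy
    # Closed-form eigenvalues of the symmetric matrix [[a, b], [b, c]]
    mean = (a + c) / 2.0
    spread = math.sqrt(((a - c) / 2.0) ** 2 + b * b)
    return mean - spread, mean + spread

size = 6
uniform = [[1.0] * size for _ in range(size)]
edge = [[1.0 if col >= 3 else 0.0 for col in range(size)] for _ in range(size)]
corner = [[1.0 if (r >= 3 and col >= 3) else 0.0 for col in range(size)]
          for r in range(size)]

for name, patch in (("uniform", uniform), ("edge", edge), ("corner", corner)):
    print(name, structure_tensor_eigenvalues(patch))
```

The uniform patch gives two zero eigenvalues, the edge gives one zero and one large eigenvalue (motion along the edge is unobservable, which is the aperture problem), and only the corner makes both eigenvalues clearly positive.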

Aperture Problem Example

Image from Gary Bradski's slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

– Arbitrary displacement
• Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

– Translation is usually sufficient for small fragments, but
• Perspective transforms and occlusions cause drift and loss
– Two complementary options
• Kill tracklets when the minimum SSD is too large
• Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez
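The two checks can be combined into a simple keep-or-kill test. The sketch below is a hedged illustration: the function names and thresholds are made up for the example, and it assumes the patches passed in have already been resampled to a common size (for the comparison with the initial patch, that means after applying whatever affine warp has been estimated, which is not shown here).

```python
def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches."""
    return sum((p - q) ** 2
               for row_a, row_b in zip(patch_a, patch_b)
               for p, q in zip(row_a, row_b))

def tracklet_alive(initial_patch, previous_patch, current_patch,
                   max_step_ssd, max_drift_ssd):
    """Keep a tracklet only if it matches both the previous and the initial patch."""
    step_ok = ssd(previous_patch, current_patch) <= max_step_ssd    # frame-to-frame check
    drift_ok = ssd(initial_patch, current_patch) <= max_drift_ssd   # long-term drift check
    return step_ok and drift_ok

initial = [[10.0, 10.0], [10.0, 10.0]]
previous = [[12.0, 12.0], [12.0, 12.0]]
current = [[13.0, 13.0], [13.0, 13.0]]

print(tracklet_alive(initial, previous, current, max_step_ssd=10.0, max_drift_ssd=30.0))
print(tracklet_alive(initial, previous, current, max_step_ssd=10.0, max_drift_ssd=50.0))
```

In this toy case every frame-to-frame change is small, so only the comparison with the initial patch can reveal the slow drift, which is why the two options are described as complementary.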

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties
– fast
– easily extended to image-to-image transformations with multiple parameters
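A one-dimensional, translation-only version shows how the SSD cost is minimized in practice. This sketch uses the Gauss-Newton update that is standard for Lucas-Kanade rather than plain gradient descent; the signal, window position, interpolation scheme and function names are assumptions made for the example.

```python
import math

def sample(signal, pos):
    """Linearly interpolated value of the signal, clamped at the borders."""
    pos = min(max(pos, 0.0), len(signal) - 1.0)
    i = min(int(pos), len(signal) - 2)
    t = pos - i
    return (1.0 - t) * signal[i] + t * signal[i + 1]

def lucas_kanade_1d(template, image, start, p=0.0, iters=50):
    """Estimate the shift p such that image[start + x + p] matches template[x]."""
    for _ in range(iters):
        num = den = 0.0
        for x in range(len(template)):
            pos = start + x + p
            grad = sample(image, pos + 0.5) - sample(image, pos - 0.5)  # image gradient
            err = template[x] - sample(image, pos)                      # SSD residual
            num += grad * err
            den += grad * grad
        if den < 1e-12:
            break            # uniform patch: no gradient, no information
        step = num / den     # Gauss-Newton step for a pure translation
        p += step
        if abs(step) < 1e-6:
            break
    return p

def bump(u):
    return math.exp(-(u / 4.0) ** 2)

image = [bump(i - 20.7) for i in range(41)]             # structure centred at 20.7
template = [bump(10 + x - 19.2) for x in range(21)]     # same structure centred at 19.2
estimated_shift = lucas_kanade_1d(template, image, start=10)
print(round(estimated_shift, 2))
```

The true displacement in this synthetic setup is 1.5 samples. The early exit for a zero denominator is the 1-D counterpart of the conditioning requirement: a flat window provides nothing to align.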

What about deep learning or CNN for tracking?

Recap: CaffeNet architecture

AlexNet (original): Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, 2012
CaffeNet: Jia et al., “Caffe: Convolutional Architecture for Fast Feature Embedding”, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”
Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016



74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic

bull Henriques et al ndash CSK

ndash raw grayscale pixel values as features

bull Henriques et al ndash KCF

ndash HoG multi-channel features

Further work

bull Danelljan et al ndash DSST

ndash PCA-HoG + grayscale pixels features

ndash filters for translation and for scale (in the scale-space pyramid)

bull Li et al ndash SAMF

ndash HoG color-naming and grayscale pixels features

ndash quantize scale space and normalize each scale to one size by bilinear

inter rarr only one filter on normalized size

80

Discriminative Correlation Filters Trackers

bull Danelljan et al ndashSRDCF

ndash spatial regularization in the learning process

rarr limits boundary effect

rarr penalize filter coefficients depending on their spatial location

ndash allows to use much larger search region

ndash more discriminative to background (more training data)

CNN-based Correlation Trackers

bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT

2016)

bull Ma et al

ndash features VGG-Net pretrained on ImageNet dataset extracted from

third fourth and fifth convolution layer

ndash for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)

bull Nam et al ndash MDNet T-CNN

ndash CNN classification (3 convolution layers and 2 fully connected layers)

81

Evaluation of Trackers

Tracking Which methods work

83150

Tracking Which methods work

84150

What works ldquoThe zero-order trackerrdquo

85150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

3000

1500

51 Coauthors 14pgs

ICCV2013

57 Coauthors 27pgs

ECCV2014

128 Coauthors 24pgs

ICCV2015

141 Coauthors 44pgs

ECCV2016

+ VOT-TIRpaper

(69 coauth)

+ VOT-TIRpaper

(70 coauth)

VOT2014submission deadline

VOT2015submission deadline

VOT2016submission deadline

VOT2013submission deadline

8639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT challenge evolution

bull Gradual increase of dataset size

bull Gradual refinement of dataset construction

bull Gradual refinement of performance measures

bull Gradual increase of tested trackers

Perf Measures Dataset size Target box Property Trackers tested

VOT2013 ranks A R 16 s manual manual per frame 27

VOT2014 ranks A R EFO 25 s manual manual per frame 38

VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR

VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR

Class of trackers tested

• Single-object, single-camera
• Short-term
  - Trackers performing without re-detection
• Causality
  - Tracker is not allowed to use any future frames
• No prior knowledge about the target
  - Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

Candidate pool:
• ALOV (315 seq.) [Smeulders et al., 2013]
• + PTR (~50 seq.) [Vojir et al., 2013]
• + OTB (~50 seq.) [Wu et al., 2013]
• + >30 new sequences from the VOT2015 committee
= 443 sequences

Filtered out:
• Grayscale sequences
• Targets of <400 pixels
• Poorly defined targets
• Artificially created sequences
(slide examples: a poorly defined target, an artificially created sequence)

→ 356 sequences → VOT Automatic Dataset Construction Protocol: cluster + sample

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)

Pipeline: feature encoding (11-dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences

Construction (3/3): Sampling

• Requirement:
  - Diverse visual attributes
  - Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
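Step 2 can be illustrated with the simplest possible fit: the tightest axis-aligned box around the segmented pixels. This is only a sketch. The slides do not state the cost function that was actually optimized, and the mask format (a list of rows of 0/1 values) and the name `tight_bounding_box` are assumptions made for illustration.

```python
def tight_bounding_box(mask):
    """Return (x, y, w, h) of the smallest axis-aligned box covering all
    nonzero pixels of a binary mask, or None if the mask is empty."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for row in mask for c, value in enumerate(row) if value]
    if not rows:
        return None
    x0, x1 = min(cols), max(cols)
    y0, y1 = rows[0], rows[-1]
    return (x0, y0, x1 - x0 + 1, y1 - y0 + 1)


# Toy 4x4 segmentation: three target pixels in the centre.
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
box = tight_bounding_box(mask)
```

A tight box includes every target pixel but may also contain many background pixels, which is one reason a cost-based fit can be preferable for thin or irregular targets.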

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty - better ground truth:
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF

Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

Figures: overlap curves; AR-raw plot (vertical axis: accuracy) with C-COT, TCNN, SSAT and MLDF marked

VOT 2016: Tracking speed

• Top performers are the slowest
  - Plausible cause: CNN
• Real-time bound: Staple+
  - Decent accuracy
  - Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it also measured the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

Figure labels: C-COT, TCNN, SSAT, MLDF, Staple+

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception

Department of Cybernetics

Faculty of Electrical Engineering

Czech Technical University Prague

established 1707

Area of Expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

Good news Not so

Good news: Easy to contribute to open source

Just implement some modern tracking algorithm

So we need to understand how tracking works

Classic KLT tracker

The KLT Tracker

Slide credit: Kris Kitani

KLT Tracker

Slide credit: Tomas Svoboda

Importance in Computer Vision

• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [5, 4]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU was dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms

(1) http://www.ces.clemson.edu/~stb/klt/
(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment

Slide credit: Tomas Svoboda

The goal is to align a template image $T(\mathbf{x})$ to an input image $I(\mathbf{x})$, where $\mathbf{x}$ is a column vector containing the image coordinates $[x, y]^\top$. $I(\mathbf{x})$ may also be a small subwindow within an image.
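The alignment goal can be sketched for the simplest case: a 1-D signal and a pure translation p, estimated by repeatedly linearizing the image around the current estimate (a Gauss-Newton update on the sum of squared differences). This is an illustrative reduction, not the full 2-D algorithm from the slides; the names `sample` and `lucas_kanade_1d`, the linear interpolation and the central-difference gradient are assumptions chosen to keep the example short.

```python
import math


def sample(signal, x):
    """Linearly interpolate a 1-D signal at a real-valued position x."""
    i = int(math.floor(x))
    i = max(0, min(len(signal) - 2, i))
    t = x - i
    return (1.0 - t) * signal[i] + t * signal[i + 1]


def lucas_kanade_1d(image, template, start, p=0.0, iterations=20, tol=1e-6):
    """Estimate the shift p such that image[start + x + p] matches template[x]."""
    for _ in range(iterations):
        num = 0.0
        den = 0.0
        for x, t_val in enumerate(template):
            pos = start + x + p
            warped = sample(image, pos)
            # Central-difference image gradient at the warped position.
            grad = 0.5 * (sample(image, pos + 1.0) - sample(image, pos - 1.0))
            num += grad * (t_val - warped)
            den += grad * grad
        if den == 0.0:
            break  # no gradient in the window: the shift cannot be estimated
        dp = num / den
        p += dp
        if abs(dp) < tol:
            break
    return p


# Toy data: a smooth bump, and a template cut out 3 samples to the right of `start`.
image = [math.exp(-((x - 30) ** 2) / (2 * 6.0 ** 2)) for x in range(64)]
start, true_shift = 20, 3
template = [image[start + x + true_shift] for x in range(20)]
estimate = lucas_kanade_1d(image, template, start)
```

For this toy input the estimate should converge close to the true shift of 3 samples. On a constant signal the denominator is zero and the function returns the initial guess unchanged.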

Original Lucas-Kanade algorithm I

Original Lucas-Kanade algorithm II

Original Lucas-Kanade algorithm III

Original Lucas-Kanade algorithm IV

Original Lucas-Kanade summary

Slide credit: Tomas Svoboda

KLT Tracker

For good conditioning, the patch must be textured/structured enough:

• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
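One common way to quantify "textured enough" is the smaller eigenvalue of the 2x2 matrix of summed gradient products over the patch: it is zero for a uniform patch and for a straight edge, and positive for a corner. The sketch below assumes a grayscale patch stored as a list of rows and uses central differences on interior pixels; the name `min_eigenvalue` is hypothetical.

```python
import math


def min_eigenvalue(patch):
    """Smaller eigenvalue of the 2x2 gradient matrix summed over a grayscale patch."""
    h, w = len(patch), len(patch[0])
    gxx = gxy = gyy = 0.0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            ix = 0.5 * (patch[r][c + 1] - patch[r][c - 1])
            iy = 0.5 * (patch[r + 1][c] - patch[r - 1][c])
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
    # Closed-form eigenvalues of the symmetric matrix [[gxx, gxy], [gxy, gyy]].
    trace = gxx + gyy
    disc = math.sqrt((gxx - gyy) ** 2 + 4.0 * gxy * gxy)
    return 0.5 * (trace - disc)


uniform = [[5.0] * 7 for _ in range(7)]
edge = [[1.0 if c >= 4 else 0.0 for c in range(7)] for _ in range(7)]
corner = [[1.0 if (r >= 3 and c >= 3) else 0.0 for c in range(7)] for r in range(7)]
scores = [min_eigenvalue(p) for p in (uniform, edge, corner)]
```

Only the corner patch gets a positive score, which matches the three cases listed above: the edge patch has gradients in one direction only, so the matrix is rank deficient.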

Slide credit: Patrick Perez

Aperture Problem Example

Image from Gary Bradski's slides

If we look through a small hole...

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

- Translation is usually sufficient for small fragments, but
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption
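The first option can be sketched as a simple check on the sum of squared differences (SSD) between the stored template and the patch at the tracked position. The per-pixel normalisation, the threshold value and the names `ssd` and `keep_tracklet` are assumptions for illustration; a suitable threshold depends on the image noise and the patch size.

```python
def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches."""
    return sum(
        (a - b) ** 2
        for row_a, row_b in zip(patch_a, patch_b)
        for a, b in zip(row_a, row_b)
    )


def keep_tracklet(template, current, max_ssd_per_pixel=100.0):
    """Return False when the tracked patch no longer resembles the template."""
    n_pixels = len(template) * len(template[0])
    return ssd(template, current) / n_pixels <= max_ssd_per_pixel


template = [[10, 10], [10, 10]]
similar = [[12, 9], [10, 11]]
occluded = [[30, 30], [30, 30]]
decisions = (keep_tracklet(template, similar), keep_tracklet(template, occluded))
```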

Slide credit: Patrick Perez

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters

What about deep learning or CNN for tracking?

Recap: CaffeNet architecture

AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012
CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016

Recent trackers development

https://github.com/foolwood/benchmark_results

Figure labels: CNN, KCF

Discriminative Tracking (T by Detection)

• t=0: samples with labels (+1 +1 +1 -1 -1 -1) → train the classifier
• t>0: ... classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

"Cross-correlation is equivalent to an element-wise product in the Fourier domain"

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$

• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
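The theorem can be checked numerically on a short 1-D signal. The sketch below compares a direct cyclic cross-correlation with the Fourier-domain product. It uses a naive O(n^2) DFT written with the standard library so that it is self-contained; a real tracker would use an FFT. The function names are hypothetical.

```python
import cmath


def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(v)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def cyclic_correlation(x, w):
    """Direct cyclic cross-correlation: y[i] = sum_t x[t] * w[(t + i) % n]."""
    n = len(x)
    return [sum(x[t] * w[(t + i) % n] for t in range(n)) for i in range(n)]


def cyclic_correlation_dft(x, w):
    """Same result via the Fourier domain: y_hat = conj(x_hat) * w_hat."""
    x_hat, w_hat = dft(x), dft(w)
    y_hat = [a.conjugate() * b for a, b in zip(x_hat, w_hat)]
    return [value.real for value in dft(y_hat, inverse=True)]


x = [1.0, 2.0, 3.0, 4.0]
w = [0.0, 1.0, 0.5, 0.0]
direct = cyclic_correlation(x, w)
fourier = cyclic_correlation_dft(x, w)
```

Both lists agree up to floating-point error. The modulo in `cyclic_correlation` is the wrap-around mentioned in the note above.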

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows replacing standard operations with fast Fourier operations
• The same idea can be applied e.g. to Kernel Ridge Regression, with kernel matrix $K$, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and the dual space representation

  $\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$

• For many kernels: circulant data $\Rightarrow$ circulant $K$ matrix

  $K = C(\mathbf{k}^{\mathbf{x}\mathbf{x}})$, where $\mathbf{k}^{\mathbf{x}\mathbf{x}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields

  $\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{x}\mathbf{x}} + \lambda}$

  Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
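The Fourier-domain solution can be verified against the explicit circulant system on a tiny example. The sketch below assumes a linear kernel, for which the kernel auto-correlation is the plain cyclic auto-correlation of x, and checks that the resulting coefficients satisfy (K + λI)α = y. A naive DFT is used to keep the example dependency-free, so it does not show the O(n log n) speed; all function names are hypothetical.

```python
import cmath


def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(v)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def linear_kernel_autocorrelation(x):
    """k_xx[m] = <x, x cyclically shifted by m> (linear kernel)."""
    n = len(x)
    return [sum(x[t] * x[(t + m) % n] for t in range(n)) for m in range(n)]


def train_fourier(x, y, lam):
    """alpha_hat = y_hat / (k_hat + lambda), returned in the signal domain."""
    k_hat = dft(linear_kernel_autocorrelation(x))
    y_hat = dft(y)
    alpha_hat = [yh / (kh + lam) for yh, kh in zip(y_hat, k_hat)]
    return [value.real for value in dft(alpha_hat, inverse=True)]


def apply_system(x, alpha, lam):
    """Compute (K + lambda*I) alpha with the explicit circulant matrix K = C(k_xx)."""
    k = linear_kernel_autocorrelation(x)
    n = len(x)
    return [
        sum(k[(j - i) % n] * alpha[j] for j in range(n)) + lam * alpha[i]
        for i in range(n)
    ]


x = [1.0, 2.0, 0.5, -1.0, 3.0]
y = [1.0, 0.5, 0.0, 0.0, 0.5]
alpha = train_fourier(x, y, lam=0.1)
residual = [a - b for a, b in zip(apply_system(x, alpha, lam=0.1), y)]
```

The residual is zero up to rounding, so the element-wise division in the Fourier domain solves the same linear system as the matrix inverse.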

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{x}\mathbf{x}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:

  $k_i^{\mathbf{x}\mathbf{x}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$

• For the Gaussian kernel it yields

  $\mathbf{k}^{\mathbf{x}\mathbf{x}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$

  (multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the last term)

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{x}\mathbf{z}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{x}\mathbf{z}} \odot \hat{\boldsymbol{\alpha}})$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)

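Training and detection (Matlab); note the sum over the channel dimension in the kernel computation:

    function alphaf = train(x, y, sigma, lambda)
      % Kernel auto-correlation of the training patch, then ridge regression in the Fourier domain.
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      % Response map over all cyclic shifts of the test patch z.
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      % Cross-correlation in the Fourier domain, summed over the channel dimension (3).
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end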
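The same three functions can be sketched in Python for a single-channel 1-D signal, which makes it easy to see that the response peak follows a cyclic shift of the input. This is an illustrative port under simplifying assumptions (1-D, one channel, naive DFT instead of an FFT, no model update); the toy signal, the Gaussian label shape and the parameter values are chosen only for the example.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive discrete Fourier transform (or its inverse) of a sequence."""
    n = len(v)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        s = sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(s / n if inverse else s)
    return out


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x1 and every cyclic shift of x2."""
    n = len(x1)
    x1_hat, x2_hat = dft(x1), dft(x2)
    cross = dft([a.conjugate() * b for a, b in zip(x1_hat, x2_hat)], inverse=True)
    energy = sum(a * a for a in x1) + sum(b * b for b in x2)
    return [math.exp(-abs(energy - 2.0 * c.real) / (sigma ** 2 * n)) for c in cross]


def train(x, y, sigma, lam):
    """Ridge regression coefficients in the Fourier domain."""
    k_hat = dft(kernel_correlation(x, x, sigma))
    return [yh / (kh + lam) for yh, kh in zip(dft(y), k_hat)]


def detect(alphaf, x, z, sigma):
    """Response for all cyclic shifts of the test signal z."""
    k_hat = dft(kernel_correlation(z, x, sigma))
    return [v.real for v in dft([a * k for a, k in zip(alphaf, k_hat)], inverse=True)]


n, shift = 32, 5
# Toy target: two bumps of different height, so the signal has no symmetry.
x = [math.exp(-0.5 * ((t - 10) / 3.0) ** 2) + 0.5 * math.exp(-0.5 * ((t - 22) / 2.0) ** 2)
     for t in range(n)]
# Desired response: a cyclic Gaussian peaked at zero displacement.
y = [math.exp(-0.5 * (min(t, n - t) / 2.0) ** 2) for t in range(n)]
# Test signal: the target cyclically shifted by `shift` samples.
z = [x[(t + shift) % n] for t in range(n)]

alphaf = train(x, y, sigma=0.5, lam=1e-4)
response = detect(alphaf, x, z, sigma=0.5)
peak = response.index(max(response))
```

With this shift convention the index of the maximum response coincides with `shift`. In a 2-D tracker the peak location is converted to a displacement of the bounding box in the same way.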

From KCF to Discriminative CF trackers

Basic

• Henriques et al. - CSK
  - raw grayscale pixel values as features
• Henriques et al. - KCF
  - HoG multi-channel features

Further work

• Danelljan et al. - DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al. - SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size


Good news Not so

44150

Good newsEasy to contribute to open source

Just implement some modern tracking algorithm

45150

So we need to understand how tracking works

Classic

KLT tracker

46150

The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

slide creditTomas Svoboda 50150

Original Lucas-Kanade algorithm II

slide creditTomas Svoboda 51150

Original Lucas-Kanade algorithm III

slide creditTomas Svoboda 52150

Original Lucas-Kanade algorithm IV

slide creditTomas Svoboda 53150

Original Lucas-Kanade summary

slide credit Tomas Svoboda 54150

KLT Tracker

For good conditioning patch must be

texturedstructured enough

bull Uniform patch no information

bull Contour element aperture problem (one dimensional

information)

bull Corners blobs and texture best estimate

[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]

slide credit

Patrick Perez

55150

Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

ndash Translation is usually sufficient for small fragments but

bull Perspective transforms and occlusions cause drift and loss

ndash Two complementary options

bull Kill tracklets when minimum SSD too large

bull Compare as well with initial patch under affine transform (warp) assumption

slide credit

Patrick Perez

58150

Characteristics of KLT

bull cost function sum of squared intensity differences

between template and window

bull optimization technique gradient descent

bull model learning no update last frame convex

combination

bull attractive properties

ndashfast

ndasheasily extended to image-to-image transformations with

multiple parameters

59150

What about deep

learning or

CNN for tracking

60150

61150

AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo

Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016

Recap CaffeNet architecture

63

Recent trackers development

74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic

bull Henriques et al ndash CSK

ndash raw grayscale pixel values as features

bull Henriques et al ndash KCF

ndash HoG multi-channel features

Further work

bull Danelljan et al ndash DSST

ndash PCA-HoG + grayscale pixels features

ndash filters for translation and for scale (in the scale-space pyramid)

bull Li et al ndash SAMF

ndash HoG color-naming and grayscale pixels features

ndash quantize scale space and normalize each scale to one size by bilinear

inter rarr only one filter on normalized size

80

Discriminative Correlation Filters Trackers

bull Danelljan et al ndashSRDCF

ndash spatial regularization in the learning process

rarr limits boundary effect

rarr penalize filter coefficients depending on their spatial location

ndash allows to use much larger search region

ndash more discriminative to background (more training data)

CNN-based Correlation Trackers

bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT

2016)

bull Ma et al

ndash features VGG-Net pretrained on ImageNet dataset extracted from

third fourth and fifth convolution layer

ndash for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)

bull Nam et al ndash MDNet T-CNN

ndash CNN classification (3 convolution layers and 2 fully connected layers)

81

Evaluation of Trackers

Tracking Which methods work

83150

Tracking Which methods work

84150

What works ldquoThe zero-order trackerrdquo

85150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

3000

1500

51 Coauthors 14pgs

ICCV2013

57 Coauthors 27pgs

ECCV2014

128 Coauthors 24pgs

ICCV2015

141 Coauthors 44pgs

ECCV2016

+ VOT-TIRpaper

(69 coauth)

+ VOT-TIRpaper

(70 coauth)

VOT2014submission deadline

VOT2015submission deadline

VOT2016submission deadline

VOT2013submission deadline

8639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT challenge evolution

bull Gradual increase of dataset size

bull Gradual refinement of dataset construction

bull Gradual refinement of performance measures

bull Gradual increase of tested trackers

Perf Measures Dataset size Target box Property Trackers tested

VOT2013 ranks A R 16 s manual manual per frame 27

VOT2014 ranks A R EFO 25 s manual manual per frame 38

VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR

VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR

8739

Class of trackers tested

bull Single-object single-camera

bull Short-term

ndashTrackers performing without re-detection

bull Causality

ndashTracker is not allowed to use any future frames

bull No prior knowledge about the target

ndashOnly a single training example ndash BBox in the first frame

bull Object state encoded by a bounding box

88150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (13) Sequence candidates

ALOV (315 seq)[Smeulders et al2013]

Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences

Example Poorly defined target Example Artificially created

356 sequences

PTR (~50 seq)[Vojir et al2013]

+OTB (~50 seq)

[Wu et al2013]

+

gt30 new sequencesfrom VOT2015

committee

+

443sequences

VOT Automatic Dataset Construction Protocol

cluster + sample

8939

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (23) Clustering

bull Approximately annotate targets

bull 11 global attributes estimated

automatically for 356 sequences (eg blur camera motion object motion)

bull Cluster into K = 28 groups (automatic selection of K)

Feature encoding

11 dim

Affinity Propagation [Frey Dueck 2007]

Cluster similar sequences

9039

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (33) Sampling

bull Requirement

bull Diverse visual attributes

bull Challenging subset

bull Global visual attributes computed

bull Tracking difficulty attribute Applied FoT ASMS KCF trackers

bull Developed a sampling strategy that sampled

challenging sequences while keeping the global

attributes diverse

9139

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT201516 dataset 60 sequences

9239

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Object annotation

bull Automatic bounding box placement

1 Segment the target (semi-automatic)

2 Automatically fit a bounding box by optimizing a cost function

9339

Kristan et al VOT2016 results

Sequence ranking

bull Among the most challenging sequences

bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)

9442

Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)

Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT Results Realtime

bull Flow-based Mean Shift-based Correlation filters

bull Engineering use of basic features

2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)

2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)

ASMSBDF

FoT

2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)

PLTFoT

CCMS

9639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Results

bull C-COT slightly ahead of TCNN

bull Most accurate SSAT

bull Most robust C-COT and MLDF

Overlap curves

9742

(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple

C-COT

TCNNSSAT

MLDF

AR-raw plot

Acc

ura

cy

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Tracking speed

bull Top-performers slowest bull Plausible cause CNN

bull Real-time bound Staple+

bull Decent accuracy

bull Decent robustness

Note the speed in some Matlab trackers has been significantly underestimated by the toolkit since it was measuring also the Matlabrestart time The EFOs of Matlab trackers are in fact higher than stated in this figure

9842

C-COT TCNNSSAT MLDF

Staple+

Staple+

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

So we need to understand how tracking works

Classic KLT tracker

The KLT Tracker

Slide credit: Kris Kitani

KLT Tracker

Slide credit: Tomas Svoboda

Importance in Computer Vision

• First published in 1981 as an image registration method [3]
• Improved many times, most importantly by Carlo Tomasi [54]
• Free implementation(s) available (1)
• After more than two decades, a project (2) at CMU dedicated to this single algorithm, with results published in a premium journal [1]
• Part of a plethora of computer vision algorithms

(1) http://www.ces.clemson.edu/~stb/klt/
(2) http://www.ri.cmu.edu/projects/project_515.html

Image alignment

Slide credit: Tomas Svoboda

The goal is to align a template image T(x) to an input image I(x), where x is a column vector containing the image coordinates [x, y]. I(x) could also be a small subwindow within an image.
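
The alignment goal can be illustrated with a minimal 1-D, translation-only sketch. It estimates a shift p that minimises the sum of squared differences between I(x + p) and T(x) by repeated linearisation (a Gauss-Newton step). The function names, the Gaussian test signal and the use of linear interpolation are assumptions made for this example, not part of the original slides.

```python
import math

def sample(signal, x):
    """Linearly interpolate a 1-D signal at real-valued position x (clamped)."""
    x = min(max(x, 0.0), len(signal) - 1.0)
    i = min(int(x), len(signal) - 2)
    t = x - i
    return (1 - t) * signal[i] + t * signal[i + 1]

def lk_translation(template, image, xs, p=0.0, iters=20, tol=1e-6):
    """Estimate the shift p minimising sum_x (image(x + p) - template(x))^2."""
    for _ in range(iters):
        num = den = 0.0
        for x, t_val in zip(xs, template):
            warped = sample(image, x + p)
            # central-difference gradient of the warped image
            grad = 0.5 * (sample(image, x + p + 1) - sample(image, x + p - 1))
            num += grad * (t_val - warped)
            den += grad * grad
        if den == 0.0:
            break  # no gradient in the window: the shift is unobservable
        dp = num / den
        p += dp
        if abs(dp) < tol:
            break
    return p

def bump(x):
    """Smooth test signal: a Gaussian bump centred at x = 20."""
    return math.exp(-((x - 20.0) ** 2) / (2 * 4.0 ** 2))

image = [bump(i) for i in range(41)]
true_shift = 1.5
xs = list(range(10, 31))                        # template window
template = [bump(x + true_shift) for x in xs]   # T(x) = I(x + true_shift)
p_hat = lk_translation(template, image, xs)
print(round(p_hat, 2))
```

The estimate should land close to the true sub-pixel shift. In 2-D the scalar `den` becomes a 2x2 matrix, which is where the conditioning of the patch starts to matter.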

Original Lucas-Kanade algorithm I

Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm II

Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm III

Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm IV

Slide credit: Tomas Svoboda

Original Lucas-Kanade summary

Slide credit: Tomas Svoboda

KLT Tracker

For good conditioning, the patch must be textured/structured enough:

• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]
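
One common way to make the three cases concrete is to look at the eigenvalues of the 2x2 matrix of summed gradient products over the patch. The sketch below assumes central-difference gradients and small synthetic patches; the helper names are hypothetical.

```python
import math

def structure_tensor(patch):
    """Sum of gradient outer products over the patch interior (central differences)."""
    h, w = len(patch), len(patch[0])
    gxx = gxy = gyy = 0.0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            gx = 0.5 * (patch[r][c + 1] - patch[r][c - 1])
            gy = 0.5 * (patch[r + 1][c] - patch[r - 1][c])
            gxx += gx * gx
            gxy += gx * gy
            gyy += gy * gy
    return gxx, gxy, gyy

def eigenvalues(gxx, gxy, gyy):
    """Eigenvalues (small, large) of the symmetric matrix [[gxx, gxy], [gxy, gyy]]."""
    mean = 0.5 * (gxx + gyy)
    dev = math.sqrt((0.5 * (gxx - gyy)) ** 2 + gxy * gxy)
    return mean - dev, mean + dev

n = 8
uniform = [[1.0] * n for _ in range(n)]
edge = [[1.0 if c >= n // 2 else 0.0 for c in range(n)] for r in range(n)]
corner = [[1.0 if (r >= n // 2 and c >= n // 2) else 0.0 for c in range(n)]
          for r in range(n)]

for name, patch in [("uniform", uniform), ("edge", edge), ("corner", corner)]:
    print(name, eigenvalues(*structure_tensor(patch)))
```

Both eigenvalues vanish for the uniform patch. The edge has only one non-zero eigenvalue, which is the aperture problem. Only the corner has two clearly positive eigenvalues, so only there is the 2-D displacement well conditioned.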

Slide credit: Patrick Perez

Aperture Problem Example

Image from Gary Bradski slides

If we look through a small hole...

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

- Translation is usually sufficient for small fragments, but
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption
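
The two checks can be sketched as follows. For brevity the affine warp is left out: the current patch is assumed to be already warped back into the frame of the initial patch. The thresholds and the function name `tracklet_alive` are hypothetical.

```python
def ssd(a, b):
    """Sum of squared intensity differences between two equally sized patches."""
    return sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))

def tracklet_alive(initial_patch, previous_patch, current_patch,
                   max_step_ssd, max_drift_ssd):
    """Keep a tracklet only if it matches both the previous and the initial patch."""
    step_ok = ssd(previous_patch, current_patch) <= max_step_ssd
    drift_ok = ssd(initial_patch, current_patch) <= max_drift_ssd
    return step_ok and drift_ok

# Slow drift: every frame differs only slightly from the previous one,
# but the difference to the initial patch keeps growing.
initial = [[0, 0], [0, 0]]
frames = [[[10 * k, 10 * k], [10 * k, 10 * k]] for k in range(1, 5)]

alive = []
previous = initial
for current in frames:
    alive.append(tracklet_alive(initial, previous, current,
                                max_step_ssd=500, max_drift_ssd=5000))
    previous = current
print(alive)
```

The frame-to-frame test alone never fires in this example. The comparison with the initial patch is what catches the accumulated drift in the last frame.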

Slide credit: Patrick Perez

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties
  - fast
  - easily extended to image-to-image transformations with multiple parameters

What about deep learning or CNN for tracking?

Recap: CaffeNet architecture

AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012
CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"
Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016

Recent trackers development

https://github.com/foolwood/benchmark_results

[Figure legend: CNN, KCF]

Discriminative Tracking (T. by Detection)

t = 0: samples with labels (+1 +1 +1 -1 -1 -1) are used to train a classifier

t > 0: ... classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

"Cross-correlation is equivalent to an element-wise product in Fourier domain"

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$

• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
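
The theorem is easy to check numerically on a short 1-D signal. The sketch below uses a naive O(n^2) DFT written with the standard library only, so that it stays self-contained. The convention y[m] = sum_k x[k] w[(k + m) mod n] for the cyclic cross-correlation is an assumption chosen to match the conjugate on x.

```python
import cmath

def dft(v, inverse=False):
    """Naive discrete Fourier transform (O(n^2)); fine for tiny examples."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [o / n for o in out] if inverse else out

def circular_cross_correlation(x, w):
    """Direct evaluation: y[m] = sum_k x[k] * w[(k + m) % n]."""
    n = len(x)
    return [sum(x[k] * w[(k + m) % n] for k in range(n)) for m in range(n)]

def correlation_via_dft(x, w):
    """Same result from an element-wise product in the Fourier domain."""
    xf, wf = dft(x), dft(w)
    yf = [a.conjugate() * b for a, b in zip(xf, wf)]
    return [v.real for v in dft(yf, inverse=True)]

x = [1, 2, 3, 4]
w = [0, 1, 0, 0]
direct = circular_cross_correlation(x, w)
fast = correlation_via_dft(x, w)
print(direct, [round(v, 6) for v in fast])
```

Both routes give the same values up to rounding. With a real FFT in place of the naive DFT, the Fourier route costs O(n log n) instead of O(n^2).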

Kernelized Correlation Filters

• Circulant matrices are a very general tool which allows replacing standard operations with fast Fourier operations

• The same idea can be applied, e.g., to Kernel Ridge Regression, with kernel matrix $K$, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and dual-space representation

$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$

• For many kernels, circulant data $\Rightarrow$ circulant $K$ matrix

$K = C(\mathbf{k}^{\mathbf{x}\mathbf{x}})$, where $\mathbf{k}^{\mathbf{x}\mathbf{x}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields

$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{x}\mathbf{x}} + \lambda}$

Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
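
As a sanity check of the Fourier-domain solution, the sketch below solves the ridge system for a small 1-D signal and verifies that the result satisfies (K + lambda I) alpha = y, with K the circulant matrix built from the kernel auto-correlation. A linear kernel (plain cyclic auto-correlation) and a naive DFT are assumed to keep the example short; all names and values are illustrative.

```python
import cmath

def dft(v, inverse=False):
    """Naive discrete Fourier transform (O(n^2)); fine for tiny examples."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [o / n for o in out] if inverse else out

def autocorrelation(x):
    """Linear-kernel auto-correlation: k[m] = sum_j x[j] * x[(j + m) % n]."""
    n = len(x)
    return [sum(x[j] * x[(j + m) % n] for j in range(n)) for m in range(n)]

def train(x, y, lam):
    """alpha_hat = y_hat / (k_hat + lambda), returned in the signal domain."""
    kf, yf = dft(autocorrelation(x)), dft(y)
    alphaf = [b / (a + lam) for a, b in zip(kf, yf)]
    return [v.real for v in dft(alphaf, inverse=True)]

def circulant_times(k, v):
    """Multiply by C(k), the circulant matrix whose first row is k."""
    n = len(k)
    return [sum(k[(j - i) % n] * v[j] for j in range(n)) for i in range(n)]

x = [1.0, 2.0, 0.0, -1.0, 3.0]
y = [1.0, 0.5, 0.0, 0.0, 0.5]
lam = 0.1
alpha = train(x, y, lam)

# Residual of the dense system (K + lam * I) alpha = y
k_alpha = circulant_times(autocorrelation(x), alpha)
residual = max(abs(ka + lam * a - t) for ka, a, t in zip(k_alpha, alpha, y))
print(residual)
```

The residual is at rounding-error level, so the element-wise division gives the same alpha as inverting the full n x n matrix would.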

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{x}\mathbf{x}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$

• For the Gaussian kernel it yields

$\mathbf{k}^{\mathbf{x}\mathbf{x}'} = \exp\left(-\dfrac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$

1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{x}\mathbf{z}})$

2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{x}\mathbf{z}} \odot \hat{\boldsymbol{\alpha}}\right)$

• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$

• Kernel allows integration of more complex and multi-channel features

$k_i^{\mathbf{x}\mathbf{x}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$

Multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the kernel correlation term.
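
A minimal 1-D sketch of the Gaussian kernel correlation and of the detection step is given below. It trains on a signal x with a label peaked at shift 0, then evaluates a cyclically shifted copy z; the response should peak at the applied shift. The signal values, `sigma`, `lam` and the naive DFT are assumptions for illustration. The kernel follows the formula on this slide, without any extra normalisation.

```python
import cmath
import math

def dft(v, inverse=False):
    """Naive discrete Fourier transform (O(n^2)); fine for tiny examples."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [o / n for o in out] if inverse else out

def kernel_correlation(x, z, sigma):
    """Gaussian kernel between x and every cyclic shift, via the DFT."""
    xf, zf = dft(x), dft(z)
    c = [v.real for v in dft([a.conjugate() * b for a, b in zip(xf, zf)], inverse=True)]
    norms = sum(v * v for v in x) + sum(v * v for v in z)
    return [math.exp(-max(norms - 2.0 * ci, 0.0) / sigma ** 2) for ci in c]

def train(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat^xx + lambda)."""
    kf, yf = dft(kernel_correlation(x, x, sigma)), dft(y)
    return [b / (a + lam) for a, b in zip(kf, yf)]

def detect(alphaf, x, z, sigma):
    """f(z) = F^-1(k_hat^xz * alpha_hat), one response per cyclic shift."""
    kf = dft(kernel_correlation(x, z, sigma))
    return [v.real for v in dft([a * b for a, b in zip(kf, alphaf)], inverse=True)]

n, shift, sigma, lam = 8, 3, 2.0, 0.01
x = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0, -1.0, 0.2]
# Gaussian-shaped regression target, peaked at shift 0 (cyclic distance)
y = [math.exp(-min(m, n - m) ** 2 / 2.0) for m in range(n)]
z = [x[(j - shift) % n] for j in range(n)]   # the "next frame": x moved by 3 samples

alphaf = train(x, y, sigma, lam)
response = detect(alphaf, x, z, sigma)
peak = max(range(n), key=lambda m: response[m])
print(peak)
```

The location of the maximum response recovers the shift. A tracker would then re-centre on that location and interpolate alpha and x with the newly extracted patch.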

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open source (Matlab/Python/Java/C)

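function alphaf = train(x, y, sigma, lambda)
  % Kernel ridge regression solved in the Fourier domain
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);
end

function y = detect(alphaf, x, z, sigma)
  % Response map over all cyclic shifts of z
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));
end

function k = kernel_correlation(x1, x2, sigma)
  % Gaussian kernel correlation; the sum runs over the channel dimension (3)
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end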

Training and detection (Matlab)

Sum over channel dimension in kernel computation

From KCF to Discriminative CF trackers

Basic

• Henriques et al.: CSK
  - raw grayscale pixel values as features
• Henriques et al.: KCF
  - HoG multi-channel features

Further work

• Danelljan et al.: DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al.: SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al.: SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows using a much larger search region
  - more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al.: Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al.: MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? "The zero-order tracker"

VOT community evolution

• ICCV2013: 51 coauthors, 14 pages
• ECCV2014: 57 coauthors, 27 pages
• ICCV2015: 128 coauthors, 24 pages
• ECCV2016: 141 coauthors, 44 pages
• + VOT-TIR paper (69 coauthors)
• + VOT-TIR paper (70 coauthors)

[Figure: timeline with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked]

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

Challenge | Perf. measures   | Dataset size   | Target box | Property  | Trackers tested
VOT2013   | ranks, A, R      | 16, s. manual  | manual     | per frame | 27
VOT2014   | ranks, A, R, EFO | 25, s. manual  | manual     | per frame | 38
VOT2015   | EAO, A, R, EFO   | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016   | EAO, A, R, EFO   | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
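
For orientation, A and R in the VOT methodology stand for accuracy (average overlap between predicted and ground-truth boxes) and robustness (number of failures). The sketch below is a simplified illustration under stated assumptions: boxes are axis-aligned `(x, y, w, h)` tuples, a failure is a frame with zero overlap, and the re-initialisation of the tracker after a failure used by the actual protocol is not modelled.

```python
def iou(a, b):
    """Intersection over union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_robustness(predicted, ground_truth):
    """Mean overlap over non-failed frames, and the number of zero-overlap frames."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    valid = [o for o in overlaps if o > 0.0]
    accuracy = sum(valid) / len(valid) if valid else 0.0
    return accuracy, failures

ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
accuracy, failures = accuracy_robustness(predicted, ground_truth)
print(round(accuracy, 3), failures)
```

In this toy run the first frame overlaps perfectly, the second by one third, and the third not at all, so the accuracy is 2/3 with one failure.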

Class of trackers tested

• Single-object, single-camera
• Short-term
  - Trackers performing without re-detection
• Causality
  - Tracker is not allowed to use any future frames
• No prior knowledge about the target
  - Only a single training example: BBox in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

Sources (443 sequences):

• ALOV (315 seq.) [Smeulders et al., 2013]
• PTR (~50 seq.) [Vojir et al., 2013]
• OTB (~50 seq.) [Wu et al., 2013]
• >30 new sequences from the VOT2015 committee

Filtered out:

• Grayscale sequences
• Targets of <400 pixels
• Poorly-defined targets
• Artificially created sequences

[Figure: example of a poorly defined target; example of an artificially created sequence]

356 sequences remain and enter the VOT Automatic Dataset Construction Protocol (cluster + sample)

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)

Feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences

Construction (3/3): Sampling

• Requirement
  - Diverse visual attributes
  - Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse



The KLT Tracker

47150Slide credit Kris Kitani

KLT Tracker

slide creditTomas Svoboda

48150

Importance in Computer Vision

bull Firstly published in 1981 as an image registration method [3]

bull Improved many times most importantly by Carlo Tomasi [54]

bull Free implementaton(s) available1

bull After more than two decades a project2 at CMU dedicated to this

bull single algorithm and results published in a premium journal [1]

bull Part of plethora computer vision algorithms1httpwwwcesclemsonedu~stbklt2httpwwwricmueduprojectsproject_515html

Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

slide creditTomas Svoboda 50150

Original Lucas-Kanade algorithm II

slide creditTomas Svoboda 51150

Original Lucas-Kanade algorithm III

slide creditTomas Svoboda 52150

Original Lucas-Kanade algorithm IV

slide creditTomas Svoboda 53150

Original Lucas-Kanade summary

slide credit Tomas Svoboda 54150

KLT Tracker

For good conditioning patch must be

texturedstructured enough

bull Uniform patch no information

bull Contour element aperture problem (one dimensional

information)

bull Corners blobs and texture best estimate

[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]

slide credit

Patrick Perez

55150

Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

ndash Translation is usually sufficient for small fragments but

bull Perspective transforms and occlusions cause drift and loss

ndash Two complementary options

bull Kill tracklets when minimum SSD too large

bull Compare as well with initial patch under affine transform (warp) assumption

slide credit

Patrick Perez

58150

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window

• optimization technique: gradient descent

• model learning: no update / last frame / convex combination

• attractive properties:

- fast

- easily extended to image-to-image transformations with multiple parameters

What about deep learning or CNNs for tracking?

Recap: CaffeNet architecture

AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012
CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"

Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016

Recent trackers development

https://github.com/foolwood/benchmark_results

[Figure labels: CNN, KCF]

Discriminative Tracking (Tracking by Detection)

t = 0: samples with labels +1 +1 +1 -1 -1 -1 → classifier

t > 0: ... classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

"Cross-correlation is equivalent to an element-wise product in the Fourier domain":

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$

• where

- $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)

- $\times$ is the element-wise product

- $*$ is the complex conjugate (i.e. negate the imaginary part)

• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
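The theorem can be checked numerically on a short one-dimensional signal. The sketch below uses a naive O(n^2) DFT from the standard library in place of an FFT so that it has no dependencies; the helper names and the test signal are hypothetical, and the cyclic convention y[i] = sum_j x[j] w[(j + i) mod n] is the one assumed here.

```python
import cmath

def dft(v, inverse=False):
    """Naive O(n^2) DFT (stand-in for an FFT); the inverse includes the 1/n factor."""
    n = len(v)
    sign = 1.0 if inverse else -1.0
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out

def cross_correlation_direct(x, w):
    """Cyclic cross-correlation computed directly: y[i] = sum_j x[j] * w[(j + i) mod n]."""
    n = len(x)
    return [sum(x[j] * w[(j + i) % n] for j in range(n)) for i in range(n)]

def cross_correlation_fourier(x, w):
    """Same quantity via y_hat = conj(x_hat) * w_hat (element-wise)."""
    yf = [a.conjugate() * b for a, b in zip(dft(x), dft(w))]
    return [c.real for c in dft(yf, inverse=True)]

x = [1.0, 3.0, -2.0, 0.5, 0.0, 4.0, -1.0, 2.0]
shift = 3
w = [x[(m - shift) % len(x)] for m in range(len(x))]  # x moved cyclically by 3 samples

direct = cross_correlation_direct(x, w)
fourier = cross_correlation_fourier(x, w)
peak = max(range(len(x)), key=lambda i: fourier[i])
print(peak)
```

Both routes agree up to rounding, and the correlation peak sits at the cyclic shift between the two signals, which is what a correlation tracker reads off as the target displacement.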

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows replacing standard operations with fast Fourier operations

• The same idea can be applied, e.g., to Kernel Ridge Regression:

$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$

with kernel matrix $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$ and dual-space representation

• For many kernels, circulant data ⇒ circulant K matrix:

$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields:

$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$

⇒ Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
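As a sanity check of the Fourier-domain solution, the sketch below solves the ridge regression in the Fourier domain and then multiplies the result by the explicit circulant matrix K + λI to confirm that y is recovered. It assumes a linear kernel (so the kernel auto-correlation is the plain cyclic auto-correlation), a naive DFT instead of an FFT, and hypothetical helper names and data.

```python
import cmath
import math

def dft(v, inverse=False):
    """Naive O(n^2) DFT (stand-in for an FFT); the inverse includes the 1/n factor."""
    n = len(v)
    sign = 1.0 if inverse else -1.0
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out

def linear_kernel_autocorrelation(x):
    """k[i] = <x, x shifted by i>: the first row of the circulant kernel matrix."""
    n = len(x)
    return [sum(x[j] * x[(j + i) % n] for j in range(n)) for i in range(n)]

def train_fourier(x, y, lam):
    """alpha_hat = y_hat / (k_hat + lambda), returned in the signal domain."""
    kf = dft(linear_kernel_autocorrelation(x))
    alphaf = [b / (a + lam) for a, b in zip(kf, dft(y))]
    return [c.real for c in dft(alphaf, inverse=True)]

def apply_regularized_kernel_matrix(x, alpha, lam):
    """Compute (K + lambda I) alpha with the explicit circulant K, K[i][j] = k[(j - i) mod n]."""
    k = linear_kernel_autocorrelation(x)
    n = len(x)
    return [sum(k[(j - i) % n] * alpha[j] for j in range(n)) + lam * alpha[i]
            for i in range(n)]

x = [1.0, 3.0, -2.0, 0.5, 0.0, 4.0, -1.0, 2.0]
n = len(x)
y = [math.exp(-0.5 * (min(i, n - i) / 1.5) ** 2) for i in range(n)]  # labels peaked at 0
lam = 0.1  # assumed regularization weight

alpha = train_fourier(x, y, lam)
reconstructed = apply_regularized_kernel_matrix(x, alpha, lam)
print(max(abs(a - b) for a, b in zip(reconstructed, y)))
```

With an FFT in place of the naive DFT, the training step costs O(n log n), whereas forming and inverting the explicit n x n matrix would not.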

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:

$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$

• For the Gaussian kernel it yields:

$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\dfrac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$

(multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\mathcal{F}^{-1}(\cdot)$ term)

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:

1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$

2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$

• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of the maximum response $\mathbf{f}(\mathbf{z})$

• The kernel allows integration of more complex and multi-channel features
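The training, evaluation and update steps above can be traced in a one-dimensional, single-channel sketch. Several things here are assumptions made for illustration: the naive DFT in place of an FFT, the division of the exponent by the signal length (mirroring common implementations), the argument order `gaussian_correlation(x, z)` following the $\mathbf{k}^{\mathbf{xz}}$ notation, and all parameter values and helper names.

```python
import cmath
import math

def dft(v, inverse=False):
    """Naive O(n^2) DFT (stand-in for an FFT); the inverse includes the 1/n factor."""
    n = len(v)
    sign = 1.0 if inverse else -1.0
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out

def gaussian_correlation(x1, x2, sigma):
    """Gaussian kernel correlation of x1 with every cyclic shift of x2."""
    n = len(x1)
    cf = [a.conjugate() * b for a, b in zip(dft(x1), dft(x2))]
    c = [v.real for v in dft(cf, inverse=True)]          # cyclic cross-correlation
    energy = sum(a * a for a in x1) + sum(b * b for b in x2)
    # Assumption: exponent normalized by n, as is common in implementations.
    return [math.exp(-abs(energy - 2.0 * ci) / (sigma ** 2 * n)) for ci in c]

def train(x, y, sigma, lam):
    """Fourier-domain classifier: alpha_hat = y_hat / (k_hat^xx + lambda)."""
    kf = dft(gaussian_correlation(x, x, sigma))
    return [yf / (k + lam) for yf, k in zip(dft(y), kf)]

def detect(alphaf, x, z, sigma):
    """Response over all cyclic shifts: f(z) = F^-1(k_hat^xz * alpha_hat)."""
    kf = dft(gaussian_correlation(x, z, sigma))
    return [v.real for v in dft([a * k for a, k in zip(alphaf, kf)], inverse=True)]

def interpolate(old, new, rate):
    """Linear interpolation used to update the model and the classifier."""
    return [(1.0 - rate) * o + rate * v for o, v in zip(old, new)]

n, shift = 16, 5
x = [math.sin(0.9 * i) + 0.5 * math.cos(2.3 * i) + 0.1 * i for i in range(n)]
y = [math.exp(-0.5 * (min(i, n - i) / 1.5) ** 2) for i in range(n)]  # labels peaked at 0
z = [x[(i - shift) % n] for i in range(n)]                            # target moved by 5

alphaf = train(x, y, sigma=0.5, lam=1e-3)
response = detect(alphaf, x, z, sigma=0.5)
peak = max(range(n), key=lambda i: response[i])
model = interpolate(x, z, rate=0.02)  # hypothetical learning rate
print(peak)
```

Under these conventions the response peaks at the cyclic shift of the target, after which the model and the classifier are blended with their new estimates rather than replaced outright.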

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters

• code fits on one slide of the presentation

• uses HoG features (32 channels)

• ~300 FPS

• Open-Source (Matlab/Python/Java/C)

Training and detection (Matlab):

function alphaf = train(x, y, sigma, lambda)
  k = kernel_correlation(x, x, sigma);
  alphaf = fft2(y) ./ (fft2(k) + lambda);
end

function y = detect(alphaf, x, z, sigma)
  k = kernel_correlation(z, x, sigma);
  y = real(ifft2(alphaf .* fft2(k)));
end

function k = kernel_correlation(x1, x2, sigma)
  % sum over the channel dimension (3) in the kernel computation
  c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
  d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
  k = exp(-1 / sigma^2 * abs(d) / numel(d));
end

From KCF to Discriminative CF trackers

Basic

• Henriques et al. - CSK

- raw grayscale pixel values as features

• Henriques et al. - KCF

- HoG multi-channel features

Further work

• Danelljan et al. - DSST

- PCA-HoG + grayscale pixel features

- filters for translation and for scale (in the scale-space pyramid)

• Li et al. - SAMF

- HoG, color-naming and grayscale pixel features

- quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. - SRDCF

- spatial regularization in the learning process

→ limits the boundary effect

→ penalizes filter coefficients depending on their spatial location

- allows using a much larger search region

- more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)

• Ma et al.

- features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers

- for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al. - MDNet, T-CNN

- CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? "The zero-order tracker"

Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution

[Figure: growth of the VOT community over time, with the VOT2013-VOT2016 submission deadlines marked; axis ticks at 1500 and 3000]

• ICCV2013: 51 coauthors, 14 pgs

• ECCV2014: 57 coauthors, 27 pgs

• ICCV2015: 128 coauthors, 24 pgs + VOT-TIR paper (69 coauth.)

• ECCV2016: 141 coauthors, 44 pgs + VOT-TIR paper (70 coauth.)

VOT challenge evolution

• Gradual increase of dataset size

• Gradual refinement of dataset construction

• Gradual refinement of performance measures

• Gradual increase of tested trackers

Challenge | Perf. measures | Dataset size | Target box | Property | Trackers tested
VOT2013 | ranks, A, R | 16, s. manual | manual | per frame | 27
VOT2014 | ranks, A, R, EFO | 25, s. manual | manual | per frame | 38
VOT2015 | EAO, A, R, EFO | 60, fully auto | manual | per frame | 62 VOT, 24 VOT-TIR
VOT2016 | EAO, A, R, EFO | 60, fully auto | auto | per frame | 70 VOT, 24 VOT-TIR
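The A (accuracy) and R (robustness) measures in the table are built on bounding-box overlap and on failure counts. The sketch below illustrates only that core idea: boxes are assumed to be axis-aligned (x, y, width, height) tuples, a failure is assumed to be a frame with zero overlap, and the tracker re-initialization that the actual VOT protocol performs after a failure is left out.

```python
def overlap(a, b):
    """Intersection-over-union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Average overlap over tracked frames, and the number of zero-overlap frames.

    Simplification: no re-initialization after a failure (unlike the VOT protocol).
    """
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures

# Hypothetical two-frame sequence: a perfect frame followed by a lost target.
predicted = [(0.0, 0.0, 2.0, 2.0), (10.0, 10.0, 1.0, 1.0)]
ground_truth = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
print(accuracy, failures)
```

Reporting the two numbers separately keeps a tracker that is precise but fragile distinguishable from one that is coarse but rarely loses the target.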

Class of trackers tested

• Single-object, single-camera

• Short-term

- Trackers performing without re-detection

• Causality

- Tracker is not allowed to use any future frames

• No prior knowledge about the target

- Only a single training example: the bounding box in the first frame

• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

Sources: ALOV (315 seq.) [Smeulders et al., 2013] + PTR (~50 seq.) [Vojir et al., 2013] + OTB (~50 seq.) [Wu et al., 2013] + >30 new sequences from the VOT2015 committee = 443 sequences

Filtered out:

• Grayscale sequences

• <400 pixels targets

• Poorly-defined targets

• Artificially created sequences

[Figure: example of a poorly defined target; example of an artificially created sequence]

→ 356 sequences → VOT Automatic Dataset Construction Protocol: cluster + sample

Construction (2/3): Clustering

• Approximately annotate targets

• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)

• Cluster into K = 28 groups (automatic selection of K)

Pipeline: feature encoding (11 dim.) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences

Construction (3/3): Sampling

• Requirement:

- Diverse visual attributes

- Challenging subset

• Global visual attributes: computed

• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers

• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement

1. Segment the target (semi-automatic)

2. Automatically fit a bounding box by optimizing a cost function

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)

• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty: better ground truth

• Each frame manually per-pixel segmented

• B-boxes automatically generated from the segmentation

Kristan et al., VOT2016 results

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters

• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN

• Most accurate: SSAT

• Most robust: C-COT and MLDF

Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (vertical axis: accuracy)]

VOT 2016 Tracking speed

• Top performers are the slowest

- Plausible cause: CNN

• Real-time bound: Staple+

- Decent accuracy

- Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

VOT public resources

• Resources publicly available on the VOT page:

- Raw results of all tested trackers

- Relevant methodology papers

- 2016: submitted trackers code/binaries

- All fully annotated datasets (2013-2016)

- Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems

• The area is just starting to be affected by CNNs

• Robustness at all levels is the road to reliable performance

• Key components of trackers:

- target learning (modelling, "template update")

- integration of detection and temporal smoothness assumptions

- representation of the image and target

• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)

• http://cs.brown.edu/courses/cs143

• https://www.udacity.com/course/introduction-to-computer-vision--ud810

• http://cs231n.stanford.edu

• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: a significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please?


Image alignment

slide creditTomas Svoboda 49150

Goal is to align a template image T(x) to an input image I(x) X - columnvector containing image coordinates [x y] The I(x) could be also a smallsubwindow within an image

Original Lucas-Kanade algorithm I

slide creditTomas Svoboda 50150

Original Lucas-Kanade algorithm II

slide creditTomas Svoboda 51150

Original Lucas-Kanade algorithm III

slide creditTomas Svoboda 52150

Original Lucas-Kanade algorithm IV

slide creditTomas Svoboda 53150

Original Lucas-Kanade summary

slide credit Tomas Svoboda 54150

KLT Tracker

For good conditioning patch must be

texturedstructured enough

bull Uniform patch no information

bull Contour element aperture problem (one dimensional

information)

bull Corners blobs and texture best estimate

[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]

slide credit

Patrick Perez

55150

Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

ndash Translation is usually sufficient for small fragments but

bull Perspective transforms and occlusions cause drift and loss

ndash Two complementary options

bull Kill tracklets when minimum SSD too large

bull Compare as well with initial patch under affine transform (warp) assumption

slide credit

Patrick Perez

58150

Characteristics of KLT

bull cost function sum of squared intensity differences

between template and window

bull optimization technique gradient descent

bull model learning no update last frame convex

combination

bull attractive properties

ndashfast

ndasheasily extended to image-to-image transformations with

multiple parameters

59150

What about deep

learning or

CNN for tracking

60150

61150

AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo

Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016

Recap CaffeNet architecture

63

Recent trackers development

74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic

bull Henriques et al ndash CSK

ndash raw grayscale pixel values as features

bull Henriques et al ndash KCF

ndash HoG multi-channel features

Further work

bull Danelljan et al ndash DSST

ndash PCA-HoG + grayscale pixels features

ndash filters for translation and for scale (in the scale-space pyramid)

bull Li et al ndash SAMF

ndash HoG color-naming and grayscale pixels features

ndash quantize scale space and normalize each scale to one size by bilinear

inter rarr only one filter on normalized size

80

Discriminative Correlation Filters Trackers

• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows the use of a much larger search region
  - more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? “The zero-order tracker”

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

• VOT2013 (ICCV 2013): 51 coauthors, 14 pages
• VOT2014 (ECCV 2014): 57 coauthors, 27 pages
• VOT2015 (ICCV 2015): 128 coauthors, 24 pages, + VOT-TIR paper (69 coauthors)
• VOT2016 (ECCV 2016): 141 coauthors, 44 pages, + VOT-TIR paper (70 coauthors)

[Figure: activity timeline with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked]

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

Challenge | Perf. measures   | Dataset size   | Target box | Property  | Trackers tested
VOT2013   | ranks, A, R      | 16, s. manual  | manual     | per frame | 27
VOT2014   | ranks, A, R, EFO | 25, s. manual  | manual     | per frame | 38
VOT2015   | EAO, A, R, EFO   | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016   | EAO, A, R, EFO   | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
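The A (accuracy) measure in the table is based on the overlap between the tracker's box and the ground-truth box. A minimal sketch of that overlap for axis-aligned boxes follows. The `(x, y, w, h)` box layout and the names `box_overlap` and `mean_overlap` are assumptions made for illustration, not the VOT toolkit API, and rotated boxes and re-initialization after failure are not modelled.

```python
def box_overlap(a, b):
    """Intersection over union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    # Width and height of the intersection rectangle (zero if the boxes are disjoint).
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def mean_overlap(predicted, ground_truth):
    """Average per-frame overlap over a sequence (hypothetical accuracy summary)."""
    overlaps = [box_overlap(p, g) for p, g in zip(predicted, ground_truth)]
    return sum(overlaps) / len(overlaps)


predicted = [(0, 0, 2, 2), (1, 1, 2, 2)]
ground_truth = [(0, 0, 2, 2), (0, 0, 2, 2)]
print(mean_overlap(predicted, ground_truth))
```

Robustness (R) counts tracking failures instead, so a tracker can score well on one measure and poorly on the other.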

Class of trackers tested

• Single-object, single-camera
• Short-term
  - Trackers performing without re-detection
• Causality
  - Tracker is not allowed to use any future frames
• No prior knowledge about the target
  - Only a single training example: BBox in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

Candidate pool (443 sequences):
• ALOV (315 seq.) [Smeulders et al., 2013]
• PTR (~50 seq.) [Vojir et al., 2013]
• OTB (~50 seq.) [Wu et al., 2013]
• >30 new sequences from the VOT2015 committee

Filtered out:
• Grayscale sequences
• Targets smaller than 400 pixels
• Poorly-defined targets
• Artificially created sequences

[Figure: example of a poorly defined target; example of an artificially created sequence]

Remaining: 356 sequences → VOT Automatic Dataset Construction Protocol (cluster + sample)

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)

Pipeline: feature encoding (11-dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences

Construction (3/3): Sampling

• Requirements:
  - Diverse visual attributes
  - Challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied FoT, ASMS and KCF trackers
• Developed a sampling strategy that samples challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

Real-time trackers by year:
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF

Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (accuracy axis) with C-COT, TCNN, SSAT and MLDF marked]

VOT 2016: Tracking speed

• Top performers are the slowest
  - Plausible cause: CNN
• Real-time bound: Staple+
  - Decent accuracy
  - Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked]

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers' code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• “Visual Tracking” may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  - target learning (modelling, “template update”)
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143/
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu/
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: a significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

Original Lucas-Kanade algorithm I

Original Lucas-Kanade algorithm II

Original Lucas-Kanade algorithm III

Original Lucas-Kanade algorithm IV

Original Lucas-Kanade summary

Slide credit (all five slides): Tomas Svoboda

KLT Tracker

For good conditioning, the patch must be textured/structured enough:

• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR’94]

Slide credit: Patrick Perez
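The conditioning argument can be checked numerically. Lucas-Kanade solves a 2×2 system built from summed gradient products over the patch, and that system is well conditioned only when its smaller eigenvalue is large (the Tomasi and Shi criterion). The sketch below is illustrative only: the helper names, the central-difference gradients and the 7×7 toy patches are assumptions, not part of the original slides.

```python
import math


def structure_tensor(patch):
    """Sum of gradient outer products over the patch interior (central differences)."""
    h, w = len(patch), len(patch[0])
    gxx = gxy = gyy = 0.0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            ix = (patch[r][c + 1] - patch[r][c - 1]) / 2.0
            iy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
    return gxx, gxy, gyy


def min_eigenvalue(patch):
    """Smaller eigenvalue of the symmetric 2x2 structure tensor."""
    gxx, gxy, gyy = structure_tensor(patch)
    mean = (gxx + gyy) / 2.0
    spread = math.sqrt(((gxx - gyy) / 2.0) ** 2 + gxy ** 2)
    return mean - spread


# Toy patches: constant intensity, a vertical contour, and a corner.
uniform = [[5.0] * 7 for _ in range(7)]
edge = [[0.0 if c < 3 else 1.0 for c in range(7)] for _ in range(7)]
corner = [[1.0 if (r >= 3 and c >= 3) else 0.0 for c in range(7)] for r in range(7)]

for name, patch in (("uniform", uniform), ("edge", edge), ("corner", corner)):
    print(name, min_eigenvalue(patch))
```

Only the corner patch gives a clearly positive smaller eigenvalue. The uniform patch and the contour leave at least one direction of motion unconstrained, which is the aperture problem.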

Aperture Problem Example

Image from Gary Bradski slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

- Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

- Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
- Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez
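The two monitoring options can be sketched as a simple acceptance test. This is a hedged illustration, not the KLT reference implementation: the function names and the single threshold `max_ssd` are hypothetical, and the current window is assumed to have already been warped back into the frame of the initial patch.

```python
def ssd(a, b):
    """Sum of squared intensity differences between two equally sized patches."""
    return sum((pa - pb) ** 2 for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))


def keep_tracklet(previous_patch, initial_patch, window, max_ssd):
    """Keep the tracklet only if the window matches both reference patches.

    previous_patch: patch from the last frame (frame-to-frame check)
    initial_patch:  patch from the first frame (drift check)
    window:         current patch, assumed already warped to the reference frame
    """
    return ssd(previous_patch, window) <= max_ssd and ssd(initial_patch, window) <= max_ssd


initial = [[1.0, 2.0], [3.0, 4.0]]
previous = [[1.0, 2.0], [3.0, 5.0]]
current = [[1.0, 2.0], [3.0, 6.0]]
print(ssd(previous, current), ssd(initial, current))
print(keep_tracklet(previous, initial, current, max_ssd=2.0))
```

In this toy example the frame-to-frame error stays small while the error against the initial patch has grown, which is how slow drift shows up.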

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  - fast
  - easily extended to image-to-image transformations with multiple parameters

What about deep learning or CNN for tracking?

AlexNet (original): Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, 2012
CaffeNet: Jia et al., “Caffe: Convolutional Architecture for Fast Feature Embedding”, 2014
Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”

Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016

Recap: CaffeNet architecture

Recent trackers development

https://github.com/foolwood/benchmark_results

[Figure: tracker development chart with CNN and KCF groups labelled]

Discriminative Tracking (Tracking by Detection)

• t = 0: samples with labels (+1, +1, +1, −1, −1, −1) → train classifier
• t > 0: … classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

“Cross-correlation is equivalent to an element-wise product in the Fourier domain”

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$

• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - ${}^{*}$ is the complex conjugate (i.e. negate the imaginary part)

• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
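The theorem is easy to verify on a short 1-D signal. The sketch below compares a direct cyclic cross-correlation with the Fourier-domain product $\hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$. It uses a naive $\mathcal{O}(n^2)$ DFT written out in pure Python so that it has no dependencies; the function names and the indexing convention $y_i = \sum_j x_j w_{(j+i) \bmod n}$ are assumptions chosen for this illustration.

```python
import cmath


def dft(v, inverse=False):
    """Naive O(n^2) DFT; an FFT would compute the same values in O(n log n)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [o / n for o in out] if inverse else out


def cross_correlation_direct(x, w):
    """Cyclic cross-correlation: y[i] = sum_j x[j] * w[(j + i) mod n]."""
    n = len(x)
    return [sum(x[j] * w[(j + i) % n] for j in range(n)) for i in range(n)]


def cross_correlation_fourier(x, w):
    """Same quantity via an element-wise product in the Fourier domain."""
    xf, wf = dft(x), dft(w)
    yf = [a.conjugate() * b for a, b in zip(xf, wf)]
    return [v.real for v in dft(yf, inverse=True)]


x = [1.0, 2.0, 3.0, 4.0]
w = [0.0, 1.0, 0.5, 0.0]
print(cross_correlation_direct(x, w))
print(cross_correlation_fourier(x, w))
```

Both routines wrap around the ends of the signal, which is the cyclic behaviour noted on the slide.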

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations

• The same idea can be applied, e.g., to Kernel Ridge Regression, with kernel matrix $K$, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and dual-space representation

$\boldsymbol{\alpha} = (K + \lambda I)^{-1}\mathbf{y}$

• For many kernels, circulant data ⇒ circulant $K$ matrix

$K = C(\mathbf{k}^{\mathbf{x}\mathbf{x}})$, where $\mathbf{k}^{\mathbf{x}\mathbf{x}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields

$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{x}\mathbf{x}} + \lambda}$

Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
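As a sanity check on the Fourier-domain solution, the sketch below trains on all cyclic shifts of a 1-D signal and then confirms that the resulting $\boldsymbol{\alpha}$ satisfies the dense system $(K + \lambda I)\boldsymbol{\alpha} = \mathbf{y}$. Everything here is an illustrative assumption: the 1-D setting, the function names, the Gaussian kernel $\kappa(\mathbf{u}, \mathbf{v}) = \exp(-\|\mathbf{u} - \mathbf{v}\|^2 / \sigma^2)$ and the naive DFT, which stands in for an FFT and therefore does not itself reach $\mathcal{O}(n \log n)$.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive O(n^2) DFT used only to keep the sketch dependency-free."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [o / n for o in out] if inverse else out


def shift(x, i):
    """Cyclic shift of x by i positions."""
    n = len(x)
    return [x[(j - i) % n] for j in range(n)]


def gaussian(u, v, sigma):
    return math.exp(-sum((a - b) ** 2 for a, b in zip(u, v)) / sigma ** 2)


def train(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat + lambda), returned in the signal domain."""
    k = [gaussian(x, shift(x, i), sigma) for i in range(len(x))]  # first row of K
    alphaf = [b / (a + lam) for a, b in zip(dft(k), dft(y))]
    return [v.real for v in dft(alphaf, inverse=True)]


def dense_residual(x, y, sigma, lam):
    """Largest entry of |(K + lambda*I) alpha - y| with K built explicitly."""
    alpha = train(x, y, sigma, lam)
    n = len(x)
    worst = 0.0
    for i in range(n):
        row = sum(gaussian(shift(x, i), shift(x, j), sigma) * alpha[j] for j in range(n))
        worst = max(worst, abs(row + lam * alpha[i] - y[i]))
    return worst


x = [0.0, 1.0, 3.0, 2.0, 0.5, 0.0, 0.0, 0.0]
y = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5]
print(dense_residual(x, y, sigma=2.0, lam=0.01))
```

The dense check builds all $n^2$ kernel values, while training only needs the $n$ values in the first row of $K$; that gap is the saving the slide refers to.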

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{x}\mathbf{x}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$

$k_i^{\mathbf{x}\mathbf{x}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$

• For the Gaussian kernel it yields

$\mathbf{k}^{\mathbf{x}\mathbf{x}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:
  1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{x}\mathbf{z}})$
  2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{x}\mathbf{z}} \odot \hat{\boldsymbol{\alpha}})$

• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$

• Kernel allows integration of more complex and multi-channel features: multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the correlation term $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$
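The evaluation and update steps can be tried end to end on a 1-D toy signal. The sketch below follows the formulas of this slide (Gaussian kernel correlation, response $\mathbf{f}(\mathbf{z})$, linear-interpolation update), but it is only an illustration under stated assumptions: single channel, 1-D data, a naive DFT in place of an FFT, and hypothetical names such as `interpolate` and the learning rate `eta`.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive O(n^2) DFT; a real tracker would use an FFT."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [o / n for o in out] if inverse else out


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x1 and every cyclic shift of x2."""
    f1, f2 = dft(x1), dft(x2)
    c = [v.real for v in dft([a.conjugate() * b for a, b in zip(f1, f2)], inverse=True)]
    energy = sum(a * a for a in x1) + sum(b * b for b in x2)
    # energy - 2*c[i] is a squared distance; clamp tiny negative rounding errors.
    return [math.exp(-max(energy - 2.0 * ci, 0.0) / sigma ** 2) for ci in c]


def train(x, y, sigma, lam):
    """Classifier in the Fourier domain: alpha_hat = y_hat / (k_hat + lambda)."""
    kf = dft(kernel_correlation(x, x, sigma))
    return [b / (a + lam) for a, b in zip(kf, dft(y))]


def detect(alphaf, x, z, sigma):
    """Response f(z) for all cyclic shifts of the new patch z."""
    kf = dft(kernel_correlation(x, z, sigma))
    return [v.real for v in dft([a * k for a, k in zip(alphaf, kf)], inverse=True)]


def interpolate(old, new, eta):
    """Linear-interpolation update; eta is a hypothetical learning rate."""
    return [(1.0 - eta) * o + eta * n for o, n in zip(old, new)]


x = [0.0, 0.2, 1.0, 0.3, 0.0, 0.0, 0.1, 0.0]      # model patch
y = [1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.5]      # desired response, peak at shift 0
z = [x[(j - 3) % len(x)] for j in range(len(x))]  # target moved by 3 samples

alphaf = train(x, y, sigma=0.5, lam=1e-3)
response = detect(alphaf, x, z, sigma=0.5)
peak = max(range(len(response)), key=lambda i: response[i])
print(peak)

# Update step: re-centre the new patch on the detected peak, then blend
# it with the old model and classifier.
x_new = [z[(j + peak) % len(z)] for j in range(len(z))]
model_x = interpolate(x, x_new, eta=0.02)
model_alphaf = interpolate(alphaf, train(x_new, y, sigma=0.5, lam=1e-3), eta=0.02)
```

With this indexing convention the response peak lands at the displacement of the target, so the location of the maximum gives the translation estimate directly.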



Original Lucas-Kanade algorithm III

Slide credit: Tomas Svoboda

Original Lucas-Kanade algorithm IV

Slide credit: Tomas Svoboda

Original Lucas-Kanade summary

Slide credit: Tomas Svoboda

KLT Tracker

For good conditioning, the patch must be textured/structured enough:

• Uniform patch: no information

• Contour element: aperture problem (one-dimensional information)

• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]

Slide credit: Patrick Perez
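The conditioning argument above can be made concrete with the 2x2 gradient matrix that Lucas-Kanade inverts. The sketch below is an illustration under stated assumptions, not code from the lecture: the helper name `structure_tensor_eigenvalues`, the 8x8 synthetic patches and the central-difference gradients are all hypothetical choices. Two clearly positive eigenvalues mean both motion components are constrained; one zero eigenvalue is the aperture problem; two zeros mean no information.

```python
import math

def structure_tensor_eigenvalues(patch):
    """Return (lambda_min, lambda_max) of the 2x2 gradient matrix summed over a patch."""
    h, w = len(patch), len(patch[0])
    gxx = gxy = gyy = 0.0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            # central differences approximate the image gradient
            ix = (patch[r][c + 1] - patch[r][c - 1]) / 2.0
            iy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
    # closed-form eigenvalues of the symmetric matrix [[gxx, gxy], [gxy, gyy]]
    mean = (gxx + gyy) / 2.0
    diff = math.sqrt(((gxx - gyy) / 2.0) ** 2 + gxy ** 2)
    return mean - diff, mean + diff

n = 8
uniform = [[1.0] * n for _ in range(n)]
edge = [[1.0 if c >= n // 2 else 0.0 for c in range(n)] for r in range(n)]
corner = [[1.0 if (r >= n // 2 and c >= n // 2) else 0.0 for c in range(n)]
          for r in range(n)]

for name, patch in [("uniform", uniform), ("edge", edge), ("corner", corner)]:
    lo, hi = structure_tensor_eigenvalues(patch)
    print(name, round(lo, 3), round(hi, 3))
```

Thresholding the smaller eigenvalue is one common way to pick patches that are worth tracking.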

Aperture Problem Example

Image from Gary Bradski's slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

– Arbitrary displacement

• Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

– Translation is usually sufficient for small fragments, but

• Perspective transforms and occlusions cause drift and loss

– Two complementary options

• Kill tracklets when the minimum SSD is too large

• Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez
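The first option, killing a tracklet when its best SSD is too large, can be sketched as a simple gate. This is a minimal illustration, assuming the SSD at the best match has already been found; the function names and the threshold `max_ssd` are hypothetical and would need tuning per patch size and intensity range.

```python
def ssd(a, b):
    """Sum of squared differences between two equally sized patches."""
    return sum((p - q) ** 2 for ra, rb in zip(a, b) for p, q in zip(ra, rb))

def keep_tracklet(template, window, max_ssd):
    """Hypothetical quality gate: keep the tracklet only if the SSD is small enough."""
    return ssd(template, window) <= max_ssd

template = [[10, 20], [30, 40]]
good = [[11, 19], [30, 42]]      # small appearance change
occluded = [[90, 90], [90, 90]]  # e.g. the target got occluded

print(ssd(template, good), keep_tracklet(template, good, max_ssd=50))
print(ssd(template, occluded), keep_tracklet(template, occluded, max_ssd=50))
```

The second option would apply the same comparison against the initial patch after warping it with the estimated affine transform.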

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window

• optimization technique: gradient descent

• model learning: no update / last frame / convex combination

• attractive properties:

– fast

– easily extended to image-to-image transformations with multiple parameters
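To illustrate the cost function and its iterative minimisation, here is a one-dimensional sketch that recovers a pure translation. It uses a Gauss-Newton step (the variant named on the multi-resolution slide) rather than plain gradient descent, and everything in it is an assumption made for illustration: the Gaussian-bump signal, `TRUE_SHIFT`, the finite-difference derivative and the fixed iteration count.

```python
import math

def template(x):
    """Reference signal T(x): a smooth bump centred at x = 10."""
    return math.exp(-((x - 10.0) ** 2) / 8.0)

TRUE_SHIFT = 1.5

def image(x):
    """Current signal I(x) = T(x - d), so that I(x + d) = T(x)."""
    return template(x - TRUE_SHIFT)

def estimate_shift(xs, iterations=20, eps=1e-3):
    p = 0.0  # initial guess: no motion
    for _ in range(iterations):
        num = den = 0.0
        for x in xs:
            # derivative of I at the warped position (central difference)
            grad = (image(x + p + eps) - image(x + p - eps)) / (2.0 * eps)
            residual = template(x) - image(x + p)
            num += grad * residual
            den += grad * grad
        p += num / den  # Gauss-Newton step on the SSD cost
    return p

shift = estimate_shift(range(21))
print(round(shift, 4))
```

The estimate is only reliable when the initial displacement is small relative to the signal's structure, which is the motivation for the pyramid approach.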

What about deep learning or CNN for tracking?

AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012

CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014

Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"

Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016

Recap: CaffeNet architecture

Recent trackers development

https://github.com/foolwood/benchmark_results

[Figure labels: CNN, KCF]

Discriminative Tracking (T by Detection)

t = 0

samples

labels: +1 +1 +1 −1 −1 −1

Classifier

t > 0

… Classify subwindows to find target

Connection to Correlation

The Convolution Theorem

"Cross-correlation is equivalent to an element-wise product in Fourier domain"

$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$

• where

– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)

– $\times$ is element-wise product

– $*$ is complex-conjugate (i.e. negate imaginary part)

• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
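The identity can be checked numerically on a tiny 1-D signal. The sketch below is only an illustration: it uses a naive $O(n^2)$ DFT written with the standard library so that it stays self-contained, and the vectors `x` and `w` are arbitrary example values. A real tracker would use an FFT routine.

```python
import cmath

def dft(v, inverse=False):
    """Naive O(n^2) DFT, enough for a tiny illustration."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out

def circular_correlation(x, w):
    """Direct definition: y[s] = sum_t x[t] * w[(t + s) mod n]."""
    n = len(x)
    return [sum(x[t] * w[(t + s) % n] for t in range(n)) for s in range(n)]

x = [1.0, 2.0, 0.0, -1.0]
w = [0.5, 0.0, 1.0, 3.0]

direct = circular_correlation(x, w)
xf, wf = dft(x), dft(w)
# conj(x_hat) * w_hat, element-wise, then back to the signal domain
product = [a.conjugate() * b for a, b in zip(xf, wf)]
via_fourier = [c.real for c in dft(product, inverse=True)]

print(direct)
print([round(v, 6) for v in via_fourier])
```

Both lists agree up to rounding, including the wrap-around terms that make the correlation cyclic.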

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows replacing standard operations with fast Fourier operations

• The same idea can be applied, e.g., to Kernel Ridge Regression, with kernel matrix $K$, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and dual space representation

$$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$$

• For many kernels: circulant data $\Rightarrow$ circulant $K$ matrix

$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields

$$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$$

Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
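A small numerical check shows that the Fourier-domain division solves the same system as the dense formula $(K + \lambda I)\boldsymbol{\alpha} = \mathbf{y}$. This is a sketch under assumptions: a 1-D signal, a linear kernel (so $\mathbf{k}^{\mathbf{xx}}$ is the plain auto-correlation), and a naive DFT in place of an FFT, which means it demonstrates correctness rather than the $\mathcal{O}(n \log n)$ speed. The names `train`, `x`, `y` and `lam` are illustrative.

```python
import cmath

def dft(v, inverse=False):
    """Naive O(n^2) DFT, used here instead of an FFT to stay dependency-free."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out

def train(x, y, lam):
    """alpha_hat = y_hat / (k_hat + lambda), with a linear kernel (assumption)."""
    n = len(x)
    # kernel auto-correlation k^{xx}: dot product of x with each cyclic shift of x
    k = [sum(x[t] * x[(t + s) % n] for t in range(n)) for s in range(n)]
    alpha_hat = [yh / (kh + lam) for yh, kh in zip(dft(y), dft(k))]
    return k, [a.real for a in dft(alpha_hat, inverse=True)]

x = [1.0, 3.0, 2.0, 0.0, -1.0]
y = [1.0, 0.5, 0.0, 0.0, 0.5]   # regression target peaked at shift 0
lam = 0.1

k, alpha = train(x, y, lam)

# Check against the dense system with the circulant K[i][j] = k[(j - i) mod n]
n = len(x)
residual = [sum(k[(j - i) % n] * alpha[j] for j in range(n)) + lam * alpha[i] - y[i]
            for i in range(n)]
print(max(abs(r) for r in residual))
```

The residual is at the level of floating-point rounding, so no explicit matrix inverse is needed.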

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$

• For the Gaussian kernel it yields

$$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^{2}}\left(\lVert\mathbf{x}\rVert^{2} + \lVert\mathbf{x}'\rVert^{2} - 2\,\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$$

(multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\mathcal{F}^{-1}$ term)

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:

1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$

2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}}\right)$

• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$

• Kernel allows integration of more complex and multi-channel features

$$k_{i}^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$$
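The Gaussian kernel correlation can be verified against its definition: entry $s$ should equal the Gaussian kernel between $\mathbf{x}$ and the cyclic shift of $\mathbf{x}'$ by $s$. The sketch below assumes 1-D single-channel signals and a naive DFT, and follows the formula as written on the slide without the extra normalisation by the number of elements that some implementations add. All names and values are illustrative.

```python
import cmath
import math

def dft(v, inverse=False):
    """Naive O(n^2) DFT, used here instead of an FFT to stay dependency-free."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out

def gaussian_kernel_correlation(x, z, sigma):
    """k[s] = exp(-(|x|^2 + |z|^2 - 2 * corr(x, z)[s]) / sigma^2), via the DFT."""
    product = [a.conjugate() * b for a, b in zip(dft(x), dft(z))]
    cross = [c.real for c in dft(product, inverse=True)]
    xx = sum(v * v for v in x)
    zz = sum(v * v for v in z)
    # clamp tiny negative values caused by rounding
    return [math.exp(-max(xx + zz - 2.0 * c, 0.0) / sigma ** 2) for c in cross]

def gaussian_kernel_direct(x, z, sigma):
    """Same quantity from the definition, one cyclic shift at a time."""
    n = len(x)
    return [math.exp(-sum((x[t] - z[(t + s) % n]) ** 2 for t in range(n)) / sigma ** 2)
            for s in range(n)]

x = [1.0, 2.0, 0.0, -1.0]
z = [-1.0, 1.0, 2.0, 0.0]   # x cyclically shifted by one sample

k_fft = gaussian_kernel_correlation(x, z, sigma=2.0)
k_ref = gaussian_kernel_direct(x, z, sigma=2.0)
best_shift = max(range(len(k_fft)), key=lambda s: k_fft[s])
print([round(v, 6) for v in k_fft], best_shift)
```

The response peaks at the shift that aligns the two signals, which is what the detection step exploits.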

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters

• code fits on one slide of the presentation

• uses HoG features (32 channels)

• ~300 FPS

• open-source (Matlab/Python/Java/C)

Training and detection (Matlab):

    function alphaf = train(x, y, sigma, lambda)
      % kernel ridge regression solved in the Fourier domain
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      % response map over all cyclic shifts of the search window z
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      % Gaussian kernel correlation; sum(..., 3) sums over the channel dimension
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end

Note: the sum over the channel dimension happens inside the kernel computation.

From KCF to Discriminative CF trackers

Basic

• Henriques et al. – CSK

– raw grayscale pixel values as features

• Henriques et al. – KCF

– HoG multi-channel features

Further work

• Danelljan et al. – DSST

– PCA-HoG + grayscale pixel features

– filters for translation and for scale (in the scale-space pyramid)

• Li et al. – SAMF

– HoG, color-naming and grayscale pixel features

– quantize scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. – SRDCF

– spatial regularization in the learning process

→ limits the boundary effect

→ penalizes filter coefficients depending on their spatial location

– allows the use of a much larger search region

– more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)

• Ma et al.

– features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers

– for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al. – MDNet, T-CNN

– CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? "The zero-order tracker"

Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution

• VOT2013 (ICCV 2013): 51 coauthors, 14 pages

• VOT2014 (ECCV 2014): 57 coauthors, 27 pages

• VOT2015 (ICCV 2015): 128 coauthors, 24 pages, + VOT-TIR paper (69 coauthors)

• VOT2016 (ECCV 2016): 141 coauthors, 44 pages, + VOT-TIR paper (70 coauthors)

[Figure: plot with axis ticks 1500 and 3000, annotated with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines]

VOT challenge evolution

• Gradual increase of dataset size

• Gradual refinement of dataset construction

• Gradual refinement of performance measures

• Gradual increase of tested trackers

Challenge | Perf. measures | Dataset size | Target box | Property | Trackers tested

VOT2013 | ranks, A, R | 16, s. manual | manual | per frame | 27

VOT2014 | ranks, A, R, EFO | 25, s. manual | manual | per frame | 38

VOT2015 | EAO, A, R, EFO | 60, fully auto | manual | per frame | 62 VOT, 24 VOT-TIR

VOT2016 | EAO, A, R, EFO | 60, fully auto | auto | per frame | 70 VOT, 24 VOT-TIR
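As a rough illustration of the A (accuracy) and R (robustness) measures listed above: accuracy in the VOT methodology is based on the bounding-box overlap while the target is tracked, and robustness on how often the tracker fails. The sketch below is a simplification and not the VOT toolkit: it assumes axis-aligned `(x, y, width, height)` boxes, treats zero overlap as a failure, and omits the re-initialisation protocol. The box values are made up.

```python
def iou(a, b):
    """Overlap (intersection over union) of two axis-aligned boxes (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

ground_truth = [(10, 10, 20, 20), (12, 10, 20, 20), (14, 10, 20, 20)]
predicted = [(10, 10, 20, 20), (17, 10, 20, 20), (60, 60, 20, 20)]

overlaps = [iou(g, p) for g, p in zip(ground_truth, predicted)]
failures = sum(1 for o in overlaps if o == 0.0)   # simplified robustness count
tracked = [o for o in overlaps if o > 0.0]
accuracy = sum(tracked) / len(tracked)            # mean overlap while tracking

print(overlaps, failures, round(accuracy, 3))
```

Reporting the two numbers separately keeps a tracker that drifts slowly distinguishable from one that is precise but loses the target often.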

Class of trackers tested

• Single-object, single-camera

• Short-term

– Trackers performing without re-detection

• Causality

– Tracker is not allowed to use any future frames

• No prior knowledge about the target

– Only a single training example: BBox in the first frame

• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

ALOV (315 seq.) [Smeulders et al. 2013] + PTR (~50 seq.) [Vojir et al. 2013] + OTB (~50 seq.) [Wu et al. 2013] + >30 new sequences from the VOT2015 committee

→ 443 sequences

Filtered out:

• Grayscale sequences

• <400 pixels targets

• Poorly-defined targets

• Artificially created sequences

[Figure examples: poorly defined target; artificially created sequence]

→ 356 sequences

→ VOT Automatic Dataset Construction Protocol: cluster + sample

Construction (2/3): Clustering

• Approximately annotate targets

• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)

• Cluster into K = 28 groups (automatic selection of K)

Feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → Cluster similar sequences

Construction (3/3): Sampling

• Requirement:

– Diverse visual attributes

– Challenging subset

• Global visual attributes: computed

• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers

• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement

1. Segment the target (semi-automatic)

2. Automatically fit a bounding box by optimizing a cost function

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)

• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty – better ground truth

• Each frame manually per-pixel segmented

• B-boxes automatically generated from the segmentation

VOT Results Realtime

• Flow-based, Mean Shift-based, Correlation filters

• Engineering, use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)

2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)

2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN

• Most accurate: SSAT

• Most robust: C-COT and MLDF

Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (accuracy axis) with C-COT, TCNN, SSAT and MLDF marked]

VOT 2016 Tracking speed

• Top performers are the slowest

– Plausible cause: CNN

• Real-time bound: Staple+

– Decent accuracy

– Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked]

VOT public resources

• Resources publicly available: VOT page

• Raw results of all tested trackers

• Relevant methodology papers

• 2016: submitted trackers code/binaries

• All fully annotated datasets (2013-2016)

• Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems

• The area is just starting to be affected by CNNs

• Robustness at all levels is the road to reliable performance

• Key components of trackers:

– target learning (modelling, "template update")

– integration of detection and temporal smoothness assumptions

– representation of the image and target

• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)

• http://cs.brown.edu/courses/cs143/

• https://www.udacity.com/course/introduction-to-computer-vision--ud810

• http://cs231n.stanford.edu/

• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception, Department of Cybernetics, Faculty of Electrical Engineering, Czech Technical University, Prague (established 1707)

Area of Expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 Academics (3 full, 2 associate, 7 assistant professors), 15 Researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please?


Original Lucas-Kanade summary

slide credit Tomas Svoboda 54150

KLT Tracker

For good conditioning patch must be

texturedstructured enough

bull Uniform patch no information

bull Contour element aperture problem (one dimensional

information)

bull Corners blobs and texture best estimate

[Lucas and Kanade 1981][Tomasi and Shi CVPRrsquo94]

slide credit

Patrick Perez

55150

Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

ndash Translation is usually sufficient for small fragments but

bull Perspective transforms and occlusions cause drift and loss

ndash Two complementary options

bull Kill tracklets when minimum SSD too large

bull Compare as well with initial patch under affine transform (warp) assumption

slide credit

Patrick Perez

58150

Characteristics of KLT

bull cost function sum of squared intensity differences

between template and window

bull optimization technique gradient descent

bull model learning no update last frame convex

combination

bull attractive properties

ndashfast

ndasheasily extended to image-to-image transformations with

multiple parameters

59150

What about deep

learning or

CNN for tracking

60150

61150

AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo

Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016

Recap CaffeNet architecture

63

Recent trackers development

74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open-source (Matlab/Python/Java/C)

Training and detection (Matlab):

    function alphaf = train(x, y, sigma, lambda)
      % Kernel ridge regression solved element-wise in the Fourier domain
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      % Response map over all cyclic shifts of the search patch z
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      % Gaussian kernel; sum over the channel dimension (3) in the kernel computation
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end
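To make the training and detection steps concrete, here is a minimal 1-D, single-channel sketch in plain Python that mirrors the three Matlab functions above. It is an illustration under assumptions, not the authors' implementation: the naive `dft` helper, the toy signal `x`, the target `y` and the parameter values are all made up for the example. With the argument order used on the slide (`kernel_correlation(z, x, ...)`, conjugate on the first argument), a patch shifted right by `m` samples produces a response peak at index `-m` modulo `n`, so the shift is read back as the negated peak index.

```python
import cmath
import math

def dft(v, inverse=False):
    """Naive O(n^2) DFT with the same sign convention as fft/ifft (toy sizes only)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [c / n for c in out] if inverse else out

def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x1 and every cyclic shift of x2."""
    n = len(x1)
    f1, f2 = dft(x1), dft(x2)
    # c[s] = sum_t x1[t] * x2[t + s], computed in the Fourier domain
    c = [v.real for v in dft([a.conjugate() * b for a, b in zip(f1, f2)], inverse=True)]
    energy = sum(a * a for a in x1) + sum(b * b for b in x2)
    return [math.exp(-abs(energy - 2.0 * cs) / (sigma ** 2 * n)) for cs in c]

def train(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat + lambda), element-wise."""
    k = kernel_correlation(x, x, sigma)
    return [yf / (kf + lam) for yf, kf in zip(dft(y), dft(k))]

def detect(alphaf, x, z, sigma):
    """Response for all cyclic shifts of the search patch z."""
    k = kernel_correlation(z, x, sigma)
    return [v.real for v in dft([a * b for a, b in zip(alphaf, dft(k))], inverse=True)]

n = 8
x = [0.0, 1.0, 3.0, 1.0, 0.0, 0.0, -2.0, 0.0]            # toy "appearance model"
# Regression target: Gaussian bump centred on index 0, wrapped cyclically
y = [math.exp(-0.5 * min(i, n - i) ** 2) for i in range(n)]
alphaf = train(x, y, sigma=0.5, lam=1e-4)

m = 3
z = x[-m:] + x[:-m]                                      # z[t] = x[t - m]
response = detect(alphaf, x, z, sigma=0.5)
peak = max(range(n), key=response.__getitem__)
shift = (-peak) % n                                      # sign convention noted above
print(peak, shift)
```

A real tracker works on 2-D multi-channel patches with a fast FFT; the naive transform here only keeps the sketch dependency-free.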

From KCF to Discriminative CF trackers

Basic

• Henriques et al. – CSK
  – raw grayscale pixel values as features
• Henriques et al. – KCF
  – HoG multi-channel features

Further work

• Danelljan et al. – DSST
  – PCA-HoG + grayscale pixel features
  – filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
  – HoG, color-naming and grayscale pixel features
  – quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. – SRDCF
  – spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  – allows the use of a much larger search region
  – more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  – features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  – for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al. – MDNet, T-CNN
  – CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? "The zero-order tracker"

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

• VOT2013 (ICCV2013): 51 coauthors, 14 pages
• VOT2014 (ECCV2014): 57 coauthors, 27 pages
• VOT2015 (ICCV2015): 128 coauthors, 24 pages
• VOT2016 (ECCV2016): 141 coauthors, 44 pages
• plus two VOT-TIR papers (69 and 70 coauthors)

[Figure: timeline with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked]

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

| Challenge | Perf. measures | Dataset size | Target box | Property | Trackers tested |
| VOT2013 | ranks, A, R | 16, s. manual | manual | per frame | 27 |
| VOT2014 | ranks, A, R, EFO | 25, s. manual | manual | per frame | 38 |
| VOT2015 | EAO, A, R, EFO | 60, fully auto | manual | per frame | 62 VOT, 24 VOT-TIR |
| VOT2016 | EAO, A, R, EFO | 60, fully auto | auto | per frame | 70 VOT, 24 VOT-TIR |
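In the table above, A and R stand for accuracy and robustness. As a rough illustration, accuracy is usually computed from the overlap (intersection over union) between the predicted and ground-truth boxes, and robustness from the number of tracking failures. The sketch below is a simplification under assumptions: boxes are axis-aligned `(x, y, w, h)` tuples, a failure is counted whenever the overlap drops to zero, and the tracker re-initialisation used by the real VOT protocol is left out. The function names are hypothetical.

```python
def overlap(a, b):
    """Intersection over union of two axis-aligned boxes given as (x, y, w, h)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over frames where the target is still tracked, plus failure count."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures

gt = [(0, 0, 10, 10)] * 3
pred = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
acc, fails = accuracy_and_failures(pred, gt)
print(acc, fails)   # overlaps are 1, 1/3 and 0: mean over tracked frames is 2/3, one failure
```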

Class of trackers tested

• Single-object, single-camera
• Short-term
  – Trackers performing without re-detection
• Causality
  – Tracker is not allowed to use any future frames
• No prior knowledge about the target
  – Only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

Candidate pool (443 sequences):
• ALOV (315 seq.) [Smeulders et al., 2013]
• + PTR (~50 seq.) [Vojir et al., 2013]
• + OTB (~50 seq.) [Wu et al., 2013]
• + >30 new sequences from the VOT2015 committee

Filtered out:
• Grayscale sequences
• Targets of <400 pixels
• Poorly-defined targets
• Artificially created sequences

→ 356 sequences → VOT Automatic Dataset Construction Protocol: cluster + sample

[Figure: example of a poorly defined target; example of an artificially created sequence]

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for the 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)

Pipeline: feature encoding (11-dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences

Construction (3/3): Sampling

• Requirement:
  – Diverse visual attributes
  – Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied the FoT, ASMS and KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
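The slide does not state which cost function is optimized when fitting the box. As a purely hypothetical stand-in, the sketch below fits an axis-aligned box to a binary segmentation mask by brute force, trading off target pixels left outside the box against background pixels enclosed by it. The function name `fit_box`, the weight `alpha` and the toy mask are assumptions made for illustration.

```python
def fit_box(mask, alpha=1.0):
    """Brute-force search for the box minimising an assumed cost:
    (#target pixels outside the box) + alpha * (#background pixels inside the box).
    Returns ((left, top, right, bottom), cost) with inclusive pixel coordinates."""
    h, w = len(mask), len(mask[0])
    total_fg = sum(sum(row) for row in mask)
    best_cost, best_box = None, None
    for top in range(h):
        for bottom in range(top, h):
            for left in range(w):
                for right in range(left, w):
                    fg_in = sum(mask[r][c]
                                for r in range(top, bottom + 1)
                                for c in range(left, right + 1))
                    area = (bottom - top + 1) * (right - left + 1)
                    cost = (total_fg - fg_in) + alpha * (area - fg_in)
                    if best_cost is None or cost < best_cost:
                        best_cost, best_box = cost, (left, top, right, bottom)
    return best_box, best_cost

# 3x3 target with a one-pixel protrusion on its right side
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
box, cost = fit_box(mask)
print(box, cost)   # the box hugs the 3x3 body and leaves the protrusion outside
```

With this cost a thin protrusion is cheaper to leave out than to enclose, so the box hugs the bulk of the target rather than its extremal pixels.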

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

Fastest trackers per year:
• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF

Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (accuracy axis) with C-COT, TCNN, SSAT and MLDF marked]

VOT 2016: Tracking speed

• Top performers are the slowest
  – Plausible cause: CNN
• Real-time bound: Staple+
  – Decent accuracy
  – Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure labels: C-COT, TCNN, SSAT, MLDF, Staple+]

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers:
  – target learning (modelling, "template update")
  – integration of detection and temporal smoothness assumptions
  – representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague
established 1707

Area of expertise: computer vision, image processing, pattern recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & excellence: significant share of funding (>75%) from the EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

KLT Tracker

For good conditioning, the patch must be textured/structured enough:

• Uniform patch: no information
• Contour element: aperture problem (one-dimensional information)
• Corners, blobs and texture: best estimate

[Lucas and Kanade 1981] [Tomasi and Shi, CVPR'94]

Slide credit: Patrick Perez
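The three cases can be checked numerically. In the Lucas-Kanade / Shi-Tomasi setting, conditioning is usually judged by the eigenvalues of the 2x2 matrix of summed gradient products over the patch (the structure tensor). The sketch below, with a hypothetical helper name and tiny hand-made 5x5 patches, uses central differences on the patch interior; it is an illustration, not the detector from the cited papers.

```python
import math

def structure_tensor_eigs(patch):
    """Eigenvalues (small, large) of sum over the patch of [[Ix*Ix, Ix*Iy], [Ix*Iy, Iy*Iy]]."""
    h, w = len(patch), len(patch[0])
    gxx = gxy = gyy = 0.0
    for r in range(1, h - 1):
        for c in range(1, w - 1):
            ix = (patch[r][c + 1] - patch[r][c - 1]) / 2.0   # horizontal gradient
            iy = (patch[r + 1][c] - patch[r - 1][c]) / 2.0   # vertical gradient
            gxx += ix * ix
            gxy += ix * iy
            gyy += iy * iy
    half_trace = (gxx + gyy) / 2.0
    det = gxx * gyy - gxy * gxy
    disc = math.sqrt(max(half_trace * half_trace - det, 0.0))
    return half_trace - disc, half_trace + disc

uniform = [[5.0] * 5 for _ in range(5)]
edge = [[0.0, 0.0, 1.0, 1.0, 1.0] for _ in range(5)]
corner = [[1.0 if r >= 2 and c >= 2 else 0.0 for c in range(5)] for r in range(5)]

print(structure_tensor_eigs(uniform))  # (0.0, 0.0): no information
print(structure_tensor_eigs(edge))     # (0.0, 1.5): aperture problem, one direction only
print(structure_tensor_eigs(corner))   # (0.75, 1.25): both eigenvalues large, well conditioned
```

A patch is a good feature to track when the smaller eigenvalue is well above the noise level.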

Aperture Problem Example

Image from Gary Bradski slides

If we look through a small hole…

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

– Arbitrary displacement
  • Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

– Translation is usually sufficient for small fragments, but:
  • Perspective transforms and occlusions cause drift and loss
– Two complementary options:
  • Kill tracklets when the minimum SSD is too large
  • Compare as well with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window
• optimization technique: gradient descent
• model learning: no update / last frame / convex combination
• attractive properties:
  – fast
  – easily extended to image-to-image transformations with multiple parameters
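As a toy illustration of these characteristics, the sketch below estimates a pure 1-D translation by minimising the sum of squared differences with Gauss-Newton-style steps, the iteration commonly used for Lucas-Kanade alignment. Everything in it is assumed for the example: the "frames" are continuous functions (a Gaussian bump and its shifted copy) so that no interpolation is needed, and the gradient is taken by central differences. The final SSD is the quantity a tracker would threshold to kill a bad tracklet.

```python
import math

def bump(x, centre=20.0, width=3.0):
    """Toy 1-D 'image': a Gaussian intensity bump."""
    return math.exp(-((x - centre) ** 2) / (2.0 * width ** 2))

def ssd(template, image, xs, p):
    """Sum of squared intensity differences between template and shifted window."""
    return sum((template(x) - image(x + p)) ** 2 for x in xs)

def lucas_kanade_1d(template, image, xs, p=0.0, iters=10, h=1e-4):
    """Find p such that image(x + p) matches template(x) over the window xs."""
    for _ in range(iters):
        num = den = 0.0
        for x in xs:
            grad = (image(x + p + h) - image(x + p - h)) / (2.0 * h)
            residual = template(x) - image(x + p)
            num += grad * residual
            den += grad * grad
        if den < 1e-12:
            break          # untextured window: the update is ill-conditioned
        step = num / den   # Gauss-Newton update for a single translation parameter
        p += step
        if abs(step) < 1e-9:
            break
    return p

true_shift = 1.5
frame0 = bump

def frame1(x):
    """Second frame: the same bump moved by true_shift."""
    return bump(x - true_shift)

xs = range(41)
p = lucas_kanade_1d(frame0, frame1, xs)
print(round(p, 6), ssd(frame0, frame1, xs, 0.0) > ssd(frame0, frame1, xs, p))
```

For displacements larger than the structure in the window, the same update is typically run coarse-to-fine on an image pyramid.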

What about deep learning or CNN for tracking?

AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012
CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014
Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"

Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016

Recap: CaffeNet architecture

Recent trackers development

https://github.com/foolwood/benchmark_results

[Figure labels: CNN, KCF]

Discriminative Tracking (T by Detection)

• t = 0: samples with labels +1 +1 +1 −1 −1 −1 → Classifier
• t > 0: … classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

"Cross-correlation is equivalent to an element-wise product in Fourier domain"

$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$

• where
  – $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  – $\times$ is the element-wise product
  – $*$ is the complex conjugate (i.e. negate the imaginary part)

• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
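The equivalence is easy to verify numerically on a short 1-D signal. The sketch below compares cyclic cross-correlation computed directly from its definition with the Fourier-domain route. The naive `dft` helper and the toy vectors are assumptions chosen to keep the example dependency-free; a real implementation would use an FFT.

```python
import cmath

def dft(v, inverse=False):
    """Naive O(n^2) discrete Fourier transform (fine for a tiny demo)."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out

def circular_cross_correlation(x, w):
    """Direct definition: y[s] = sum_t x[t] * w[(t + s) mod n]."""
    n = len(x)
    return [sum(x[t] * w[(t + s) % n] for t in range(n)) for s in range(n)]

def cross_correlation_via_dft(x, w):
    """Fourier route: y = F^-1( conj(F(x)) * F(w) ), element-wise product."""
    xf, wf = dft(x), dft(w)
    yf = [a.conjugate() * b for a, b in zip(xf, wf)]
    return [v.real for v in dft(yf, inverse=True)]

x = [1.0, 2.0, 0.0, -1.0, 3.0]
w = x[-2:] + x[:-2]                      # x cyclically shifted right by 2
direct = circular_cross_correlation(x, w)
fast = cross_correlation_via_dft(x, w)
peak = max(range(len(direct)), key=direct.__getitem__)
print(direct, peak)                      # [3.0, 2.0, 15.0, 2.0, 3.0] 2
```

The peak of the correlation sits at the cyclic shift between the two signals, which is what correlation-filter trackers exploit to locate the target.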


Aperture Problem Example

56150Image from Gary Bradski slides

If we look through small holehellip

Video by Olha Mishkina

Multi-resolution Lucas-Kanade

ndash Arbitrary displacement

bull Multi-resolution approach Gauss-Newton like approximation down image

pyramid

57150Slide credit James Hays

Monitoring quality

ndash Translation is usually sufficient for small fragments but

bull Perspective transforms and occlusions cause drift and loss

ndash Two complementary options

bull Kill tracklets when minimum SSD too large

bull Compare as well with initial patch under affine transform (warp) assumption

slide credit

Patrick Perez

58150

Characteristics of KLT

bull cost function sum of squared intensity differences

between template and window

bull optimization technique gradient descent

bull model learning no update last frame convex

combination

bull attractive properties

ndashfast

ndasheasily extended to image-to-image transformations with

multiple parameters

59150

What about deep

learning or

CNN for tracking

60150

61150

AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo

Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016

Recap CaffeNet architecture

63

Recent trackers development

74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic

bull Henriques et al ndash CSK

ndash raw grayscale pixel values as features

bull Henriques et al ndash KCF

ndash HoG multi-channel features

Further work

bull Danelljan et al ndash DSST

ndash PCA-HoG + grayscale pixels features

ndash filters for translation and for scale (in the scale-space pyramid)

bull Li et al ndash SAMF

ndash HoG color-naming and grayscale pixels features

ndash quantize scale space and normalize each scale to one size by bilinear

inter rarr only one filter on normalized size

80

Discriminative Correlation Filters Trackers

bull Danelljan et al ndashSRDCF

ndash spatial regularization in the learning process

rarr limits boundary effect

rarr penalize filter coefficients depending on their spatial location

ndash allows to use much larger search region

ndash more discriminative to background (more training data)

CNN-based Correlation Trackers

bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT

2016)

bull Ma et al

ndash features VGG-Net pretrained on ImageNet dataset extracted from

third fourth and fifth convolution layer

ndash for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)

bull Nam et al ndash MDNet T-CNN

ndash CNN classification (3 convolution layers and 2 fully connected layers)

81

Evaluation of Trackers

Tracking Which methods work

83150

Tracking Which methods work

84150

What works ldquoThe zero-order trackerrdquo

85150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

3000

1500

51 Coauthors 14pgs

ICCV2013

57 Coauthors 27pgs

ECCV2014

128 Coauthors 24pgs

ICCV2015

141 Coauthors 44pgs

ECCV2016

+ VOT-TIRpaper

(69 coauth)

+ VOT-TIRpaper

(70 coauth)

VOT2014submission deadline

VOT2015submission deadline

VOT2016submission deadline

VOT2013submission deadline

8639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT challenge evolution

bull Gradual increase of dataset size

bull Gradual refinement of dataset construction

bull Gradual refinement of performance measures

bull Gradual increase of tested trackers

Perf Measures Dataset size Target box Property Trackers tested

VOT2013 ranks A R 16 s manual manual per frame 27

VOT2014 ranks A R EFO 25 s manual manual per frame 38

VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR

VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR

8739

Class of trackers tested

bull Single-object single-camera

bull Short-term

ndashTrackers performing without re-detection

bull Causality

ndashTracker is not allowed to use any future frames

bull No prior knowledge about the target

ndashOnly a single training example ndash BBox in the first frame

bull Object state encoded by a bounding box

88150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (13) Sequence candidates

ALOV (315 seq)[Smeulders et al2013]

Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences

Example Poorly defined target Example Artificially created

356 sequences

PTR (~50 seq)[Vojir et al2013]

+OTB (~50 seq)

[Wu et al2013]

+

gt30 new sequencesfrom VOT2015

committee

+

443sequences

VOT Automatic Dataset Construction Protocol

cluster + sample

8939

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (23) Clustering

bull Approximately annotate targets

bull 11 global attributes estimated

automatically for 356 sequences (eg blur camera motion object motion)

bull Cluster into K = 28 groups (automatic selection of K)

Feature encoding

11 dim

Affinity Propagation [Frey Dueck 2007]

Cluster similar sequences

9039

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (33) Sampling

bull Requirement

bull Diverse visual attributes

bull Challenging subset

bull Global visual attributes computed

bull Tracking difficulty attribute Applied FoT ASMS KCF trackers

bull Developed a sampling strategy that sampled

challenging sequences while keeping the global

attributes diverse

9139

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT201516 dataset 60 sequences

9239

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Object annotation

bull Automatic bounding box placement

1 Segment the target (semi-automatic)

2 Automatically fit a bounding box by optimizing a cost function

9339

Kristan et al VOT2016 results

Sequence ranking

bull Among the most challenging sequences

Multi-resolution Lucas-Kanade

– Arbitrary displacement

• Multi-resolution approach: Gauss-Newton-like approximation down the image pyramid

Slide credit: James Hays

Monitoring quality

– Translation is usually sufficient for small fragments, but:

• Perspective transforms and occlusions cause drift and loss

– Two complementary options:

• Kill tracklets when the minimum SSD is too large

• Also compare with the initial patch under an affine transform (warp) assumption

Slide credit: Patrick Perez
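The first option above (dropping a tracklet when even its best match is poor) can be sketched as follows. This is a minimal Python sketch and not code from the lecture: patches are assumed to be flat lists of grey values, and the names `ssd`, `should_kill_tracklet` and the threshold `max_ssd` are hypothetical.

```python
def ssd(patch_a, patch_b):
    """Sum of squared intensity differences between two equally sized patches."""
    if len(patch_a) != len(patch_b):
        raise ValueError("patches must have the same number of pixels")
    return sum((a - b) ** 2 for a, b in zip(patch_a, patch_b))


def should_kill_tracklet(template, candidates, max_ssd):
    """Return True when even the best candidate window matches the template poorly."""
    best = min(ssd(template, candidate) for candidate in candidates)
    return best > max_ssd


template = [10, 20, 30, 40]
good_frame = [[11, 19, 30, 41], [90, 90, 90, 90]]   # first window is a close match
occluded_frame = [[90, 90, 90, 90], [0, 0, 0, 0]]   # no window resembles the template

print(should_kill_tracklet(template, good_frame, max_ssd=50))      # False
print(should_kill_tracklet(template, occluded_frame, max_ssd=50))  # True
```

The threshold has to be tuned per application; the second option on the slide (comparison with the initial patch under an affine warp) is not covered by this sketch.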

Characteristics of KLT

• cost function: sum of squared intensity differences between template and window

• optimization technique: gradient descent

• model learning: no update / last frame / convex combination

• attractive properties:

– fast

– easily extended to image-to-image transformations with multiple parameters
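To make the cost function and its iterative minimisation concrete, here is a minimal one-dimensional sketch of a translation-only Lucas-Kanade step. It is an illustration under stated assumptions, not the lecture's implementation: the signal is 1-D, sub-sample values come from linear interpolation, the update is the Gauss-Newton step commonly used for Lucas-Kanade (the slide itself says "gradient descent"), and all names (`sample`, `lk_translation_1d`, `bump`) are hypothetical.

```python
import math


def sample(signal, pos):
    """Linearly interpolate the signal at a fractional position (clamped at the ends)."""
    pos = min(max(pos, 0.0), len(signal) - 1.0)
    lo = min(int(math.floor(pos)), len(signal) - 2)
    frac = pos - lo
    return (1.0 - frac) * signal[lo] + frac * signal[lo + 1]


def lk_translation_1d(template, signal, start, iterations=30):
    """Estimate d minimising sum_i (signal[start + i + d] - template[i])^2."""
    d = 0.0
    for _ in range(iterations):
        numerator = 0.0
        denominator = 0.0
        for i, target in enumerate(template):
            pos = start + i + d
            # Central-difference gradient of the current frame at the warped position.
            grad = 0.5 * (sample(signal, pos + 1.0) - sample(signal, pos - 1.0))
            residual = target - sample(signal, pos)
            numerator += grad * residual
            denominator += grad * grad
        if denominator < 1e-12:
            break  # no texture in the window, displacement is not observable
        step = numerator / denominator
        d += step
        if abs(step) < 1e-6:
            break
    return d


def bump(x, centre, width=3.0):
    return math.exp(-((x - centre) ** 2) / (2.0 * width ** 2))


previous = [bump(x, 20.0) for x in range(50)]
current = [bump(x, 21.3) for x in range(50)]   # same bump moved right by 1.3 samples
template = previous[12:29]                      # window around the bump in the first frame

shift = lk_translation_1d(template, current, start=12)
print(round(shift, 2))
```

The estimate should land close to the true displacement of 1.3 samples; the linearisation only holds for displacements that are small relative to the image structure, which is what the multi-resolution variant addresses.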

What about deep learning or CNN for tracking?

Recap: CaffeNet architecture

AlexNet (original): Krizhevsky et al., "ImageNet Classification with Deep Convolutional Neural Networks", 2012

CaffeNet: Jia et al., "Caffe: Convolutional Architecture for Fast Feature Embedding", 2014

Image credit: Roberto Matheus Pinheiro Pereira, "Deep Learning Talk"

Srinivas et al., "A Taxonomy of Deep Convolutional Neural Nets for Computer Vision", 2016

Recent trackers development

https://github.com/foolwood/benchmark_results

[Figure labels: CNN, KCF]

Discriminative Tracking (Tracking by Detection)

t = 0: samples with labels +1, +1, +1, -1, -1, -1 are used to train the classifier

t > 0: … classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

"Cross-correlation is equivalent to an element-wise product in the Fourier domain"

$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$

• where

– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)

– $\times$ is the element-wise product

– $^{*}$ is the complex conjugate (i.e. negate the imaginary part)

• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
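The theorem can be checked numerically on a short signal. The sketch below is only an illustration: it uses a naive $\mathcal{O}(n^2)$ DFT written with the standard library so that it stays self-contained (a real tracker would call an FFT routine), and the function names are hypothetical.

```python
import cmath


def dft(values):
    n = len(values)
    return [sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def cyclic_cross_correlation(x, w):
    """Direct definition: y[s] = sum_t x[t] * w[(t + s) mod n]."""
    n = len(x)
    return [sum(x[t] * w[(t + s) % n] for t in range(n)) for s in range(n)]


def cross_correlation_via_dft(x, w):
    """Same result through the Fourier domain: y_hat = conj(x_hat) * w_hat."""
    y_hat = [a.conjugate() * b for a, b in zip(dft(x), dft(w))]
    return [value.real for value in idft(y_hat)]


x = [1.0, 2.0, 3.0, 4.0]
w = [0.0, 1.0, 0.5, 0.0]

direct = cyclic_cross_correlation(x, w)
fast = cross_correlation_via_dft(x, w)
print(direct)  # [3.5, 2.0, 4.5, 5.0]
print(max(abs(a - b) for a, b in zip(direct, fast)))
```

The second printed value is the largest disagreement between the two computations and should be at the level of floating-point rounding. The index arithmetic `(t + s) % n` is where the cyclic wrap-around mentioned on the slide shows up.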

Kernelized Correlation Filters

• Circulant matrices are a very general tool that allows standard operations to be replaced with fast Fourier operations

• The same idea can be applied, e.g., to Kernel Ridge Regression:

$$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$$

with $K$ the kernel matrix, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and $\boldsymbol{\alpha}$ the dual space representation

• For many kernels: circulant data $\Rightarrow$ circulant $K$ matrix

$$K = C(\mathbf{k}^{\mathbf{x}\mathbf{x}})$$

where $\mathbf{k}^{\mathbf{x}\mathbf{x}}$ is the kernel auto-correlation and the first row of $K$ (small and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields

$$\hat{\boldsymbol{\alpha}} = \frac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{x}\mathbf{x}} + \lambda}$$

Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
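A small numerical check may help to see why the element-wise division is enough. The sketch below assumes a 1-D signal and a linear kernel, $k^{\mathbf{x}\mathbf{x}}_m = \mathbf{x}^\top P^m \mathbf{x}$, computes $\boldsymbol{\alpha}$ in the Fourier domain, and then verifies that it satisfies $(K + \lambda I)\boldsymbol{\alpha} = \mathbf{y}$ with the circulant $K$ built explicitly. The naive DFT is used only to keep the example dependency-free (so it does not show the $\mathcal{O}(n \log n)$ cost), and all function names are hypothetical.

```python
import cmath


def dft(values):
    n = len(values)
    return [sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def linear_kernel_autocorrelation(x):
    """k[m] = <x, x cyclically shifted by m>; this is the first row of the circulant K."""
    n = len(x)
    return [sum(x[t] * x[(t + m) % n] for t in range(n)) for m in range(n)]


def train_in_fourier_domain(x, y, lam):
    """alpha_hat = y_hat / (k_hat + lambda), transformed back to the signal domain."""
    k_hat = dft(linear_kernel_autocorrelation(x))
    alpha_hat = [y_k / (k_k + lam) for y_k, k_k in zip(dft(y), k_hat)]
    return [value.real for value in idft(alpha_hat)]


def apply_regularised_kernel_matrix(x, alpha, lam):
    """Compute (K + lambda * I) alpha with K[i][j] = k[(j - i) mod n] built explicitly."""
    k = linear_kernel_autocorrelation(x)
    n = len(x)
    return [sum(k[(j - i) % n] * alpha[j] for j in range(n)) + lam * alpha[i]
            for i in range(n)]


x = [1.0, 2.0, 0.0, -1.0, 3.0]
y = [1.0, 0.5, 0.0, 0.0, 0.5]
lam = 0.1

alpha = train_in_fourier_domain(x, y, lam)
reconstructed = apply_regularised_kernel_matrix(x, alpha, lam)
print(max(abs(a - b) for a, b in zip(reconstructed, y)))
```

The printed residual should be at rounding level, which confirms that no explicit matrix inversion is needed when $K$ is circulant.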

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{x}\mathbf{x}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:

$$k_i^{\mathbf{x}\mathbf{x}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$$

• For the Gaussian kernel it yields

$$\mathbf{k}^{\mathbf{x}\mathbf{x}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$$

(multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'$ term)

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:

1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{x}\mathbf{z}})$

2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{x}\mathbf{z}} \odot \hat{\boldsymbol{\alpha}}\right)$

• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$

• Kernel allows integration of more complex and multi-channel features
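The training, evaluation and update steps above can be tried on a toy 1-D signal. The following is a sketch under several assumptions rather than the tracker itself: single-channel data, a naive DFT from the standard library instead of an FFT, the exponent normalised by the number of samples (as in the Matlab listing on the KCF Tracker slide), and a hypothetical learning rate of 0.02. The argument order in `detect` follows the $\mathbf{k}^{\mathbf{x}\mathbf{z}}$ formula, so the index of the response peak equals the cyclic shift of the target.

```python
import cmath
import math


def dft(values):
    n = len(values)
    return [sum(values[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(spectrum):
    n = len(spectrum)
    return [sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x2 and every cyclic shift of x1 (entry s: shift by s)."""
    n = len(x1)
    cross = idft([a.conjugate() * b for a, b in zip(dft(x1), dft(x2))])
    energy = sum(v * v for v in x1) + sum(v * v for v in x2)
    return [math.exp(-abs(energy - 2.0 * c.real) / (sigma ** 2 * n)) for c in cross]


def train(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat^{xx} + lambda), kept in the Fourier domain."""
    k_hat = dft(kernel_correlation(x, x, sigma))
    return [y_k / (k_k + lam) for y_k, k_k in zip(dft(y), k_hat)]


def detect(alpha_hat, x, z, sigma):
    """Response f(z) = F^-1(k_hat^{xz} * alpha_hat) over all cyclic shifts."""
    k_hat = dft(kernel_correlation(x, z, sigma))
    return [v.real for v in idft([a * k for a, k in zip(alpha_hat, k_hat)])]


def interpolate(old, new, rate):
    """Linear interpolation used for both the model x and the classifier alpha_hat."""
    return [(1.0 - rate) * o + rate * v for o, v in zip(old, new)]


n = 16
x = [math.exp(-((t - 5) ** 2) / 4.0) for t in range(n)]         # model patch
y = [math.exp(-(min(t, n - t) ** 2) / 2.0) for t in range(n)]   # desired response, peak at 0
z = [x[(t - 3) % n] for t in range(n)]                          # same patch moved right by 3

alpha_hat = train(x, y, sigma=0.5, lam=1e-4)
response = detect(alpha_hat, x, z, sigma=0.5)
shift = response.index(max(response))
print(shift)  # 3

# Update step: re-centre the new observation, retrain, and blend with the old model.
new_x = [z[(t + shift) % n] for t in range(n)]
x_model = interpolate(x, new_x, rate=0.02)
alpha_hat = interpolate(alpha_hat, train(new_x, y, sigma=0.5, lam=1e-4), rate=0.02)
```

In a 2-D tracker the same steps run on image patches with a windowing function applied before the transform; that part is omitted here.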

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters

• code fits on one slide of the presentation

• Uses HoG features (32 channels)

• ~300 FPS

• Open-Source (Matlab/Python/Java/C)

Training and detection (Matlab)

    function alphaf = train(x, y, sigma, lambda)
      % Kernel ridge regression solved element-wise in the Fourier domain.
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      % Response map over all cyclic shifts of the search window z.
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      % Sum over the channel dimension (3) in the kernel computation.
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end

From KCF to Discriminative CF trackers

Basic

• Henriques et al. – CSK

– raw grayscale pixel values as features

• Henriques et al. – KCF

– HoG multi-channel features

Further work

• Danelljan et al. – DSST

– PCA-HoG + grayscale pixel features

– filters for translation and for scale (in the scale-space pyramid)

• Li et al. – SAMF

– HoG, color-naming and grayscale pixel features

– quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. – SRDCF

– spatial regularization in the learning process

→ limits the boundary effect

→ penalizes filter coefficients depending on their spatial location

– allows a much larger search region to be used

– more discriminative to background (more training data)

CNN-based Correlation Trackers

• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)

• Ma et al.

– features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers

– for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al. – MDNet, T-CNN

– CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? "The zero-order tracker"

Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution

[Figure: timeline of the VOT challenges and their results papers, with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked; vertical axis ticks at 1500 and 3000]

• ICCV2013: 51 coauthors, 14 pages

• ECCV2014: 57 coauthors, 27 pages

• ICCV2015: 128 coauthors, 24 pages

• ECCV2016: 141 coauthors, 44 pages

• plus two VOT-TIR papers (69 and 70 coauthors)


VOT challenge evolution

• Gradual increase of dataset size

• Gradual refinement of dataset construction

• Gradual refinement of performance measures

• Gradual increase of tested trackers

         Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013  ranks, A, R        16, s. manual    manual       per frame   27
VOT2014  ranks, A, R, EFO   25, s. manual    manual       per frame   38
VOT2015  EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016  EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR
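The A and R columns refer to accuracy and robustness. As a rough illustration, accuracy is usually understood as the average bounding-box overlap while the tracker is on target, and robustness as the number of failures. The sketch below is a simplification under stated assumptions: boxes are axis-aligned `(x, y, width, height)` tuples, a failure is any frame with zero overlap, and the re-initialisation after a failure used by the actual VOT protocol is not modelled. The function names are hypothetical.

```python
def overlap(box_a, box_b):
    """Intersection-over-union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    intersection = inter_w * inter_h
    union = aw * ah + bw * bh - intersection
    return intersection / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over frames with non-zero overlap, and the count of zero-overlap frames."""
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    failures = len(overlaps) - len(tracked)
    return accuracy, failures


ground_truth = [(0, 0, 10, 10), (2, 0, 10, 10), (4, 0, 10, 10)]
predicted = [(0, 0, 10, 10), (7, 0, 10, 10), (30, 30, 10, 10)]

accuracy, failures = accuracy_and_failures(predicted, ground_truth)
print(round(accuracy, 3), failures)  # 0.667 1
```

Here the first frame overlaps perfectly, the second by one third, and the third not at all, which gives a mean overlap of 2/3 on the tracked frames and one failure.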

Class of trackers tested

• Single-object, single-camera

• Short-term:

– Trackers performing without re-detection

• Causality:

– Tracker is not allowed to use any future frames

• No prior knowledge about the target

– Only a single training example – BBox in the first frame

• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

• ALOV (315 seq) [Smeulders et al., 2013]

• + PTR (~50 seq) [Vojir et al., 2013]

• + OTB (~50 seq) [Wu et al., 2013]

• + >30 new sequences from the VOT2015 committee

= 443 sequences

Filtered out:

• Grayscale sequences

• <400 pixels targets

• Poorly-defined targets

• Artificially created sequences

[Figure: example of a poorly defined target; example of an artificially created sequence]

→ 356 sequences

VOT Automatic Dataset Construction Protocol: cluster + sample

Construction (2/3): Clustering

• Approximately annotate targets

• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)

• Cluster into K = 28 groups (automatic selection of K)

[Figure: feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences]

Construction (3/3): Sampling

• Requirement:

– Diverse visual attributes

– Challenging subset

• Global visual attributes: computed

• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers

• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement

1. Segment the target (semi-automatic)

2. Automatically fit a bounding box by optimizing a cost function
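The slide does not state the cost function, so the sketch below uses a hypothetical one purely to illustrate step 2: the cost of a box is the number of background pixels inside it plus a weight times the number of target pixels left outside, and the best axis-aligned box is found by brute force. The names `box_cost`, `fit_box` and `outside_weight` are assumptions, and the actual VOT procedure may differ (for example in the form of the cost and in allowing rotated boxes).

```python
def box_cost(mask, top, left, bottom, right, outside_weight):
    """Hypothetical cost: background pixels inside + weight * target pixels outside."""
    background_inside = 0
    foreground_outside = 0
    for r, row in enumerate(mask):
        for c, value in enumerate(row):
            inside = top <= r <= bottom and left <= c <= right
            if inside and not value:
                background_inside += 1
            elif value and not inside:
                foreground_outside += 1
    return background_inside + outside_weight * foreground_outside


def fit_box(mask, outside_weight=1.0):
    """Exhaustive search over all axis-aligned boxes (inclusive pixel coordinates)."""
    rows, cols = len(mask), len(mask[0])
    best_cost, best_box = None, None
    for top in range(rows):
        for bottom in range(top, rows):
            for left in range(cols):
                for right in range(left, cols):
                    cost = box_cost(mask, top, left, bottom, right, outside_weight)
                    if best_cost is None or cost < best_cost:
                        best_cost, best_box = cost, (top, left, bottom, right)
    return best_box, best_cost


# Segmentation mask: a 3x3 target blob plus one stray pixel in the corner.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
]

box, cost = fit_box(mask)
print(box, cost)  # (1, 1, 3, 3) 1.0
```

The optimum ignores the stray pixel because enclosing it would add 15 background pixels to the box, which is the kind of trade-off a cost-based fit is meant to express compared with a plain tight bounding box of the mask.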

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)

• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty – better ground truth

• Each frame manually per-pixel segmented

• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters

• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)

2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)

2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN

• Most accurate: SSAT

• Most robust: C-COT and MLDF

Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (vertical axis: Accuracy) with C-COT, TCNN, SSAT and MLDF labelled]

VOT 2016: Tracking speed

• Top performers are the slowest

• Plausible cause: CNN

• Real-time bound: Staple+

• Decent accuracy

• Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure labels: C-COT, TCNN, SSAT, MLDF, Staple+]

VOT public resources

• Resources publicly available: VOT page

• Raw results of all tested trackers

• Relevant methodology papers

• 2016: submitted trackers code/binaries

• All fully annotated datasets (2013-2016)

• Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems

• The area is just starting to be affected by CNNs

• Robustness at all levels is the road to reliable performance

• Key components of trackers:

– target learning (modelling, "template update")

– integration of detection and temporal smoothness assumptions

– representation of the image and target

• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)

• http://cs.brown.edu/courses/cs143/

• https://www.udacity.com/course/introduction-to-computer-vision--ud810

• http://cs231n.stanford.edu/

• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception

Department of Cybernetics

Faculty of Electrical Engineering

Czech Technical University, Prague

established 1707

Area of Expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 Academics (3 full, 2 associate, 7 assistant professors), 15 Researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant % of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please?

Characteristics of KLT

bull cost function sum of squared intensity differences

between template and window

bull optimization technique gradient descent

bull model learning no update last frame convex

combination

bull attractive properties

ndashfast

ndasheasily extended to image-to-image transformations with

multiple parameters

59150

What about deep

learning or

CNN for tracking

60150

61150

AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo

Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016

Recap CaffeNet architecture

63

Recent trackers development

74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic

bull Henriques et al ndash CSK

ndash raw grayscale pixel values as features

bull Henriques et al ndash KCF

ndash HoG multi-channel features

Further work

bull Danelljan et al ndash DSST

ndash PCA-HoG + grayscale pixels features

ndash filters for translation and for scale (in the scale-space pyramid)

bull Li et al ndash SAMF

ndash HoG color-naming and grayscale pixels features

ndash quantize scale space and normalize each scale to one size by bilinear

inter rarr only one filter on normalized size

80

Discriminative Correlation Filters Trackers

bull Danelljan et al ndashSRDCF

ndash spatial regularization in the learning process

rarr limits boundary effect

rarr penalize filter coefficients depending on their spatial location

ndash allows to use much larger search region

ndash more discriminative to background (more training data)

CNN-based Correlation Trackers

bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT

2016)

bull Ma et al

ndash features VGG-Net pretrained on ImageNet dataset extracted from

third fourth and fifth convolution layer

ndash for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)

bull Nam et al ndash MDNet T-CNN

ndash CNN classification (3 convolution layers and 2 fully connected layers)

81

Evaluation of Trackers

Tracking Which methods work

83150

Tracking Which methods work

84150

What works ldquoThe zero-order trackerrdquo

85150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

3000

1500

51 Coauthors 14pgs

ICCV2013

57 Coauthors 27pgs

ECCV2014

128 Coauthors 24pgs

ICCV2015

141 Coauthors 44pgs

ECCV2016

+ VOT-TIRpaper

(69 coauth)

+ VOT-TIRpaper

(70 coauth)

VOT2014submission deadline

VOT2015submission deadline

VOT2016submission deadline

VOT2013submission deadline

8639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT challenge evolution

• Gradual increase of dataset size

• Gradual refinement of dataset construction

• Gradual refinement of performance measures

• Gradual increase of tested trackers

          Perf. measures     Dataset size     Target box   Property    Trackers tested
VOT2013   ranks, A, R        16, s. manual    manual       per frame   27
VOT2014   ranks, A, R, EFO   25, s. manual    manual       per frame   38
VOT2015   EAO, A, R, EFO     60, fully auto   manual       per frame   62 VOT, 24 VOT-TIR
VOT2016   EAO, A, R, EFO     60, fully auto   auto         per frame   70 VOT, 24 VOT-TIR
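The A and R columns in the table above are the accuracy and robustness measures. The slides do not spell them out, so the following is only a minimal sketch under the usual VOT reading: accuracy as the mean bounding-box overlap (intersection over union) on frames where the target is still tracked, and robustness as the number of failures (frames with zero overlap). The function names `overlap` and `accuracy_and_failures` are hypothetical, and the re-initialization after a failure performed by the real toolkit is not reproduced here.

```python
def overlap(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Simplified A/R: mean overlap on tracked frames, count of zero-overlap frames.

    Assumption: a frame with zero overlap counts as one failure; the actual
    toolkit resets the tracker after a failure, which this sketch ignores.
    """
    overlaps = [overlap(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


ground_truth = [(0, 0, 10, 10)] * 3
predicted = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
print(accuracy, failures)
```

In this toy run the first frame overlaps fully, the second overlaps by one third, and the third misses the target, so one failure is counted and only the first two frames contribute to the accuracy.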

Class of trackers tested

• Single-object, single-camera

• Short-term

– Trackers performing without re-detection

• Causality

– Tracker is not allowed to use any future frames

• No prior knowledge about the target

– Only a single training example – BBox in the first frame

• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

Sources:

• ALOV (315 seq.) [Smeulders et al., 2013]

• PTR (~50 seq.) [Vojir et al., 2013]

• OTB (~50 seq.) [Wu et al., 2013]

• >30 new sequences from the VOT2015 committee

→ 443 sequences

Filtered out:

• Grayscale sequences

• <400 pixels targets

• Poorly-defined targets

• Artificially created sequences

[Figures: example of a poorly defined target; example of an artificially created sequence]

→ 356 sequences

→ VOT Automatic Dataset Construction Protocol: cluster + sample

Construction (2/3): Clustering

• Approximately annotate targets

• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)

• Cluster into K = 28 groups (automatic selection of K)

Pipeline: feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences

Construction (3/3): Sampling

• Requirement:

– Diverse visual attributes

– Challenging subset

• Global visual attributes: computed

• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers

• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement

1. Segment the target (semi-automatic)

2. Automatically fit a bounding box by optimizing a cost function
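The slide does not state which cost function is optimized, so the following is only an illustrative sketch of step 2 under stated assumptions: the box is axis-aligned, the search is exhaustive over a small binary mask, and the cost is a weighted count of foreground pixels left outside the box plus the background pixels enclosed by it. The name `fit_box` and the parameter `outside_weight` are hypothetical.

```python
def fit_box(mask, outside_weight=1.0):
    """Search all axis-aligned boxes (top, left, bottom, right), bounds inclusive.

    Assumed cost (not taken from the slides):
        outside_weight * (foreground pixels outside the box)
        + (background pixels inside the box)
    """
    rows, cols = len(mask), len(mask[0])
    total_fg = sum(sum(row) for row in mask)
    best_cost, best_box = None, None
    for top in range(rows):
        for bottom in range(top, rows):
            for left in range(cols):
                for right in range(left, cols):
                    fg_inside = sum(mask[r][c]
                                    for r in range(top, bottom + 1)
                                    for c in range(left, right + 1))
                    area = (bottom - top + 1) * (right - left + 1)
                    cost = outside_weight * (total_fg - fg_inside) + (area - fg_inside)
                    if best_cost is None or cost < best_cost:
                        best_cost, best_box = cost, (top, left, bottom, right)
    return best_box, best_cost


# 6x6 segmentation mask: a solid 3x3 target plus one stray foreground pixel.
mask = [[0] * 6 for _ in range(6)]
for r in range(1, 4):
    for c in range(1, 4):
        mask[r][c] = 1
mask[5][5] = 1

box, cost = fit_box(mask)
print(box, cost)
```

With this cost the stray pixel is left outside the box, because enclosing it would pull in many background pixels. That trade-off between missed foreground and enclosed background is the reason to optimize a cost rather than take the tight bounding box of the whole segmentation.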

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)

• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty – better ground truth

• Each frame manually per-pixel segmented

• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters

• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)

2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)

2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN

• Most accurate: SSAT

• Most robust: C-COT and MLDF

Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (accuracy axis), with C-COT, TCNN, SSAT and MLDF labelled]

VOT 2016: Tracking speed

• Top-performers slowest

– Plausible cause: CNN

• Real-time bound: Staple+

– Decent accuracy

– Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure: tracking speed plot, with C-COT, TCNN, SSAT, MLDF and Staple+ labelled]

VOT public resources

• Resources publicly available: VOT page

• Raw results of all tested trackers

• Relevant methodology papers

• 2016: Submitted trackers code/binaries

• All fully annotated datasets (2013-2016)

• Documentation, tutorials, forum

Summary

• “Visual Tracking” may refer to quite different problems

• The area is just starting to be affected by CNNs

• Robustness at all levels is the road to reliable performance

• Key components of trackers

– target learning (modelling, “template update”)

– integration of detection and temporal smoothness assumptions

– representation of the image and target

• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)

• http://cs.brown.edu/courses/cs143

• https://www.udacity.com/course/introduction-to-computer-vision--ud810

• http://cs231n.stanford.edu

• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception

Department of Cybernetics

Faculty of Electrical Engineering

Czech Technical University Prague

established 1707

Area of Expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant % of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please?

What about deep learning or CNN for tracking?

Recap: CaffeNet architecture

AlexNet (original): Krizhevsky et al., “ImageNet Classification with Deep Convolutional Neural Networks”, 2012

CaffeNet: Jia et al., “Caffe: Convolutional Architecture for Fast Feature Embedding”, 2014

Image credit: Roberto Matheus Pinheiro Pereira, “Deep Learning Talk”

Srinivas et al., “A Taxonomy of Deep Convolutional Neural Nets for Computer Vision”, 2016

Recent trackers development

https://github.com/foolwood/benchmark_results

[Figure: overview of recent tracker development; labels: CNN, KCF]

Discriminative Tracking (Tracking by Detection)

• t = 0: samples with labels +1, +1, +1, −1, −1, −1 → train the classifier

• t > 0: … classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

“Cross-correlation is equivalent to an element-wise product in Fourier domain”

$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \iff \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$

• where

– $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)

– $\times$ is element-wise product

– $^{*}$ is complex-conjugate (i.e. negate imaginary part)

• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
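The statement above can be checked numerically on a short 1-D signal. The sketch below is an illustration and not part of the original slides: it assumes the cyclic cross-correlation is defined as y[i] = sum over n of x[n] * w[(n + i) mod N], and it uses a naive O(n^2) DFT helper (`dft`, a hypothetical name) so that it runs with the Python standard library only; a real tracker would use an FFT.

```python
import cmath


def dft(v, inverse=False):
    """Naive O(n^2) discrete Fourier transform, kept simple for clarity."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out


def circular_cross_correlation(x, w):
    """Direct definition: y[i] = sum_n x[n] * w[(n + i) mod N]."""
    n = len(x)
    return [sum(x[t] * w[(t + i) % n] for t in range(n)) for i in range(n)]


def cross_correlation_via_dft(x, w):
    """Same quantity via the theorem: y_hat = conj(x_hat) * w_hat, element-wise."""
    y_hat = [a.conjugate() * b for a, b in zip(dft(x), dft(w))]
    return [c.real for c in dft(y_hat, inverse=True)]


x = [1.0, 2.0, 0.0, -1.0]
w = [0.5, 0.0, 1.0, 2.0]
direct = circular_cross_correlation(x, w)
fourier = cross_correlation_via_dft(x, w)
print(direct)
print(fourier)
```

Both lists agree up to floating-point error. The wrap-around in `(t + i) % n` is the cyclic behaviour noted on the slide.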

Kernelized Correlation Filters

• Circulant matrices are a very general tool which allows standard operations to be replaced with fast Fourier operations

• The same idea can be applied, e.g., to Kernel Ridge Regression:

$\boldsymbol{\alpha} = (K + \lambda I)^{-1} \mathbf{y}$

with $K$ the kernel matrix, $K_{ij} = \kappa(\mathbf{x}_i, \mathbf{x}_j)$, and $\boldsymbol{\alpha}$ the dual space representation

• For many kernels, circulant data $\Rightarrow$ circulant $K$ matrix:

$K = C(\mathbf{k}^{\mathbf{xx}})$, where $\mathbf{k}^{\mathbf{xx}}$ is the kernel auto-correlation and the first row of $K$ (small, and easy to compute)

• Diagonalizing with the DFT for learning the classifier yields

$\hat{\boldsymbol{\alpha}} = \dfrac{\hat{\mathbf{y}}}{\hat{\mathbf{k}}^{\mathbf{xx}} + \lambda}$

Fast solution in $\mathcal{O}(n \log n)$. Typical kernel algorithms are $\mathcal{O}(n^2)$ or higher.
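To see that the Fourier-domain formula solves the same system as the dense one, the sketch below trains on a 1-D signal and checks that the result satisfies (K + λI) α = y. It is an illustration under stated assumptions: a linear kernel is used, so the kernel auto-correlation is the plain cyclic auto-correlation of x; the DFT is a naive O(n^2) helper to stay within the standard library; and all function names are hypothetical.

```python
import cmath


def dft(v, inverse=False):
    """Naive O(n^2) DFT; an FFT would give the O(n log n) cost quoted on the slide."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out


def linear_kernel_autocorrelation(x):
    """k[m] = <x, x cyclically shifted by m>: first row of the circulant K."""
    n = len(x)
    return [sum(x[t] * x[(t + m) % n] for t in range(n)) for m in range(n)]


def train_fourier(x, y, lam):
    """alpha_hat = y_hat / (k_hat + lambda), then back to the signal domain."""
    k_hat = dft(linear_kernel_autocorrelation(x))
    alpha_hat = [b / (a + lam) for a, b in zip(k_hat, dft(y))]
    return [c.real for c in dft(alpha_hat, inverse=True)]


def circulant(first_row):
    """Dense circulant matrix C(k) with entries K[i][j] = k[(j - i) mod n]."""
    n = len(first_row)
    return [[first_row[(j - i) % n] for j in range(n)] for i in range(n)]


x = [1.0, 3.0, -2.0, 0.5, 4.0]
y = [1.0, 0.5, 0.0, 0.0, 0.5]
lam = 0.1
alpha = train_fourier(x, y, lam)

# Check against the dense system without ever inverting K.
K = circulant(linear_kernel_autocorrelation(x))
n = len(x)
residual = [sum(K[i][j] * alpha[j] for j in range(n)) + lam * alpha[i] - y[i]
            for i in range(n)]
print(max(abs(r) for r in residual))
```

The residual is at the level of floating-point noise. The dense matrix is built here only for verification; the training step itself touches only vectors of length n.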

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$:

$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$

• For the Gaussian kernel it yields:

$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\dfrac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')\right)\right)$

(multiple channels can be concatenated to the vector $\mathbf{x}$ and then summed over in the $\mathcal{F}^{-1}$ term)

• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$:

1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$

2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}})$

• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$

• Kernel allows integration of more complex and multi-channel features
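The detection and update steps above can be traced on a 1-D toy signal. This is a sketch under stated assumptions, not the authors' implementation: signals are single-channel, the DFT is a naive O(n^2) helper so that only the standard library is needed, the training target y is a Gaussian peak at index 0, and the names `interpolate` and `rate` (the interpolation factor) are hypothetical.

```python
import cmath
import math


def dft(v, inverse=False):
    """Naive O(n^2) DFT, used in place of an FFT for self-containment."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
           for k in range(n)]
    return [c / n for c in out] if inverse else out


def gaussian_kernel_correlation(x, x2, sigma):
    """k = exp(-(|x|^2 + |x2|^2 - 2 c) / sigma^2), with c = F^-1(conj(x_hat) * x2_hat)."""
    c_hat = [a.conjugate() * b for a, b in zip(dft(x), dft(x2))]
    c = [v.real for v in dft(c_hat, inverse=True)]
    energy = sum(a * a for a in x) + sum(b * b for b in x2)
    # max(..., 0) guards against tiny negative distances from rounding.
    return [math.exp(-max(energy - 2.0 * cm, 0.0) / sigma ** 2) for cm in c]


def train(x, y, sigma, lam):
    """alpha_hat = y_hat / (k_hat^{xx} + lambda)."""
    k_hat = dft(gaussian_kernel_correlation(x, x, sigma))
    return [b / (a + lam) for a, b in zip(k_hat, dft(y))]


def detect(alpha_hat, x, z, sigma):
    """f(z) = F^-1(k_hat^{xz} * alpha_hat): response for every cyclic shift."""
    k_hat = dft(gaussian_kernel_correlation(x, z, sigma))
    return [v.real for v in dft([a * b for a, b in zip(k_hat, alpha_hat)], inverse=True)]


def interpolate(old, new, rate):
    """Linear interpolation used to update both the model and the classifier."""
    return [(1.0 - rate) * o + rate * nw for o, nw in zip(old, new)]


n = 16
x = [0.0, 0.2, 0.9, 1.0, 0.3, 0.1, 0.0, 0.5, 0.8, 0.2, 0.0, 0.1, 0.7, 0.4, 0.1, 0.0]
y = [math.exp(-min(i, n - i) ** 2 / (2 * 1.5 ** 2)) for i in range(n)]
sigma, lam, rate = 0.5, 1e-4, 0.02

alpha_hat = train(x, y, sigma, lam)

shift = 3
z = [x[(i - shift) % n] for i in range(n)]          # target moved by 3 samples
response = detect(alpha_hat, x, z, sigma)
peak = max(range(n), key=lambda i: response[i])     # location of maximum response

# Re-centre the new patch on the detected location, then update model and classifier.
patch = [z[(i + peak) % n] for i in range(n)]
model_x = interpolate(x, patch, rate)
alpha_hat = interpolate(alpha_hat, train(patch, y, sigma, lam), rate)
print(peak)
```

With this argument order (model first, new patch second) the response peaks at the index equal to the displacement of the target. Swapping the arguments of the kernel correlation mirrors the peak to the opposite cyclic shift, so a concrete implementation has to fix one convention and keep it throughout.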

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters

• code fits on one slide of the presentation

• Use HoG features (32 channels)

• ~300 FPS

• Open-Source (Matlab/Python/Java/C)

Training and detection (Matlab)

    function alphaf = train(x, y, sigma, lambda)
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      % sum over the channel dimension (3) in the kernel computation
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end



AlexNet (original)Krizhevsky etal ImageNet Classification with Deep Convolutional Neural Networks 2012CaffeNet Jia etal Caffe Convolutional Architecture for Fast Feature Embedding 2014 Image credit Roberto Matheus Pinheiro Pereira ldquoDeep Learning Talkrdquo

Srinivas etal ldquoA Taxonomy of Deep Convolutional Neural Nets for Computer Visionrdquo 2016

Recap CaffeNet architecture

63

Recent trackers development

74150httpsgithubcomfoolwoodbenchmark_results

CNN KCF

Discriminative Tracking (T by Detection)

t=0

samples

labels+1 +1 +1 minus1 minus1 minus1

Classifier

tgt0

hellipClassify subwindows to find target

75

Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker
• very few hyperparameters
• code fits on one slide of the presentation
• uses HoG features (32 channels)
• ~300 FPS
• open-source (Matlab/Python/Java/C)

    function alphaf = train(x, y, sigma, lambda)
      % Kernel ridge regression solved element-wise in the Fourier domain.
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      % Response map over all cyclic shifts of the test patch z.
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      % Gaussian kernel correlation; the sum over dimension 3 merges the feature channels.
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end

Training and detection (Matlab)

Sum over the channel dimension in the kernel computation
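The functions above cover training and detection; the remaining piece of the tracking loop is the model update by linear interpolation. Below is a minimal 1-D Python sketch of one train / detect / update cycle. Everything specific in it is an assumption for illustration: the signal, the label shape, `sigma`, `lam`, the interpolation rate of 0.02, and the argument order of `kernel_correlation`, which is chosen so that the response peaks directly at the displacement.

```python
import cmath
import math

def dft(v, inverse=False):
    """Naive O(n^2) DFT; adequate for tiny illustrative inputs."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [c / n for c in out] if inverse else out

def kernel_correlation(x, z, sigma):
    """Gaussian kernel between z and every cyclic shift of x, via the DFT."""
    cross = dft([a.conjugate() * b for a, b in zip(dft(x), dft(z))], inverse=True)
    energy = sum(v * v for v in x) + sum(v * v for v in z)
    return [math.exp(-max(energy - 2.0 * c.real, 0.0) / sigma ** 2) for c in cross]

def train(x, y, sigma, lam):
    k_hat = dft(kernel_correlation(x, x, sigma))
    return [yh / (kh + lam) for yh, kh in zip(dft(y), k_hat)]

def detect(alpha_hat, x, z, sigma):
    k_hat = dft(kernel_correlation(x, z, sigma))
    return [c.real for c in dft([a * k for a, k in zip(alpha_hat, k_hat)], inverse=True)]

def interpolate(old, new, rate):
    """Linear interpolation between the running model and the new estimate."""
    return [(1.0 - rate) * o + rate * v for o, v in zip(old, new)]

# Illustrative setup (assumption): labels peak at zero displacement.
x = [0.0, 1.0, 3.0, 0.5, -2.0, 1.5, -1.0, 2.5]
y = [1.0, 0.6, 0.1, 0.0, 0.0, 0.0, 0.1, 0.6]
sigma, lam, rate = 2.0, 0.01, 0.02
n, true_shift = len(x), 3

alpha_hat = train(x, y, sigma, lam)
z = [x[(i - true_shift) % n] for i in range(n)]      # next frame: target moved by 3
response = detect(alpha_hat, x, z, sigma)
found_shift = response.index(max(response))

# Re-centre the new patch on the detection, then blend it into the model.
patch = [z[(i + found_shift) % n] for i in range(n)]
x = interpolate(x, patch, rate)
alpha_hat = interpolate(alpha_hat, train(patch, y, sigma, lam), rate)
```

A small interpolation rate makes the model adapt slowly, which trades responsiveness to appearance change against drift; the right value depends on the features and the sequence.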

From KCF to Discriminative CF trackers

Basic
• Henriques et al. - CSK
  - raw grayscale pixel values as features
• Henriques et al. - KCF
  - HoG multi-channel features

Further work
• Danelljan et al. - DSST
  - PCA-HoG + grayscale pixel features
  - filters for translation and for scale (in the scale-space pyramid)
• Li et al. - SAMF
  - HoG, color-naming and grayscale pixel features
  - quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size
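Both scale-aware trackers search over a discrete set of candidate scales. A minimal sketch of such a scale ladder follows; the number of levels (33), the step (1.02) and the scoring function are assumptions chosen for illustration, not values given on the slide, and the score is a stand-in for the correlation response a real tracker would compute at each scale.

```python
import math

def scale_factors(num_scales=33, step=1.02):
    """Geometric ladder of scale factors centred on 1.0 (no change)."""
    half = (num_scales - 1) // 2
    return [step ** n for n in range(-half, half + 1)]

def best_scale(score_at_scale, factors):
    """Return the candidate factor with the highest score."""
    return max(factors, key=score_at_scale)

# Hypothetical score: highest when the candidate matches the true scale change.
true_scale = 1.02 ** 3
factors = scale_factors()
chosen = best_scale(lambda s: -abs(math.log(s / true_scale)), factors)
```

In a DSST-style design the score would come from a separate 1-D scale filter, while a SAMF-style design would resize each candidate patch to a common size and reuse the single translation filter.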

Discriminative Correlation Filters Trackers

• Danelljan et al. - SRDCF
  - spatial regularization in the learning process
    → limits the boundary effect
    → penalizes filter coefficients depending on their spatial location
  - allows the use of a much larger search region
  - more discriminative to background (more training data)
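The idea of a location-dependent penalty can be illustrated with a small sketch. It assumes a quadratic weight map that is smallest at the centre of the search region and grows towards the borders; the functional form and the constants `mu` and `eta` are assumptions for illustration, and the actual SRDCF formulation may differ in detail.

```python
def spatial_weights(height, width, target_h, target_w, mu=0.1, eta=3.0):
    """Penalty weights: small near the target centre, large near the borders."""
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    return [[mu + eta * ((r - cy) / target_h) ** 2 + eta * ((c - cx) / target_w) ** 2
             for c in range(width)]
            for r in range(height)]

def regularization_penalty(filt, weights):
    """Sum of squared, spatially weighted filter coefficients."""
    return sum((w * f) ** 2
               for w_row, f_row in zip(weights, filt)
               for w, f in zip(w_row, f_row))

# 9x9 search region around a 3x3 target (illustrative sizes).
weights = spatial_weights(9, 9, target_h=3, target_w=3)
centre_filter = [[1.0 if (r, c) == (4, 4) else 0.0 for c in range(9)] for r in range(9)]
corner_filter = [[1.0 if (r, c) == (0, 0) else 0.0 for c in range(9)] for r in range(9)]
centre_penalty = regularization_penalty(centre_filter, weights)
corner_penalty = regularization_penalty(corner_filter, weights)
```

A coefficient of the same magnitude costs far more in the corner than at the centre, so the learned filter concentrates its energy on the target even when the search region is large.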

CNN-based Correlation Trackers

• Danelljan et al. - Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
  - features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
  - for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al. - MDNet, T-CNN
  - CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? "The zero-order tracker"

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

[Figure: plot with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked; axis ticks at 1500 and 3000]

• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• plus two VOT-TIR papers (69 and 70 coauthors)

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

| Challenge | Perf. measures | Dataset size | Target box | Property | Trackers tested |
| VOT2013 | ranks, A, R | 16, s. manual | manual | per frame | 27 |
| VOT2014 | ranks, A, R, EFO | 25, s. manual | manual | per frame | 38 |
| VOT2015 | EAO, A, R, EFO | 60, fully auto | manual | per frame | 62 VOT, 24 VOT-TIR |
| VOT2016 | EAO, A, R, EFO | 60, fully auto | auto | per frame | 70 VOT, 24 VOT-TIR |

A: accuracy, R: robustness, EAO: expected average overlap, EFO: equivalent filter operations.
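The accuracy (A) and robustness (R) measures build on the overlap between the predicted and the ground-truth box. The sketch below is a simplified illustration, assuming axis-aligned boxes given as (x, y, width, height), a failure counted whenever the overlap drops to zero, and accuracy averaged over the remaining frames. The re-initialization and burn-in handling of the real evaluation toolkit is not modelled.

```python
def iou(a, b):
    """Intersection over union of two (x, y, width, height) boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over tracked frames, and the number of zero-overlap frames."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    failures = sum(1 for o in overlaps if o == 0.0)
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures

# Illustrative three-frame sequence: perfect, partial, then lost.
ground_truth = [(0, 0, 10, 10), (2, 0, 10, 10), (4, 0, 10, 10)]
predicted = [(0, 0, 10, 10), (7, 0, 10, 10), (30, 0, 10, 10)]
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
```

Reporting the two numbers separately keeps a tracker that is precise but fragile distinguishable from one that is coarse but rarely fails.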

Class of trackers tested

• Single-object, single-camera
• Short-term
  - trackers performing without re-detection
• Causality
  - the tracker is not allowed to use any future frames
• No prior knowledge about the target
  - only a single training example: the bounding box in the first frame
• Object state encoded by a bounding box
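These constraints translate naturally into a two-method interface: initialise once from the first frame and its box, then report one box per frame without ever seeing future frames. The following sketch is a hypothetical harness, not the interface of the VOT toolkit; `StaticTracker` is a deliberately trivial placeholder that always repeats the initial box.

```python
class StaticTracker:
    """Hypothetical placeholder tracker: always reports the initial box."""

    def init(self, frame, box):
        # The only supervision: one bounding box in the first frame.
        self.box = box

    def update(self, frame):
        # A real tracker would localise the target in `frame` here.
        return self.box

def run_short_term(tracker, frames, first_box):
    """Causal loop: frame t is handed over only after frame t-1 was answered."""
    tracker.init(frames[0], first_box)
    outputs = [first_box]
    for frame in frames[1:]:
        outputs.append(tracker.update(frame))
    return outputs

frames = ["frame0", "frame1", "frame2"]     # stand-ins for image data
boxes = run_short_term(StaticTracker(), frames, (10, 20, 30, 40))
```

Any tracker discussed in this lecture could be dropped in behind the same two methods, which is what makes a common benchmark possible.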

Construction (1/3): Sequence candidates

Candidate pool:
• ALOV (315 seq.) [Smeulders et al. 2013]
• + PTR (~50 seq.) [Vojir et al. 2013]
• + OTB (~50 seq.) [Wu et al. 2013]
• + >30 new sequences from the VOT2015 committee
= 443 sequences

Filtered out:
• grayscale sequences
• targets of <400 pixels
• poorly-defined targets
• artificially created sequences

[Figure: example of a poorly defined target; example of an artificially created sequence]

→ 356 sequences → VOT Automatic Dataset Construction Protocol: cluster + sample

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)

Feature encoding (11 dim.) → Affinity Propagation [Frey, Dueck 2007] → clusters of similar sequences

Construction (3/3): Sampling

• Requirement:
  - diverse visual attributes
  - challenging subset
• Global visual attributes computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
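One simple way to satisfy both requirements is to visit the clusters in turn and take the hardest remaining sequence from each. The sketch below illustrates that idea only; it is an assumption and not the actual VOT sampling protocol, and the sequence names and difficulty scores are made up.

```python
def sample_sequences(clusters, difficulty, total):
    """Round-robin over clusters, taking the hardest remaining sequence from each."""
    queues = [sorted(members, key=lambda s: difficulty[s], reverse=True)
              for members in clusters]
    chosen = []
    while len(chosen) < total and any(queues):
        for queue in queues:
            if queue and len(chosen) < total:
                chosen.append(queue.pop(0))
    return chosen

# Hypothetical clusters of sequences and per-sequence difficulty scores.
clusters = [["a", "b", "c"], ["d", "e"], ["f"]]
difficulty = {"a": 0.9, "b": 0.2, "c": 0.5, "d": 0.4, "e": 0.8, "f": 0.1}
picked = sample_sequences(clusters, difficulty, total=4)
```

Every cluster contributes before any cluster contributes twice, which keeps the attributes diverse, while the ordering inside each cluster favours difficult sequences.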

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement
  1. Segment the target (semi-automatic)
  2. Automatically fit a bounding box by optimizing a cost function
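Step 2 can be sketched as a brute-force search over axis-aligned boxes on a small binary mask. The cost used here, background pixels inside the box plus weighted foreground pixels outside it, is an assumption chosen for illustration; the cost function actually used for the VOT annotations is not given on the slide.

```python
def fit_box(mask, outside_weight=1.0):
    """Return (x, y, width, height) of the box minimising the illustrative cost."""
    rows, cols = len(mask), len(mask[0])
    total_fg = sum(map(sum, mask))
    best = None
    for top in range(rows):
        for bottom in range(top, rows):
            for left in range(cols):
                for right in range(left, cols):
                    fg_inside = sum(mask[r][c]
                                    for r in range(top, bottom + 1)
                                    for c in range(left, right + 1))
                    area = (bottom - top + 1) * (right - left + 1)
                    # Background swallowed by the box + foreground left outside it.
                    cost = (area - fg_inside) + outside_weight * (total_fg - fg_inside)
                    if best is None or cost < best[0]:
                        best = (cost, (left, top, right - left + 1, bottom - top + 1))
    return best[1]

# A 3x3 blob with one protruding pixel (illustrative segmentation mask).
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
balanced_box = fit_box(mask)                    # drops the protruding pixel
tight_box = fit_box(mask, outside_weight=3.0)   # keeps every foreground pixel
```

With equal weights the box gives up the single protruding pixel to avoid including two background pixels; weighting missed foreground more heavily recovers the tight enclosing box.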

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty: better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

• 2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
• 2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
• 2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF

(1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (accuracy axis) with C-COT, TCNN, SSAT and MLDF labelled]

VOT 2016 Tracking speed

• Top performers are the slowest
  - plausible cause: CNN
• Real-time bound: Staple+
• Decent accuracy
• Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ labelled]

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers
  - target learning (modelling, "template update")
  - integration of detection and temporal smoothness assumptions
  - representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague (established 1707)

Area of expertise: Computer Vision, Image Processing, Pattern Recognition

People:
• 12 academics: 3 full, 2 associate, 7 assistant professors
• 15 researchers
• 15 PhD students
• 15 MSc and BSc students

Impact & Excellence:
• significant % of funding (>75%) from EU and private companies
• collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell)
• numerous science prizes
• hundreds of ISI citations to our publications per year
• competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please

Recent trackers development

https://github.com/foolwood/benchmark_results

[Figure labels: CNN, KCF]

Discriminative Tracking (Tracking by Detection)

• t = 0: samples with labels +1, +1, +1, -1, -1, -1 → Classifier
• t > 0: classify subwindows to find the target

Connection to Correlation

The Convolution Theorem

"Cross-correlation is equivalent to an element-wise product in the Fourier domain"

$$\mathbf{y} = \mathbf{x} \circledast \mathbf{w} \quad\Longleftrightarrow\quad \hat{\mathbf{y}} = \hat{\mathbf{x}}^{*} \times \hat{\mathbf{w}}$$

• where
  - $\hat{\mathbf{y}} = \mathcal{F}(\mathbf{y})$ is the Discrete Fourier Transform (DFT) of $\mathbf{y}$ (likewise for $\hat{\mathbf{x}}$ and $\hat{\mathbf{w}}$)
  - $\times$ is the element-wise product
  - $*$ is the complex conjugate (i.e. negate the imaginary part)
• Note that cross-correlation and the DFT are cyclic (the window wraps at the image edges)
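The theorem is easy to verify on a short signal. The sketch below computes the cyclic cross-correlation once from its definition and once through the DFT, using a naive transform written with the standard library only; the function names and the two test vectors are illustrative assumptions.

```python
import cmath

def dft(v, inverse=False):
    """Naive O(n^2) DFT; adequate for tiny illustrative inputs."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n) for t in range(n))
           for f in range(n)]
    return [c / n for c in out] if inverse else out

def circular_cross_correlation(x, w):
    """Definition: y[s] = sum_t x[t] * w[(t + s) mod n], indices wrap around."""
    n = len(x)
    return [sum(x[t] * w[(t + s) % n] for t in range(n)) for s in range(n)]

def cross_correlation_via_dft(x, w):
    """Same result from y_hat = conj(x_hat) * w_hat, element by element."""
    y_hat = [a.conjugate() * b for a, b in zip(dft(x), dft(w))]
    return [c.real for c in dft(y_hat, inverse=True)]

x = [1.0, 2.0, 0.0, -1.0]
w = [0.5, 0.0, 1.0, 3.0]
direct = circular_cross_correlation(x, w)
fast = cross_correlation_via_dft(x, w)
```

With a real FFT in place of the naive transform, the second route costs O(n log n) instead of O(n^2), which is what the correlation-filter trackers exploit.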



Connection to Correlation

The Convolution Theorem

ldquoCross-correlation is equivalent to an

element-wise product in Fourier domainrdquo

bull where

ndash ො119858 = ℱ(119858) is the Discrete Fourier Transform (DFT) of 119858

(likewise for ො119857 and ෝ119856)

ndashtimes is element-wise product

ndash lowast is complex-conjugate (ie negate imaginary part)

119858 = 119857⊛119856 ො119858 = ො119857lowast times ෝ119856⟺

bull Note that cross-correlation and the DFT are cyclic(the window wraps at the image edges)

76

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

Kernelized Correlation Filters

• $\mathbf{k}^{\mathbf{xx}'}$ is the kernel correlation of two vectors $\mathbf{x}$ and $\mathbf{x}'$
• For the Gaussian kernel it yields
$\mathbf{k}^{\mathbf{xx}'} = \exp\left(-\frac{1}{\sigma^2}\left(\|\mathbf{x}\|^2 + \|\mathbf{x}'\|^2 - 2\,\mathcal{F}^{-1}\left(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}'\right)\right)\right)$
• Evaluation on subwindows of image $\mathbf{z}$ with classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$
1. $K^{\mathbf{z}} = C(\mathbf{k}^{\mathbf{xz}})$
2. $\mathbf{f}(\mathbf{z}) = \mathcal{F}^{-1}\left(\hat{\mathbf{k}}^{\mathbf{xz}} \odot \hat{\boldsymbol{\alpha}}\right)$
• Update classifier $\boldsymbol{\alpha}$ and model $\mathbf{x}$ by linear interpolation from the location of maximum response $\mathbf{f}(\mathbf{z})$
• Kernel allows integration of more complex and multi-channel features
$k_i^{\mathbf{xx}'} = \kappa(\mathbf{x}', P^{i-1}\mathbf{x})$
Multiple channels can be concatenated into the vector $\mathbf{x}$ and then summed over in the $\mathcal{F}^{-1}(\hat{\mathbf{x}}^{*} \odot \hat{\mathbf{x}}')$ term
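The steps above (Gaussian kernel correlation, detection, model update) can be sketched in 1-D. This is an illustration, not the authors' code: the signal, the parameter values (`sigma`, `lam`, `rate`) and the re-centering step are assumptions made here, the kernel is additionally divided by the number of elements, and a naive DFT replaces the FFT.

```python
import cmath
import math


def dft(values, inverse=False):
    """Naive O(n^2) DFT; stands in for an FFT to keep the sketch dependency-free."""
    n = len(values)
    sign = 1.0 if inverse else -1.0
    out = []
    for f in range(n):
        acc = 0j
        for t in range(n):
            acc += values[t] * cmath.exp(sign * 2j * cmath.pi * f * t / n)
        out.append(acc / n if inverse else acc)
    return out


def kernel_correlation(x1, x2, sigma):
    """Gaussian kernel between x1 and every cyclic shift of x2 (one channel)."""
    n = len(x1)
    cross = dft([a.conjugate() * b for a, b in zip(dft(x1), dft(x2))], inverse=True)
    energy = sum(v * v for v in x1) + sum(v * v for v in x2)
    # abs() guards against tiny negative distances caused by rounding.
    return [math.exp(-abs(energy - 2.0 * c.real) / (sigma ** 2 * n)) for c in cross]


def train(x, y, sigma, lam):
    """Classifier coefficients, kept in the Fourier domain."""
    k_hat = dft(kernel_correlation(x, x, sigma))
    return [y_hat / (k + lam) for y_hat, k in zip(dft(y), k_hat)]


def detect(alphaf, x, z, sigma):
    """Response of the classifier on all cyclic shifts of the new patch z."""
    k_hat = dft(kernel_correlation(z, x, sigma))
    return [c.real for c in dft([a * k for a, k in zip(alphaf, k_hat)], inverse=True)]


n, sigma, lam, rate = 16, 0.5, 1e-4, 0.02       # hypothetical parameter values
x = [0, 0, 1, 3, 5, 3, 1, 0, 0, 0, 0.5, 0, 0, 0, 0, 0]                # model patch
y = [math.exp(-0.5 * (min(i, n - i) / 1.5) ** 2) for i in range(n)]   # peak at shift 0
alphaf = train(x, y, sigma, lam)

true_shift = 3                                   # target moved 3 samples, slightly darker
z = [0.9 * x[(t - true_shift) % n] for t in range(n)]

response = detect(alphaf, x, z, sigma)
peak = max(range(n), key=lambda m: response[m])
# With this argument order, index m scores z cyclically shifted by m,
# so the estimated displacement of the target is (-peak) mod n.
shift = (-peak) % n

# Re-center the new patch, retrain, and blend old and new by linear interpolation.
new_x = [z[(t + shift) % n] for t in range(n)]
new_alphaf = train(new_x, y, sigma, lam)
x = [(1 - rate) * a + rate * b for a, b in zip(x, new_x)]
alphaf = [(1 - rate) * a + rate * b for a, b in zip(alphaf, new_alphaf)]
print(shift)
```

A small interpolation rate keeps the model stable against occlusions at the price of slower adaptation to appearance changes.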

Kernelized Correlation Filters

KCF Tracker

• very few hyperparameters
• code fits on one slide of the presentation
• Use HoG features (32 channels)
• ~300 FPS
• Open-Source (Matlab/Python/Java/C)

Training and detection (Matlab)

    function alphaf = train(x, y, sigma, lambda)
      % x: training patch, y: regression targets, both of the same spatial size
      k = kernel_correlation(x, x, sigma);
      alphaf = fft2(y) ./ (fft2(k) + lambda);
    end

    function y = detect(alphaf, x, z, sigma)
      % z: new patch; the maximum of y gives the target displacement
      k = kernel_correlation(z, x, sigma);
      y = real(ifft2(alphaf .* fft2(k)));
    end

    function k = kernel_correlation(x1, x2, sigma)
      % sum over the channel dimension (3) in the kernel computation
      c = ifft2(sum(conj(fft2(x1)) .* fft2(x2), 3));
      d = x1(:)'*x1(:) + x2(:)'*x2(:) - 2 * c;
      k = exp(-1 / sigma^2 * abs(d) / numel(d));
    end

From KCF to Discriminative CF trackers

Basic
• Henriques et al. – CSK
– raw grayscale pixel values as features
• Henriques et al. – KCF
– HoG multi-channel features

Further work
• Danelljan et al. – DSST
– PCA-HoG + grayscale pixel features
– filters for translation and for scale (in the scale-space pyramid)
• Li et al. – SAMF
– HoG, color-naming and grayscale pixel features
– quantize the scale space and normalize each scale to one size by bilinear interpolation → only one filter on the normalized size

Discriminative Correlation Filters Trackers

• Danelljan et al. – SRDCF
– spatial regularization in the learning process
→ limits boundary effect
→ penalize filter coefficients depending on their spatial location
– allows the use of a much larger search region
– more discriminative to background (more training data)
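The idea of penalizing filter coefficients by location can be written as a single objective. The form below is a sketch in the spirit of the SRDCF formulation, with notation chosen here rather than taken from the slides: $x_k^l$ is channel $l$ of training sample $k$, $y_k$ the desired response, $f^l$ the filter for channel $l$, $\gamma_k$ a per-sample weight, and $w$ a spatial weight map that grows with the distance from the target centre.

```latex
\varepsilon(f) \;=\; \sum_{k=1}^{t} \gamma_k
  \left\| \sum_{l=1}^{d} x_k^{l} * f^{l} - y_k \right\|^{2}
  \;+\; \sum_{l=1}^{d} \left\| w \cdot f^{l} \right\|^{2}
```

With a constant $w$ the second term reduces to the usual ridge penalty; a spatially varying $w$ pushes filter energy towards the target region, which is what makes the larger search region usable.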

CNN-based Correlation Trackers

• Danelljan et al. – Deep SRDCF, C-COT (best performance in VOT 2016)
• Ma et al.
– features: VGG-Net pretrained on the ImageNet dataset, extracted from the third, fourth and fifth convolution layers
– for each feature, learn a linear correlation filter

CNN-based Trackers (not correlation based)

• Nam et al. – MDNet, T-CNN
– CNN classification (3 convolution layers and 2 fully connected layers)

Evaluation of Trackers

Tracking: Which methods work?

What works? "The zero-order tracker"

Matej Kristan, matej.kristan@fri.uni-lj.si, DPAEV Workshop, ECCV 2016

VOT community evolution

• ICCV 2013: 51 coauthors, 14 pages
• ECCV 2014: 57 coauthors, 27 pages
• ICCV 2015: 128 coauthors, 24 pages
• ECCV 2016: 141 coauthors, 44 pages
• + VOT-TIR papers (69 and 70 coauthors)

[Figure: timeline marking the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines]

VOT challenge evolution

• Gradual increase of dataset size
• Gradual refinement of dataset construction
• Gradual refinement of performance measures
• Gradual increase of tested trackers

        | Perf. measures   | Dataset size   | Target box | Property  | Trackers tested
VOT2013 | ranks, A, R      | 16, s. manual  | manual     | per frame | 27
VOT2014 | ranks, A, R, EFO | 25, s. manual  | manual     | per frame | 38
VOT2015 | EAO, A, R, EFO   | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR
VOT2016 | EAO, A, R, EFO   | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
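A and R in the table stand for accuracy and robustness. As a rough illustration only: accuracy is taken here as the mean bounding-box overlap on frames where the tracker still overlaps the target, and robustness as the number of frames with zero overlap. The sketch assumes axis-aligned boxes given as (x, y, width, height) and skips the re-initialization after a failure used in the actual protocol; function names and data are made up.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap over tracked frames, and the count of zero-overlap frames."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    failures = sum(1 for o in overlaps if o == 0.0)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


# Three illustrative frames: perfect, partly off, completely lost.
ground_truth = [(10, 10, 20, 20), (12, 10, 20, 20), (14, 10, 20, 20)]
predicted = [(10, 10, 20, 20), (17, 10, 20, 20), (60, 60, 20, 20)]
accuracy, failures = accuracy_and_failures(predicted, ground_truth)
print(accuracy, failures)
```

Reporting the two numbers separately keeps a tracker that is precise but fragile apart from one that is sloppy but rarely loses the target.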

Class of trackers tested

• Single-object, single-camera
• Short-term
– Trackers performing without re-detection
• Causality
– Tracker is not allowed to use any future frames
• No prior knowledge about the target
– Only a single training example: BBox in the first frame
• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

• Candidate pool: 443 sequences
– ALOV (315 seq) [Smeulders et al. 2013]
– PTR (~50 seq) [Vojir et al. 2013]
– OTB (~50 seq) [Wu et al. 2013]
– >30 new sequences from the VOT2015 committee
• Filtered out
– Grayscale sequences
– <400 pixels targets
– Poorly-defined targets
– Artificially created sequences
• 356 sequences remain
• VOT Automatic Dataset Construction Protocol: cluster + sample

[Figure: example of a poorly defined target; example of an artificially created sequence]

Construction (2/3): Clustering

• Approximately annotate targets
• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)
• Cluster into K = 28 groups (automatic selection of K)

[Figure: feature encoding (11 dim) → Affinity Propagation [Frey, Dueck 2007] → cluster similar sequences]

Construction (3/3): Sampling

• Requirement
– Diverse visual attributes
– Challenging subset
• Global visual attributes: computed
• Tracking difficulty attribute: applied FoT, ASMS, KCF trackers
• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement
1. Segment the target (semi-automatic)
2. Automatically fit a bounding box by optimizing a cost function
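Step 2 can be illustrated with a brute-force search over a small binary mask. The slides do not state the cost function, so the one used here is an assumption: a weighted count of target pixels left outside the box plus the number of background pixels inside it. The sketch is restricted to axis-aligned boxes; `fit_box` and `outside_weight` are hypothetical names.

```python
def fit_box(mask, outside_weight=1.0):
    """Return ((top, left, bottom, right), cost), with inclusive box coordinates.

    Assumed cost: outside_weight * (target pixels outside the box)
                  + (background pixels inside the box).
    """
    rows, cols = len(mask), len(mask[0])
    # integral[r][c] = number of target pixels in mask[0:r][0:c]
    integral = [[0] * (cols + 1) for _ in range(rows + 1)]
    for r in range(rows):
        for c in range(cols):
            integral[r + 1][c + 1] = (mask[r][c] + integral[r][c + 1]
                                      + integral[r + 1][c] - integral[r][c])
    total = integral[rows][cols]
    best, best_cost = None, float("inf")
    for top in range(rows):
        for bottom in range(top, rows):
            for left in range(cols):
                for right in range(left, cols):
                    inside = (integral[bottom + 1][right + 1] - integral[top][right + 1]
                              - integral[bottom + 1][left] + integral[top][left])
                    area = (bottom - top + 1) * (right - left + 1)
                    cost = outside_weight * (total - inside) + (area - inside)
                    if cost < best_cost:
                        best, best_cost = (top, left, bottom, right), cost
    return best, best_cost


# Illustrative segmentation: a blob with one protruding pixel and one stray pixel.
mask = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 0, 1],
    [0, 0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
tight_box, tight_cost = fit_box(mask, outside_weight=1.0)
loose_box, loose_cost = fit_box(mask, outside_weight=4.0)
print(tight_box, loose_box)
```

Raising `outside_weight` makes the box grow to cover protruding parts of the target, while a low weight favours a tight box that ignores thin structures and segmentation noise.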

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix ($A_f = 0.33$, $M_f = 57$), Rabbit ($A_f = 0.31$, $M_f = 43$), Butterfly ($A_f = 0.22$, $M_f = 45$)
• Among the easiest sequences: Singer1 ($A_f = 0.02$, $M_f = 4$), Octopus ($A_f = 0.01$, $M_f = 5$), Sheep ($A_f = 0.02$, $M_f = 15$)

Main novelty – better ground truth
• Each frame manually per-pixel segmented
• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters
• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)
2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)
2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016 Results

• C-COT slightly ahead of TCNN
• Most accurate: SSAT
• Most robust: C-COT and MLDF
• Ranking: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot (vertical axis: accuracy)]

VOT 2016: Tracking speed

• Top performers are the slowest
– Plausible cause: CNN
• Real-time bound: Staple+
– Decent accuracy
– Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

VOT public resources

• Resources publicly available: VOT page
• Raw results of all tested trackers
• Relevant methodology papers
• 2016: Submitted trackers code/binaries
• All fully annotated datasets (2013-2016)
• Documentation, tutorials, forum

Summary

• "Visual Tracking" may refer to quite different problems
• The area is just starting to be affected by CNNs
• Robustness at all levels is the road to reliable performance
• Key components of trackers
– target learning (modelling, "template update")
– integration of detection and temporal smoothness assumptions
– representation of the image and target
• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)
• http://cs.brown.edu/courses/cs143
• https://www.udacity.com/course/introduction-to-computer-vision--ud810
• http://cs231n.stanford.edu
• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception
Department of Cybernetics
Faculty of Electrical Engineering
Czech Technical University, Prague
established 1707

Area of Expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please?

Kernelized Correlation Filters

bull Circulant matrices are a very general tool which allows to replace

standard operations with fast Fourier operations

bull The same idea can by applied eg to the Kernel Ridge Regression

with K kernel matrix Kij = (xi xj) and dual space representation

bull For many kernels circulant data circulant K matrix

bull Diagonalizing with the DFT for learning the classifier yields

120630 = 119870 + 120582119868 minus1119858

119870 = 119862(119844119857119857) where 119844119857119857 is kernel auto-correlaton and the first row of 119870 (small and easy to compute)

ෝ120630 =ො119858

መ119844119857119857+ 120582

Fast solution in 119978 119899 log 119899 Typical kernel algorithms are

119978 1198992 or higher

rArr

77

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic

bull Henriques et al ndash CSK

ndash raw grayscale pixel values as features

bull Henriques et al ndash KCF

ndash HoG multi-channel features

Further work

bull Danelljan et al ndash DSST

ndash PCA-HoG + grayscale pixels features

ndash filters for translation and for scale (in the scale-space pyramid)

bull Li et al ndash SAMF

ndash HoG color-naming and grayscale pixels features

ndash quantize scale space and normalize each scale to one size by bilinear

inter rarr only one filter on normalized size

80

Discriminative Correlation Filters Trackers

bull Danelljan et al ndashSRDCF

ndash spatial regularization in the learning process

rarr limits boundary effect

rarr penalize filter coefficients depending on their spatial location

ndash allows to use much larger search region

ndash more discriminative to background (more training data)

CNN-based Correlation Trackers

bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT

2016)

bull Ma et al

ndash features VGG-Net pretrained on ImageNet dataset extracted from

third fourth and fifth convolution layer

ndash for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)

bull Nam et al ndash MDNet T-CNN

ndash CNN classification (3 convolution layers and 2 fully connected layers)

81

Evaluation of Trackers

Tracking Which methods work

83150

Tracking Which methods work

84150

What works ldquoThe zero-order trackerrdquo

85150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

3000

1500

51 Coauthors 14pgs

ICCV2013

57 Coauthors 27pgs

ECCV2014

128 Coauthors 24pgs

ICCV2015

141 Coauthors 44pgs

ECCV2016

+ VOT-TIRpaper

(69 coauth)

+ VOT-TIRpaper

(70 coauth)

VOT2014submission deadline

VOT2015submission deadline

VOT2016submission deadline

VOT2013submission deadline

8639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT challenge evolution

bull Gradual increase of dataset size

bull Gradual refinement of dataset construction

bull Gradual refinement of performance measures

bull Gradual increase of tested trackers

Perf Measures Dataset size Target box Property Trackers tested

VOT2013 ranks A R 16 s manual manual per frame 27

VOT2014 ranks A R EFO 25 s manual manual per frame 38

VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR

VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR

8739

Class of trackers tested

bull Single-object single-camera

bull Short-term

ndashTrackers performing without re-detection

bull Causality

ndashTracker is not allowed to use any future frames

bull No prior knowledge about the target

ndashOnly a single training example ndash BBox in the first frame

bull Object state encoded by a bounding box

88150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (13) Sequence candidates

ALOV (315 seq)[Smeulders et al2013]

Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences

Example Poorly defined target Example Artificially created

356 sequences

PTR (~50 seq)[Vojir et al2013]

+OTB (~50 seq)

[Wu et al2013]

+

gt30 new sequencesfrom VOT2015

committee

+

443sequences

VOT Automatic Dataset Construction Protocol

cluster + sample

8939

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (23) Clustering

bull Approximately annotate targets

bull 11 global attributes estimated

automatically for 356 sequences (eg blur camera motion object motion)

bull Cluster into K = 28 groups (automatic selection of K)

Feature encoding

11 dim

Affinity Propagation [Frey Dueck 2007]

Cluster similar sequences

9039

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (33) Sampling

bull Requirement

bull Diverse visual attributes

bull Challenging subset

bull Global visual attributes computed

bull Tracking difficulty attribute Applied FoT ASMS KCF trackers

bull Developed a sampling strategy that sampled

challenging sequences while keeping the global

attributes diverse

9139

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT201516 dataset 60 sequences

9239

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Object annotation

bull Automatic bounding box placement

1 Segment the target (semi-automatic)

2 Automatically fit a bounding box by optimizing a cost function

9339

Kristan et al VOT2016 results

Sequence ranking

bull Among the most challenging sequences

bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)

9442

Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)

Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT Results Realtime

bull Flow-based Mean Shift-based Correlation filters

bull Engineering use of basic features

2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)

2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)

ASMSBDF

FoT

2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)

PLTFoT

CCMS

9639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Results

bull C-COT slightly ahead of TCNN

bull Most accurate SSAT

bull Most robust C-COT and MLDF

Overlap curves

9742

(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple

C-COT

TCNNSSAT

MLDF

AR-raw plot

Acc

ura

cy

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Tracking speed

bull Top-performers slowest bull Plausible cause CNN

bull Real-time bound Staple+

bull Decent accuracy

bull Decent robustness

Note the speed in some Matlab trackers has been significantly underestimated by the toolkit since it was measuring also the Matlabrestart time The EFOs of Matlab trackers are in fact higher than stated in this figure

9842

C-COT TCNNSSAT MLDF

Staple+

Staple+

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT public resources

bull Resources publicly available VOT page

bull Raw results of all tested trackers

bull Relevant methodology papers

bull 2016 Submitted trackers codebinaries

bull All fully annotated datasets (2013-2016)

bull Documentation tutorials forum

9939

Summary

bull ldquoVisual Trackingrdquo may refer to quite different problems

bull The area is just starting to be affected by CNNs

bull Robustness at all levels is the road to reliable performance

bull Key components of trackers

ndash target learning (modelling ldquotemplate updaterdquo)

ndash integration of detection and temporal smoothness assumptions

ndash representation of the image and target

bull Be careful when evaluating tracking results

100150

Computer vision courses online

bull httpscwfelcvutczwikicoursesucuws17start UCU Winter

school course

bull httpcsbrowneducoursescs143

bull httpswwwudacitycomcourseintroduction-to-computer-vision-

-ud810

bull httpcs231nstanfordedu

bull httpvisionstanfordeduteachingcs131_fall1415indexhtml

101150

A bit of self-PR

Center for Machine Perception

Department of Cybernetics

Faculty of Electrical Engineering

Czech Technical University Prague

established 1707

Area of Expertise Computer Vision Image Processing Pattern Recognition

People12 Academics 3 full 2 associate 7 assistant professors15 Researchers15 Phd students 15 MSc and BSc students

Impact amp Excellencesignificant of funding (gt75) from EU and private companies collaboration with high-tech companies (Microsoft Google Toyota VW Daimler Samsung Boeing Hitachi Honeywell) numerous science prizes hundreds of ISI citations to our publications per year competitive results in contests

Visual Recognition Group (headJiri Matas)

We are looking for 1-3

good students for PhD

THANK YOU

Questions please

105150

Kernelized Correlation Filters

bull The 119844119857119857prime is kernel correlation of two vectors x and xrsquo

bull For Gaussian kernel it yields

119844119857119857prime = exp minus1

2 119857 2+ 119857prime 2 minus 2ℱminus1 ො119857lowast ⊙ ො119857prime

bull Evaluation on subwindows of image z with classifier 120630 and model x

1 119870119859 = 119862 119844119857119859

2 119839(119859) = ℱminus1 መ119844119857119859 ⊙ ෝ120514

bull Update classifier 120630 and model x by linear interpolation from the

location of maximum response f(z)

bull Kernel allows integration of more complex and multi-channel

features

119896119894119857119857prime = (119857prime 119875119894minus1119857)

multiple channels can be concatenated to the vector x and then sum over in this term

78

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic

bull Henriques et al ndash CSK

ndash raw grayscale pixel values as features

bull Henriques et al ndash KCF

ndash HoG multi-channel features

Further work

bull Danelljan et al ndash DSST

ndash PCA-HoG + grayscale pixels features

ndash filters for translation and for scale (in the scale-space pyramid)

bull Li et al ndash SAMF

ndash HoG color-naming and grayscale pixels features

ndash quantize scale space and normalize each scale to one size by bilinear

inter rarr only one filter on normalized size

80

Discriminative Correlation Filters Trackers

bull Danelljan et al ndashSRDCF

ndash spatial regularization in the learning process

rarr limits boundary effect

rarr penalize filter coefficients depending on their spatial location

ndash allows to use much larger search region

ndash more discriminative to background (more training data)

CNN-based Correlation Trackers

bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT

2016)

bull Ma et al

ndash features VGG-Net pretrained on ImageNet dataset extracted from

third fourth and fifth convolution layer

ndash for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)

bull Nam et al ndash MDNet T-CNN

ndash CNN classification (3 convolution layers and 2 fully connected layers)

81

Evaluation of Trackers

Tracking Which methods work

83150

Tracking Which methods work

84150

What works ldquoThe zero-order trackerrdquo

85150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

3000

1500

51 Coauthors 14pgs

ICCV2013

57 Coauthors 27pgs

ECCV2014

128 Coauthors 24pgs

ICCV2015

141 Coauthors 44pgs

ECCV2016

+ VOT-TIRpaper

(69 coauth)

+ VOT-TIRpaper

(70 coauth)

VOT2014submission deadline

VOT2015submission deadline

VOT2016submission deadline

VOT2013submission deadline

8639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT challenge evolution

bull Gradual increase of dataset size

bull Gradual refinement of dataset construction

bull Gradual refinement of performance measures

bull Gradual increase of tested trackers

Perf Measures Dataset size Target box Property Trackers tested

VOT2013 ranks A R 16 s manual manual per frame 27

VOT2014 ranks A R EFO 25 s manual manual per frame 38

VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR

VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR

8739

Class of trackers tested

bull Single-object single-camera

bull Short-term

ndashTrackers performing without re-detection

bull Causality

ndashTracker is not allowed to use any future frames

bull No prior knowledge about the target

ndashOnly a single training example ndash BBox in the first frame

bull Object state encoded by a bounding box

88150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (13) Sequence candidates

ALOV (315 seq)[Smeulders et al2013]

Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences

Example Poorly defined target Example Artificially created

356 sequences

PTR (~50 seq)[Vojir et al2013]

+OTB (~50 seq)

[Wu et al2013]

+

gt30 new sequencesfrom VOT2015

committee

+

443sequences

VOT Automatic Dataset Construction Protocol

cluster + sample

8939

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (23) Clustering

bull Approximately annotate targets

bull 11 global attributes estimated

automatically for 356 sequences (eg blur camera motion object motion)

bull Cluster into K = 28 groups (automatic selection of K)

Feature encoding

11 dim

Affinity Propagation [Frey Dueck 2007]

Cluster similar sequences

9039

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (33) Sampling

bull Requirement

bull Diverse visual attributes

bull Challenging subset

bull Global visual attributes computed

bull Tracking difficulty attribute Applied FoT ASMS KCF trackers

bull Developed a sampling strategy that sampled

challenging sequences while keeping the global

attributes diverse

9139

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT201516 dataset 60 sequences

9239

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Object annotation

bull Automatic bounding box placement

1 Segment the target (semi-automatic)

2 Automatically fit a bounding box by optimizing a cost function

9339

Kristan et al VOT2016 results

Sequence ranking

bull Among the most challenging sequences

bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)

9442

Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)

Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT Results Realtime

bull Flow-based Mean Shift-based Correlation filters

bull Engineering use of basic features

2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)

2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)

ASMSBDF

FoT

2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)

PLTFoT

CCMS

9639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Results

bull C-COT slightly ahead of TCNN

bull Most accurate SSAT

bull Most robust C-COT and MLDF

Overlap curves

9742

(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple

C-COT

TCNNSSAT

MLDF

AR-raw plot

Acc

ura

cy

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Tracking speed

bull Top-performers slowest bull Plausible cause CNN

bull Real-time bound Staple+

bull Decent accuracy

bull Decent robustness

Note the speed in some Matlab trackers has been significantly underestimated by the toolkit since it was measuring also the Matlabrestart time The EFOs of Matlab trackers are in fact higher than stated in this figure

9842

C-COT TCNNSSAT MLDF

Staple+

Staple+

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT public resources

bull Resources publicly available VOT page

bull Raw results of all tested trackers

bull Relevant methodology papers

bull 2016 Submitted trackers codebinaries

bull All fully annotated datasets (2013-2016)

bull Documentation tutorials forum

9939

Summary

bull ldquoVisual Trackingrdquo may refer to quite different problems

bull The area is just starting to be affected by CNNs

bull Robustness at all levels is the road to reliable performance

bull Key components of trackers

ndash target learning (modelling ldquotemplate updaterdquo)

ndash integration of detection and temporal smoothness assumptions

ndash representation of the image and target

bull Be careful when evaluating tracking results

100150

Computer vision courses online

bull httpscwfelcvutczwikicoursesucuws17start UCU Winter

school course

bull httpcsbrowneducoursescs143

bull httpswwwudacitycomcourseintroduction-to-computer-vision-

-ud810

bull httpcs231nstanfordedu

bull httpvisionstanfordeduteachingcs131_fall1415indexhtml

101150

A bit of self-PR

Center for Machine Perception

Department of Cybernetics

Faculty of Electrical Engineering

Czech Technical University Prague

established 1707

Area of Expertise Computer Vision Image Processing Pattern Recognition

People12 Academics 3 full 2 associate 7 assistant professors15 Researchers15 Phd students 15 MSc and BSc students

Impact amp Excellencesignificant of funding (gt75) from EU and private companies collaboration with high-tech companies (Microsoft Google Toyota VW Daimler Samsung Boeing Hitachi Honeywell) numerous science prizes hundreds of ISI citations to our publications per year competitive results in contests

Visual Recognition Group (headJiri Matas)

We are looking for 1-3

good students for PhD

THANK YOU

Questions please

105150

Kernelized Correlation Filters

KCF Tracker

bull very few

hyperparameters

bull code fits on one slide

of the presentation

bull Use HoG features

(32 channels)

bull ~300 FPS

bull Open-Source

(MatlabPythonJavaC)

function alphaf = train(x y sigma lambda)k = kernel_correlation(x x sigma)alphaf = fft2(y) (fft2(k) + lambda)

end

function y = detect(alphaf x z sigma)k = kernel_correlation(z x sigma)y = real(ifft2(alphaf fft2(k)))

end

function k = kernel_correlation(x1 x2 sigma)c = ifft2(sum(conj(fft2(x1)) fft2(x2) 3))d = x1()x1() + x2()x2() - 2 ck = exp(-1 sigma^2 abs(d) numel(d))

end

Training and detection (Matlab)

Sum over channel dimensionin kernel computation

79

From KCF to Discriminative CF trackers

Basic

bull Henriques et al ndash CSK

ndash raw grayscale pixel values as features

bull Henriques et al ndash KCF

ndash HoG multi-channel features

Further work

bull Danelljan et al ndash DSST

ndash PCA-HoG + grayscale pixels features

ndash filters for translation and for scale (in the scale-space pyramid)

bull Li et al ndash SAMF

ndash HoG color-naming and grayscale pixels features

ndash quantize scale space and normalize each scale to one size by bilinear

inter rarr only one filter on normalized size

80

Discriminative Correlation Filters Trackers

bull Danelljan et al ndashSRDCF

ndash spatial regularization in the learning process

rarr limits boundary effect

rarr penalize filter coefficients depending on their spatial location

ndash allows to use much larger search region

ndash more discriminative to background (more training data)

CNN-based Correlation Trackers

bull Danelljan et al ndash Deep SRDCF CCOT (best performance in VOT

2016)

bull Ma et al

ndash features VGG-Net pretrained on ImageNet dataset extracted from

third fourth and fifth convolution layer

ndash for each feature learn a linear correlation filter

CNN-based Trackers (not correlation based)

bull Nam et al ndash MDNet T-CNN

ndash CNN classification (3 convolution layers and 2 fully connected layers)

81

Evaluation of Trackers

Tracking Which methods work

83150

Tracking Which methods work

84150

What works ldquoThe zero-order trackerrdquo

85150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

3000

1500

51 Coauthors 14pgs

ICCV2013

57 Coauthors 27pgs

ECCV2014

128 Coauthors 24pgs

ICCV2015

141 Coauthors 44pgs

ECCV2016

+ VOT-TIRpaper

(69 coauth)

+ VOT-TIRpaper

(70 coauth)

VOT2014submission deadline

VOT2015submission deadline

VOT2016submission deadline

VOT2013submission deadline

8639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT challenge evolution

• Gradual increase of dataset size

• Gradual refinement of dataset construction

• Gradual refinement of performance measures

• Gradual increase of tested trackers

Challenge | Perf. measures | Dataset size | Target box | Property | Trackers tested
VOT2013 | ranks, A, R | 16, s manual | manual | per frame | 27
VOT2014 | ranks, A, R, EFO | 25, s manual | manual | per frame | 38
VOT2015 | EAO, A, R, EFO | 60, fully auto | manual | per frame | 62 VOT, 24 VOT-TIR
VOT2016 | EAO, A, R, EFO | 60, fully auto | auto | per frame | 70 VOT, 24 VOT-TIR
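
The measures in the table build on the overlap between the predicted and the ground-truth box. The sketch below assumes the usual VOT reading of A as accuracy (average overlap while tracking succeeds) and R as robustness (number of failures, i.e. frames with zero overlap). It is simplified: the reset-based protocol of the VOT toolkit, which re-initializes the tracker after a failure, is not reproduced, and the function names are illustrative.

```python
def iou(box_a, box_b):
    """Overlap (intersection over union) of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Average overlap over frames with non-zero overlap, plus the count of zero-overlap frames."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0]
    failures = len(overlaps) - len(tracked)
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    return accuracy, failures


# Toy example: perfect frame, half-shifted frame, lost target.
truth = [(0, 0, 10, 10), (0, 0, 10, 10), (0, 0, 10, 10)]
pred = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
acc, fails = accuracy_and_failures(pred, truth)
print(round(acc, 3), fails)
```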

Class of trackers tested

• Single-object, single-camera

• Short-term

– Trackers performing without re-detection

• Causality

– Tracker is not allowed to use any future frames

• No prior knowledge about the target

– Only a single training example: the bounding box in the first frame

• Object state encoded by a bounding box

Construction (1/3): Sequence candidates

• Candidate pool: 443 sequences

– ALOV (315 seq.) [Smeulders et al. 2013]

– + PTR (~50 seq.) [Vojir et al. 2013]

– + OTB (~50 seq.) [Wu et al. 2013]

– + >30 new sequences from the VOT2015 committee

• Filtered out:

– grayscale sequences

– targets of <400 pixels

– poorly-defined targets

– artificially created sequences

• 356 sequences remain as input to the VOT Automatic Dataset Construction Protocol: cluster + sample

[Figure: examples of a poorly defined target and of an artificially created sequence]

Construction (2/3): Clustering

• Approximately annotate targets

• 11 global attributes estimated automatically for 356 sequences (e.g. blur, camera motion, object motion)

• Cluster into K = 28 groups (automatic selection of K)

[Figure: pipeline from feature encoding (11-dim) through Affinity Propagation [Frey, Dueck 2007] to clusters of similar sequences]

Construction (3/3): Sampling

• Requirement:

– Diverse visual attributes

– Challenging subset

• Global visual attributes computed

• Tracking difficulty attribute: applied FoT, ASMS and KCF trackers

• Developed a sampling strategy that sampled challenging sequences while keeping the global attributes diverse
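
One way to picture such a strategy is to take the hardest sequences from every cluster, so that each cluster of similar sequences stays represented. The sketch below is a hypothetical selection rule for illustration only, not the actual VOT sampling procedure. The sequence names, cluster labels and difficulty scores are made up.

```python
from collections import defaultdict


def sample_challenging(cluster_of, difficulty, per_cluster=1):
    """Pick the hardest sequences from every cluster so all clusters stay represented."""
    by_cluster = defaultdict(list)
    for name, cluster in cluster_of.items():
        by_cluster[cluster].append(name)
    selected = []
    for cluster in sorted(by_cluster):
        # Highest difficulty first; ties broken by name to keep the result deterministic.
        ranked = sorted(by_cluster[cluster], key=lambda name: (-difficulty[name], name))
        selected.extend(ranked[:per_cluster])
    return selected


# Hypothetical toy data: cluster labels and a difficulty score per sequence
# (e.g. how often a set of baseline trackers failed on it).
cluster_of = {"seq_a": 0, "seq_b": 0, "seq_c": 1, "seq_d": 1, "seq_e": 2}
difficulty = {"seq_a": 0.1, "seq_b": 0.7, "seq_c": 0.4, "seq_d": 0.2, "seq_e": 0.05}
print(sample_challenging(cluster_of, difficulty))
```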

VOT2015/16 dataset: 60 sequences

Object annotation

• Automatic bounding box placement

1. Segment the target (semi-automatic)

2. Automatically fit a bounding box by optimizing a cost function
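
The slides do not spell out the cost function. As a rough sketch of step 2, the code below assumes a cost that counts foreground pixels left outside the box plus background pixels included inside it, and finds the best axis-aligned box by exhaustive search on a small binary mask. The function name `fit_box` and both weights are illustrative assumptions.

```python
def fit_box(mask, fg_outside_cost=1.0, bg_inside_cost=1.0):
    """Exhaustively search the axis-aligned box (top, left, bottom, right), inclusive,
    that minimises a weighted count of missed foreground and included background."""
    rows, cols = len(mask), len(mask[0])
    total_fg = sum(sum(row) for row in mask)
    best_cost, best_box = None, None
    for top in range(rows):
        for bottom in range(top, rows):
            for left in range(cols):
                for right in range(left, cols):
                    fg_inside = sum(
                        mask[r][c]
                        for r in range(top, bottom + 1)
                        for c in range(left, right + 1)
                    )
                    area = (bottom - top + 1) * (right - left + 1)
                    cost = (fg_outside_cost * (total_fg - fg_inside)
                            + bg_inside_cost * (area - fg_inside))
                    if best_cost is None or cost < best_cost:
                        best_cost, best_box = cost, (top, left, bottom, right)
    return best_box, best_cost


# Toy segmentation: a 2x3 target plus one stray foreground pixel.
mask = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 1],
]
box, cost = fit_box(mask)
print(box, cost)
```

With equal weights the stray pixel is left out, because including it would pull many background pixels into the box. Changing the two weights shifts that trade-off.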

Kristan et al., VOT2016 results

Sequence ranking

• Among the most challenging sequences: Matrix (A_f = 0.33, M_f = 57), Rabbit (A_f = 0.31, M_f = 43), Butterfly (A_f = 0.22, M_f = 45)

• Among the easiest sequences: Singer1 (A_f = 0.02, M_f = 4), Octopus (A_f = 0.01, M_f = 5), Sheep (A_f = 0.02, M_f = 15)

Main novelty – better ground truth

• Each frame manually per-pixel segmented

• B-boxes automatically generated from the segmentation

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters

• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)

2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)

2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)

VOT 2016: Results

• C-COT slightly ahead of TCNN

• Most accurate: SSAT

• Most robust: C-COT and MLDF

Top five: (1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

[Figures: overlap curves; AR-raw plot with accuracy on the vertical axis]

VOT 2016: Tracking speed

• Top performers are the slowest

– Plausible cause: CNN

• Real-time bound: Staple+

– Decent accuracy

– Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it also measured the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

VOT public resources

• Resources publicly available: VOT page

• Raw results of all tested trackers

• Relevant methodology papers

• 2016: submitted trackers code/binaries

• All fully annotated datasets (2013-2016)

• Documentation, tutorials, forum

Summary

• “Visual Tracking” may refer to quite different problems

• The area is just starting to be affected by CNNs

• Robustness at all levels is the road to reliable performance

• Key components of trackers:

– target learning (modelling, “template update”)

– integration of detection and temporal smoothness assumptions

– representation of the image and target

• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)

• http://cs.brown.edu/courses/cs143

• https://www.udacity.com/course/introduction-to-computer-vision--ud810

• http://cs231n.stanford.edu

• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception

Department of Cybernetics

Faculty of Electrical Engineering

Czech Technical University, Prague (established 1707)

Area of Expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please




Tracking Which methods work

83150

Tracking Which methods work

84150

What works ldquoThe zero-order trackerrdquo

85150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT community evolution

3000

1500

51 Coauthors 14pgs

ICCV2013

57 Coauthors 27pgs

ECCV2014

128 Coauthors 24pgs

ICCV2015

141 Coauthors 44pgs

ECCV2016

+ VOT-TIRpaper

(69 coauth)

+ VOT-TIRpaper

(70 coauth)

VOT2014submission deadline

VOT2015submission deadline

VOT2016submission deadline

VOT2013submission deadline

8639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT challenge evolution

bull Gradual increase of dataset size

bull Gradual refinement of dataset construction

bull Gradual refinement of performance measures

bull Gradual increase of tested trackers

Perf Measures Dataset size Target box Property Trackers tested

VOT2013 ranks A R 16 s manual manual per frame 27

VOT2014 ranks A R EFO 25 s manual manual per frame 38

VOT2015 EAO A R EFO 60 fully auto manual per frame 62 VOT 24 VOT-TIR

VOT2016 EAO A R EFO 60 fully auto auto per frame 70 VOT 24 VOT-TIR

8739

Class of trackers tested

bull Single-object single-camera

bull Short-term

ndashTrackers performing without re-detection

bull Causality

ndashTracker is not allowed to use any future frames

bull No prior knowledge about the target

ndashOnly a single training example ndash BBox in the first frame

bull Object state encoded by a bounding box

88150

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (13) Sequence candidates

ALOV (315 seq)[Smeulders et al2013]

Filtered outbull Grayscale sequencesbull lt400 pixels targetsbull Poorly-defined targetsbull Artificially created sequences

Example Poorly defined target Example Artificially created

356 sequences

PTR (~50 seq)[Vojir et al2013]

+OTB (~50 seq)

[Wu et al2013]

+

gt30 new sequencesfrom VOT2015

committee

+

443sequences

VOT Automatic Dataset Construction Protocol

cluster + sample

8939

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (23) Clustering

bull Approximately annotate targets

bull 11 global attributes estimated

automatically for 356 sequences (eg blur camera motion object motion)

bull Cluster into K = 28 groups (automatic selection of K)

Feature encoding

11 dim

Affinity Propagation [Frey Dueck 2007]

Cluster similar sequences

9039

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (33) Sampling

bull Requirement

bull Diverse visual attributes

bull Challenging subset

bull Global visual attributes computed

bull Tracking difficulty attribute Applied FoT ASMS KCF trackers

bull Developed a sampling strategy that sampled

challenging sequences while keeping the global

attributes diverse

9139

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT201516 dataset 60 sequences

9239

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Object annotation

bull Automatic bounding box placement

1 Segment the target (semi-automatic)

2 Automatically fit a bounding box by optimizing a cost function

9339

Kristan et al VOT2016 results

Sequence ranking

bull Among the most challenging sequences

bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)

9442

Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)

Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT Results: Realtime

• Flow-based, Mean Shift-based, Correlation filters

• Engineering use of basic features

2013: PLT (~169 fps), FoT (~156 fps), CCMS (~57 fps)

2014: FoT (~190 fps), PLT (~112 fps), KCF (~36 fps)

2015: ASMS (~172 fps), BDF (~300 fps), FoT (~190 fps)


VOT 2016 Results

• C-COT slightly ahead of TCNN

• Most accurate: SSAT

• Most robust: C-COT and MLDF

(1) C-COT, (2) TCNN, (3) SSAT, (4) MLDF, (5) Staple

Figures: overlap curves; AR-raw plot (accuracy axis), with C-COT, TCNN, SSAT and MLDF marked


VOT 2016: Tracking speed

• Top performers are the slowest

  – Plausible cause: CNN

• Real-time bound: Staple+

  – Decent accuracy

  – Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

Figure: speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked


VOT public resources

• Resources publicly available: VOT page

• Raw results of all tested trackers

• Relevant methodology papers

• 2016: submitted trackers code/binaries

• All fully annotated datasets (2013-2016)

• Documentation, tutorials, forum


Summary

• “Visual Tracking” may refer to quite different problems

• The area is just starting to be affected by CNNs

• Robustness at all levels is the road to reliable performance

• Key components of trackers:

  – target learning (modelling, “template update”)

  – integration of detection and temporal smoothness assumptions

  – representation of the image and target

• Be careful when evaluating tracking results


Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)

• http://cs.brown.edu/courses/cs143/

• https://www.udacity.com/course/introduction-to-computer-vision--ud810

• http://cs231n.stanford.edu/

• http://vision.stanford.edu/teaching/cs131_fall1415/index.html


A bit of self-PR

Center for Machine Perception

Department of Cybernetics

Faculty of Electrical Engineering

Czech Technical University Prague

established 1707

Area of Expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions please


Tracking: Which methods work?

What works? “The zero-order tracker”

VOT community evolution

• ICCV2013: 51 coauthors, 14 pages

• ECCV2014: 57 coauthors, 27 pages

• ICCV2015: 128 coauthors, 24 pages

• ECCV2016: 141 coauthors, 44 pages

• Plus two VOT-TIR papers (69 and 70 coauthors)

Figure: plot over time with the VOT2013, VOT2014, VOT2015 and VOT2016 submission deadlines marked (vertical-axis ticks at 1500 and 3000)


VOT challenge evolution

• Gradual increase of dataset size

• Gradual refinement of dataset construction

• Gradual refinement of performance measures

• Gradual increase of tested trackers

Challenge | Perf. measures   | Dataset size   | Target box | Property  | Trackers tested

VOT2013   | ranks, A, R      | 16, s. manual  | manual     | per frame | 27

VOT2014   | ranks, A, R, EFO | 25, s. manual  | manual     | per frame | 38

VOT2015   | EAO, A, R, EFO   | 60, fully auto | manual     | per frame | 62 VOT, 24 VOT-TIR

VOT2016   | EAO, A, R, EFO   | 60, fully auto | auto       | per frame | 70 VOT, 24 VOT-TIR
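To make the A and R columns of the table concrete, here is a simplified sketch, assuming A (accuracy) is the mean bounding-box overlap over frames where the target is still tracked and R (robustness) is counted as the number of frames with zero overlap. The actual VOT protocol also re-initializes the tracker after a failure and skips some frames afterwards; that bookkeeping is deliberately left out, and the function names are illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x, y, width, height)."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))  # overlap width
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))  # overlap height
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def accuracy_and_failures(predicted, ground_truth):
    """Mean overlap on tracked frames, and the count of zero-overlap frames."""
    overlaps = [iou(p, g) for p, g in zip(predicted, ground_truth)]
    tracked = [o for o in overlaps if o > 0.0]
    accuracy = sum(tracked) / len(tracked) if tracked else 0.0
    failures = sum(1 for o in overlaps if o == 0.0)
    return accuracy, failures


# Toy usage: perfect hit, half-shifted box, complete miss.
truth = [(0, 0, 10, 10)] * 3
boxes = [(0, 0, 10, 10), (5, 0, 10, 10), (20, 20, 10, 10)]
print(accuracy_and_failures(boxes, truth))
```

Reporting the two numbers separately keeps a tracker that drifts slightly on every frame distinguishable from one that is precise but loses the target often.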


Class of trackers tested

• Single-object, single-camera

• Short-term

  – Trackers performing without re-detection

• Causality

  – Tracker is not allowed to use any future frames

• No prior knowledge about the target

  – Only a single training example – BBox in the first frame

• Object state encoded by a bounding box
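These constraints translate into a very small interface, sketched below under assumed names (`init`, `update`, `run_causal` and `StaticTracker` are hypothetical, not taken from the VOT toolkit). The tracker receives one bounding box with the first frame and is then fed frames strictly one at a time, so it cannot look ahead.

```python
class StaticTracker:
    """Trivial baseline: keeps reporting the box it was initialized with."""

    def init(self, frame, box):
        # The single training example: the box in the first frame.
        self.box = box

    def update(self, frame):
        # Only the current frame is visible here (causality).
        return self.box


def run_causal(tracker, frames, first_box):
    """Feed frames in order and collect one bounding box per frame."""
    tracker.init(frames[0], first_box)
    results = [first_box]
    for frame in frames[1:]:
        results.append(tracker.update(frame))
    return results


# Toy usage: frames are placeholders, boxes are (x, y, width, height).
frames = ["frame0", "frame1", "frame2"]
print(run_causal(StaticTracker(), frames, (1, 2, 3, 4)))
```

Any real short-term tracker would replace `StaticTracker` while keeping the same two calls, which is what lets a benchmark run very different methods through one loop.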


Construction (1/3): Sequence candidates

Candidate pool: ALOV (315 seq.) [Smeulders et al. 2013] + PTR (~50 seq.) [Vojir et al. 2013] + OTB (~50 seq.) [Wu et al. 2013] + >30 new sequences from the VOT2015 committee, giving 443 sequences

Filtered out:

• Grayscale sequences

• <400 pixels targets

• Poorly-defined targets

• Artificially created sequences

Examples shown: poorly defined target; artificially created sequence

After filtering: 356 sequences









VOT Automatic Dataset Construction Protocol

cluster + sample

8939

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (23) Clustering

bull Approximately annotate targets

bull 11 global attributes estimated

automatically for 356 sequences (eg blur camera motion object motion)

bull Cluster into K = 28 groups (automatic selection of K)

Feature encoding

11 dim

Affinity Propagation [Frey Dueck 2007]

Cluster similar sequences

9039

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Construction (33) Sampling

bull Requirement

bull Diverse visual attributes

bull Challenging subset

bull Global visual attributes computed

bull Tracking difficulty attribute Applied FoT ASMS KCF trackers

bull Developed a sampling strategy that sampled

challenging sequences while keeping the global

attributes diverse

9139

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT201516 dataset 60 sequences

9239

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

Object annotation

bull Automatic bounding box placement

1 Segment the target (semi-automatic)

2 Automatically fit a bounding box by optimizing a cost function

9339

Kristan et al VOT2016 results

Sequence ranking

bull Among the most challenging sequences

bull Among the easiest sequencesSinger1 (119860119891 = 002119872119891 = 4) Octopus (119860119891 = 001119872119891 = 5) Sheep (119860119891 = 002119872119891 = 15)

9442

Matrix (119860119891 = 033 119872119891 = 57) Rabbit(119860119891 = 031 119872119891 = 43) Butterfly (119860119891 = 022119872119891 = 45)

Main novelty ndash better ground truth bull Each frame manually per-pixel segmentedbull B-boxes automatically generated from the segmentation

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT Results Realtime

bull Flow-based Mean Shift-based Correlation filters

bull Engineering use of basic features

2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)

2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)

ASMSBDF

FoT

2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)

PLTFoT

CCMS

9639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Results

bull C-COT slightly ahead of TCNN

bull Most accurate SSAT

bull Most robust C-COT and MLDF

Overlap curves

9742

(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple

C-COT

TCNNSSAT

MLDF

AR-raw plot

Acc

ura

cy

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Tracking speed

• Top performers are the slowest

  - Plausible cause: CNN

• Real-time bound: Staple+

  - Decent accuracy

  - Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it also measured the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

[Figure: tracking speed plot with C-COT, TCNN, SSAT, MLDF and Staple+ marked]

VOT public resources

• Resources publicly available: VOT page

• Raw results of all tested trackers

• Relevant methodology papers

• 2016: Submitted trackers code/binaries

• All fully annotated datasets (2013-2016)

• Documentation, tutorials, forum


Summary

• "Visual Tracking" may refer to quite different problems

• The area is just starting to be affected by CNNs

• Robustness at all levels is the road to reliable performance

• Key components of trackers:

  - target learning (modelling, "template update")

  - integration of detection and temporal smoothness assumptions

  - representation of the image and target

• Be careful when evaluating tracking results


Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start (UCU Winter school course)

• http://cs.brown.edu/courses/cs143/

• https://www.udacity.com/course/introduction-to-computer-vision--ud810

• http://cs231n.stanford.edu/

• http://vision.stanford.edu/teaching/cs131_fall1415/index.html


A bit of self-PR

Center for Machine Perception

Department of Cybernetics

Faculty of Electrical Engineering

Czech Technical University, Prague

established 1707

Area of Expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 academics (3 full, 2 associate, 7 assistant professors), 15 researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions please










Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT Results Realtime

bull Flow-based Mean Shift-based Correlation filters

bull Engineering use of basic features

2014FoT (~190 fps)PLT (~112 fps)KCF (~36 fps)

2015ASMS (~172 fps)BDF (~300 fps)FoT (~190 pfs)

ASMSBDF

FoT

2013PLT (~169 fps)FoT (~156 fps)CCMS(~57 fps)

PLTFoT

CCMS

9639

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016 Results

bull C-COT slightly ahead of TCNN

bull Most accurate SSAT

bull Most robust C-COT and MLDF

Overlap curves

9742

(1) C-COT(2) TCNN(3) SSAT(4) MLDF(5) Staple

C-COT

TCNNSSAT

MLDF

AR-raw plot

Acc

ura

cy

Matej Kristan matejkristanfriuni-ljsi DPAEV Workshop ECCV 2016

VOT 2016: Tracking speed

• Top performers are the slowest

• Plausible cause: CNN

• Real-time bound: Staple+

• Decent accuracy

• Decent robustness

Note: the speed of some Matlab trackers has been significantly underestimated by the toolkit, since it was also measuring the Matlab restart time. The EFOs of Matlab trackers are in fact higher than stated in this figure.

VOT public resources

• Resources publicly available: VOT page

• Raw results of all tested trackers

• Relevant methodology papers

• 2016: submitted trackers code/binaries

• All fully annotated datasets (2013-2016)

• Documentation, tutorials, forum

Summary

• “Visual Tracking” may refer to quite different problems

• The area is just starting to be affected by CNNs

• Robustness at all levels is the road to reliable performance

• Key components of trackers

– target learning (modelling, “template update”)

– integration of detection and temporal smoothness assumptions

– representation of the image and target

• Be careful when evaluating tracking results

Computer vision courses online

• https://cw.fel.cvut.cz/wiki/courses/ucuws17/start UCU Winter school course

• http://cs.brown.edu/courses/cs143

• https://www.udacity.com/course/introduction-to-computer-vision--ud810

• http://cs231n.stanford.edu

• http://vision.stanford.edu/teaching/cs131_fall1415/index.html

A bit of self-PR

Center for Machine Perception

Department of Cybernetics

Faculty of Electrical Engineering

Czech Technical University, Prague

established 1707

Area of Expertise: Computer Vision, Image Processing, Pattern Recognition

People: 12 Academics (3 full, 2 associate, 7 assistant professors), 15 Researchers, 15 PhD students, 15 MSc and BSc students

Impact & Excellence: significant share of funding (>75%) from EU and private companies; collaboration with high-tech companies (Microsoft, Google, Toyota, VW, Daimler, Samsung, Boeing, Hitachi, Honeywell); numerous science prizes; hundreds of ISI citations to our publications per year; competitive results in contests

Visual Recognition Group (head: Jiri Matas)

We are looking for 1-3 good students for PhD

THANK YOU

Questions, please
