Multi-Target Tracking by Online Learning of Non-linear Motion Patterns and Robust Appearance Models

Bo Yang and Ram Nevatia
Institute for Robotics and Intelligent Systems, University of Southern California

Los Angeles, CA 90089, USA
{yangbo|nevatia}@usc.edu

Abstract

We describe an online approach to learn non-linear motion patterns and robust appearance models for multi-target tracking in a tracklet association framework. Unlike most previous approaches, which use linear motion methods only, we build a non-linear motion map online to better explain direction changes and produce more robust motion affinities between tracklets. Moreover, based on the incrementally learned entry/exit map, a multiple instance learning method is devised to produce strong appearance models for tracking; positive sample pairs are collected from different tracklets so that training samples have high diversity. Finally, using online learned moving groups, a tracklet completion process is introduced to deal with tracklets that do not reach entry/exit points. We evaluate our approach on three public data sets and show significant improvements compared with state-of-the-art methods.

1. Introduction

Multi-target tracking in real scenes is important for many applications, such as surveillance, robotics, and human-computer interactions. This problem becomes difficult in crowded scenes with frequent occlusions and interactions among the targets.

In recent years, improvement in target detection has brought a trend of association based tracking approaches, which associate corresponding detection responses or tracklets (track fragments) into longer tracks [12, 16, 5]. Affinities between tracklets, i.e., the linking probabilities, are often evaluated as

$$P_{link}(T_i, T_j) = A_m(T_i, T_j)\, A_a(T_i, T_j)\, A_t(T_i, T_j) \quad (1)$$

where $A_m(\cdot)$, $A_a(\cdot)$, and $A_t(\cdot)$ indicate the motion, appearance, and temporal affinities between tracklets $T_i$ and $T_j$. A Hungarian algorithm is often used to find the global optimum [12, 6]. Though much progress has been made, motion affinity estimation and appearance modeling remain key issues that limit performance.
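To make the association step concrete, here is a minimal sketch of how pairwise link probabilities could be combined and globally assigned; it is not the paper's implementation. The three affinity terms are placeholders for the models developed in Sections 3-5, and SciPy's Hungarian solver (`linear_sum_assignment`) stands in for the assignment step.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def link_probability(motion_aff, appearance_aff, temporal_aff):
    # Equ. 1: P_link(T_i, T_j) = A_m(T_i, T_j) * A_a(T_i, T_j) * A_t(T_i, T_j)
    return motion_aff * appearance_aff * temporal_aff

def associate(num_tracklets, affinity, min_prob=1e-6):
    """Global association: affinity(i, j) returns P_link for linking the tail
    of tracklet i to the head of tracklet j."""
    big = 1e9  # cost for forbidden links
    cost = np.full((num_tracklets, num_tracklets), big)
    for i in range(num_tracklets):
        for j in range(num_tracklets):
            p = affinity(i, j) if i != j else 0.0
            if p > min_prob:
                cost[i, j] = -np.log(p)  # maximizing the product of probabilities
    rows, cols = linear_sum_assignment(cost)
    # Keep only assignments that correspond to allowed (finite-cost) links.
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < big]
```

In this sketch, `affinity` would multiply the three terms of Equ. 1 via `link_probability`.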

A linear motion model is commonly assumed for each target [18, 16, 3, 6]. However, as shown in Figure 1(a), there are often several non-linear motion patterns in a scene. Appearance models are often pre-defined [18, 16] or learned online from a few neighboring frames [6, 15]; tracklets with long gaps are difficult to associate due to appearance changes.

Figure 1. Examples of useful information for tracking: (a) motion patterns, (b) entry/exit points, (c) moving groups.

Fortunately, there is often useful knowledge in the scene, such as motion patterns, entry/exit points, and moving groups, as shown in Figure 1, for solving the above problems. We describe an online learning method which automatically finds dynamic non-linear motion patterns and robust appearance models to improve tracking performance.

The framework of our approach is shown in Figure 2. Similar to [5], detection responses are first linked into tracklets by a low level association approach. Based on confident tracklets1, we learn online a non-linear motion map, which is a set of non-linear motion patterns, e.g., the orange tracklet in Figure 2, used for explaining non-linear gaps between other tracklets, e.g., the gap between the two blue tracklets in Figure 2. For efficiency, our tracking is done in sliding windows one by one; the non-linear motion map is updated by re-learning for each sliding window, as motion patterns may change with time.

Meanwhile, an online learning process is adopted to automatically find entry or exit points in a scene, shown as green masks in Figure 2. We limit our approach to static cameras, where entry/exit points do not change across sliding windows and can be learned incrementally. As shown in Figure 1(b), entry/exit points constrain the starting and ending of trajectories; a tracklet ending before reaching exit points should be associated with other tracklets.

1 The definition is given in Section 3.1.


Figure 2. Tracking framework of our approach. Colors of detection responses indicate their belonging tracklets. The non-linear motion map, entry/exit map, appearance models, and moving groups are all learned online without supervision. Best viewed in color.

We collect detection responses from different tracklets to form potential positive and negative bags, which are used in a multiple instance learning approach for building robust appearance models, so that tracklets tend to be associated until they reach entry/exit points, even under long gaps.

The Hungarian algorithm is utilized to find the global optimum; for the temporal affinity in Equ. 1, we use the same function as in [6]. For tracklets not reaching entry/exit points, we try to find whether they belong to any moving groups, as shown in Figure 1(c). The trajectories of observed targets in the same groups, e.g., the green tracklet in Figure 2, are used to complete those of occluded, unobserved ones, e.g., the pink tracklet in Figure 2, so that the latter can reach the entry/exit points.

The contributions of this paper are:

∙ An online learned non-linear motion map for explaining reasonable non-linear motions, as well as a new motion affinity estimation method.

∙ An incremental learning approach for an entry/exit map, and a multiple instance learning (MIL) algorithm for finding robust appearance models.

∙ An algorithm for automatically finding moving groups, which are further used for track completion.

The rest of this paper is organized as follows: related work is discussed in Section 2; learning of the motion map for motion affinities is given in Section 3; Section 4 describes the entry/exit map estimation and the MIL algorithm for appearance models; learning moving groups for tracklet completion is presented in Section 5; experiments are shown in Section 6, followed by the conclusion in Section 7.

2. Related work

Multi-target tracking has been a popular topic for several years. Visual tracking approaches often track each object separately [7, 21], but have difficulty dealing with large numbers of objects in crowded scenes. Association based approaches often optimize multiple targets globally and simultaneously [12, 16, 3, 20], and are therefore more robust to long occlusions and crowded scenes.

Motion patterns have attracted researchers' attention in recent years, and are often learned as priors for estimating each target's trajectory [4, 17, 13, 22]. It is often assumed that targets follow a similar motion pattern; trajectories not following the pattern are penalized. This assumption works well for extremely crowded scenes, where it is difficult for a single human to move against the main trend. However, in our scenario, persons may move freely. There may be many motion patterns, and a particular individual may follow any one or none of them. Our motion patterns are used to explain non-linear trajectories, but do not produce extra penalties for individuals not following them.

Robust appearance models are also key for multi-target tracking, and many methods collect training samples online from tracklets to learn appearance models [2, 5, 6, 8, 15]. However, these approaches collect positive samples from the same tracklets within a few neighboring frames; therefore, the positive samples often lack diversity. Once there are illumination or pose changes, tracklets belonging to the same target but separated by long gaps may have different appearance models, as positive samples always come from neighboring frames. In contrast, our method utilizes entry/exit points to find potential correct associations, and positive samples may come from tracklets with long gaps and thus have high diversity. To the best of our knowledge, there has been no explicit use of entry/exit points to improve appearance models in multi-target tracking tasks.

Human interactions have also been considered in tracking. Some approaches focus on using social behaviors to avoid collisions among a few persons [14, 11], or on finding moving groups of people to estimate the motion of each individual [10, 19]. However, in crowded scenes, heavy occlusions may cause very close persons to merge. In addition, in previous work moving groups are used as motion constraints when each member is clearly visible [10, 19], whereas we use moving groups to complete tracklets when they are not observable due to detector failure.

3. Motion map learning for motion affinities

In this section, we introduce an online learning approach to find reasonable non-linear motion patterns for each sliding window, and use them to produce more precise motion affinities, i.e., $A_m(\cdot)$ in Equ. 1.

3.1. Non-linear motion map learning

As modern detectors are quite robust, we safely assume that there are no long continuous runs of missing detections for a target if there are no occlusions. For unoccluded targets, it is easy to associate tracklets based on linear motion assumptions, as gaps are often quite small. However, over a long time span, these targets may follow non-linear patterns, which may provide guidance for associating other tracklets. Similar to [6], we adopt a multi-level association approach, i.e., we gradually associate tracklets in multiple steps instead of in a one-time association. Non-linear motion patterns learned from tracklets in the previous level are used for current level association.

For the current sliding window, a motion map $M = \{T_1, T_2, \ldots, T_m\}$ is defined as a set of tracklets that contain confident non-linear motion patterns. A tracklet $T_i = \{d_i^{t_s^i}, \ldots, d_i^{t_e^i}\}$ is a set of detection responses, or interpolated responses, in consecutive frames, where $t_s^i$ and $t_e^i$ denote the starting and ending frame numbers, and $d_i^t = \{p_i^t, s_i^t, v_i^t\}$ denotes the response at frame $t$, including position $p_i^t$, size $s_i^t$, and velocity vector $v_i^t$.
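The notation above maps onto a small bookkeeping structure; the sketch below is only an illustration, and the field and class names are ours, not the paper's.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

@dataclass
class Response:
    position: Tuple[float, float]   # p_i^t
    size: float                     # s_i^t
    velocity: Tuple[float, float]   # v_i^t

@dataclass
class Tracklet:
    # Detection (or interpolated) responses indexed by frame number,
    # covering consecutive frames from start_frame to end_frame.
    responses: Dict[int, Response]

    @property
    def start_frame(self) -> int:   # t_s^i
        return min(self.responses)

    @property
    def end_frame(self) -> int:     # t_e^i
        return max(self.responses)
```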

Due to possible false alarms or false associations, we build the motion map only on confident tracklets, each of which satisfies two constraints: 1) it is long enough, e.g., longer than 50 frames, as false tracklets are mostly short ones; 2) it is not occluded, or only lightly occluded, by other tracklets, e.g., at most 10% of its frames have visibility ratios less than 70%, as most association errors happen when there are heavy occlusions. Otherwise, a tracklet is classified as an unconfident one.
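A possible check for these two constraints, using the `Tracklet` sketch above; the thresholds mirror the example values in the text (50 frames, at most 10% of frames with visibility below 70%), and the per-frame visibility ratios are assumed to come from a separate occlusion-reasoning step.

```python
def is_confident(tracklet, visibility, min_length=50,
                 max_occluded_fraction=0.1, min_visibility=0.7):
    """visibility maps frame number -> visible fraction of the target's box."""
    length = tracklet.end_frame - tracklet.start_frame + 1
    if length <= min_length:
        return False  # false tracklets are mostly short ones
    occluded = sum(1 for t in tracklet.responses
                   if visibility.get(t, 1.0) < min_visibility)
    return occluded <= max_occluded_fraction * len(tracklet.responses)
```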

For each confident tracklet, we remove the linear motion parts at the head and at the tail. If the remaining part still satisfies a non-linear motion pattern, we put it into the motion map. The motion map learning algorithm is shown in Algorithm 1, where $\langle a, b \rangle$ denotes the angle between vectors $a$ and $b$, and $(x, y)$ denotes the vector from position $x$ to position $y$. The threshold $\theta$ is set to 10 degrees in our experiments.

As shown in Figure 2, the learned motion map $M$ is a union of existing non-linearly moving tracklets, which are used for explaining non-linear gaps between other tracklets in the next level of association.

3.2. Estimation of motion affinities

In most previous work [18, 3], motion affinity is estimated under a linear motion assumption, as shown in Figure 3. The affinity score is given as

$$G(p_{tail} + v_{tail}\Delta t - p_{head}, \Sigma_p)\, G(p_{head} - v_{head}\Delta t - p_{tail}, \Sigma_p) \quad (2)$$

Algorithm 1 The algorithm for learning the motion map.
Input: tracklets from the previous level $\{T_1, T_2, \ldots, T_n\}$.
Initialize the motion map: $M = \emptyset$.
For $i = 1, \ldots, n$ do:

∙ If $T_i$ is not a confident tracklet, continue to the next iteration.

∙ Initialize the non-linear motion start frame $t_s = t_e^i$ and end frame $t_e = t_s^i$.

∙ For $j = t_s^i + 1, \ldots, t_e^i$ do:
  – if $\langle v_i^j, (p_i^{t_s^i}, p_i^j) \rangle > \theta$, then $t_s = j - 1$, break.

∙ For $j = t_e^i - 1, \ldots, t_s$ do:
  – if $\langle v_i^j, (p_i^j, p_i^{t_e^i}) \rangle > \theta$, then $t_e = j + 1$, break.

∙ If $t_s < t_e$ and $\langle v_i^{t_s}, (p_i^{t_s}, p_i^{t_e}) \rangle > 2\theta$ and $\langle v_i^{t_e}, (p_i^{t_s}, p_i^{t_e}) \rangle > 2\theta$, then $M = M \cup \{T_i^*\}$, where $T_i^* = \{d_i^{t_s}, \ldots, d_i^{t_e}\}$.

Output: the motion map $M$.
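A rough Python transcription of Algorithm 1 might look as follows; it reuses the `Tracklet` structure and `is_confident` check sketched in Section 3.1 and treats positions and velocities as 2D tuples. It is a sketch of the procedure, not the authors' code.

```python
import math

def vec(p, q):
    """Vector from position p to position q."""
    return (q[0] - p[0], q[1] - p[1])

def angle_between(a, b):
    """Angle in degrees between 2D vectors a and b."""
    dot = a[0] * b[0] + a[1] * b[1]
    na, nb = math.hypot(*a), math.hypot(*b)
    if na == 0 or nb == 0:
        return 0.0
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (na * nb)))))

def learn_motion_map(tracklets, visibility, theta=10.0):
    motion_map = []
    for T in tracklets:
        if not is_confident(T, visibility):
            continue
        ts, te = T.end_frame, T.start_frame
        # Strip the linear head: find the first frame whose velocity deviates
        # from the chord starting at the tracklet's first position.
        for j in range(T.start_frame + 1, T.end_frame + 1):
            chord = vec(T.responses[T.start_frame].position, T.responses[j].position)
            if angle_between(T.responses[j].velocity, chord) > theta:
                ts = j - 1
                break
        # Strip the linear tail symmetrically.
        for j in range(T.end_frame - 1, ts - 1, -1):
            chord = vec(T.responses[j].position, T.responses[T.end_frame].position)
            if angle_between(T.responses[j].velocity, chord) > theta:
                te = j + 1
                break
        # Keep the remaining segment only if it is genuinely non-linear.
        if ts < te:
            chord = vec(T.responses[ts].position, T.responses[te].position)
            if (angle_between(T.responses[ts].velocity, chord) > 2 * theta
                    and angle_between(T.responses[te].velocity, chord) > 2 * theta):
                motion_map.append(Tracklet({t: T.responses[t] for t in range(ts, te + 1)}))
    return motion_map
```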

Figure 3. Estimation of motion affinity using linear assumptions.

Figure 4. Estimation of motion affinity using the motion map (a pattern tracklet T0 matched to the tail of T1 and the head of T2 yields an estimated path, compared with the positions estimated by the linear motion assumption).

where $\Delta t$ is the frame difference between $p_{tail}$ and $p_{head}$, and $G(\cdot, \Sigma)$ is the zero-mean Gaussian function.
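A possible implementation of the linear-motion affinity of Equ. 2, with an isotropic position covariance as a simplifying assumption (the paper does not specify $\Sigma_p$; the value `sigma=15.0` below is ours):

```python
import numpy as np

def gaussian(err, sigma):
    """Zero-mean Gaussian density of a 2D error vector with isotropic std sigma."""
    err = np.asarray(err, dtype=float)
    return np.exp(-err.dot(err) / (2.0 * sigma ** 2)) / (2.0 * np.pi * sigma ** 2)

def linear_motion_affinity(p_tail, v_tail, p_head, v_head, dt, sigma=15.0):
    """Equ. 2: forward prediction from T1's tail and backward prediction from
    T2's head should both land near the observed endpoints."""
    p_tail, v_tail = np.asarray(p_tail), np.asarray(v_tail)
    p_head, v_head = np.asarray(p_head), np.asarray(v_head)
    forward_err = p_tail + dt * v_tail - p_head
    backward_err = p_head - dt * v_head - p_tail
    return gaussian(forward_err, sigma) * gaussian(backward_err, sigma)
```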

Using the linear motion assumption, the motion affinity between $T_1$ and $T_2$ in Figure 4 would be quite low. However, a non-linear connection between $T_1$ and $T_2$ may be explained by a pattern tracklet $T_0 \in M$ which is a matched tracklet for the tail response of $T_1$ and the head response of $T_2$. $T_0$ is a matched tracklet for a response $d = \{p, s, v\}$ if

$$\exists d_i \in T_0: \ \|p_i - p\| < \omega \cdot \min\{s_i, s\} \ \text{and} \ \langle v_i, v \rangle < \theta \quad (3)$$

where $\omega$ is a weight factor set to 0.5 in our experiments, and $\theta$ is the same as used in Algorithm 1. The estimated path is then obtained by a quadratic function that best satisfies the positions and velocities at the tail of $T_1$ and the head of $T_2$. The path is valid only if $T_0$ is a matched tracklet for each response on the estimated path. This ensures that the estimated path is similar enough to the non-linear pattern tracklet, as each pattern is only effective for explaining neighboring tracklets with similar motion directions.
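The matched-tracklet test of Equ. 3 and the validity check on the estimated path could be sketched as below; `angle_between` is the helper from the Algorithm 1 sketch, and the quadratic path fitting itself is left abstract since the paper does not give its exact parameterization.

```python
import math

def dist(p, q):
    return math.hypot(p[0] - q[0], p[1] - q[1])

def matches(pattern, response, omega=0.5, theta=10.0):
    """Equ. 3: pattern tracklet T0 matches a response d = {p, s, v} if some
    response in T0 is close in position and similar in moving direction."""
    for d0 in pattern.responses.values():
        close = dist(d0.position, response.position) < omega * min(d0.size, response.size)
        similar = angle_between(d0.velocity, response.velocity) < theta
        if close and similar:
            return True
    return False

def path_is_valid(pattern, estimated_path):
    """The non-linear explanation holds only if every response on the estimated
    path (e.g., sampled from the fitted quadratic) is matched by the pattern."""
    return all(matches(pattern, d) for d in estimated_path)
```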

We then still use a Gaussian function similar to that in Equ. 2, but based on the non-linear estimation of positions shown in Figure 4.

Figure 5. A non-linear association example in a real case (2D map; frames 1100 and 1160).

If multiple patterns exist, the one that produces the highest affinity score is used.

Note that, unlike previous work [17, 13, 22], we do not use motion patterns as priors to penalize tracklets that do not follow the patterns. Targets are not assumed to follow any non-linear pattern, but once an individual does, we reduce the penalty for that non-linear motion.

Figure 5 shows a non-linear motion example in a real case. From the 2D map, we see that there is a direction change between $T_{18}$ and $T_1$, though they are the same person. The linear motion assumption would produce a very low score for associating these two. However, our approach indicates that a confident tracklet $T_{16}$ well explains the direction change from $T_{18}$ to $T_1$, and therefore gives a high motion affinity between $T_{18}$ and $T_1$.

4. MIL using the entry/exit map

Appearance models play an important role in tracking. Most previous online learning approaches [2, 5, 6, 8] collect positive training samples, i.e., responses belonging to the same target, from the same tracklet within a few frames. However, these responses are likely quite similar and lack diversity. We further collect potential positive pairs from responses in different tracklets with longer gaps, according to the estimated entry/exit map, so that diversity is higher; a multiple instance learning approach is then proposed to obtain robust appearance models.

4.1. Incremental learning of the entry/exit map

An entry/exit map is a binary image $I$, with 0 denoting entry/exit points and 1 denoting all other positions. The entry/exit points are positions where a target enters or exits the video frames. We do not constrain the order; an entry point is also an exit point and vice versa. We limit our approach to static cameras, so that the entry/exit map does not change with time and can be learned incrementally.

We continuously add the starting and ending positions of confident tracklets to an entry/exit point set $E$, and the neighboring regions of these positions are treated as entry/exit points. We assume that all entry/exit points form a convex hull, i.e., all possible points are at the borders of the real 3D scene (not necessarily the borders of the 2D frames), and a target cannot enter or exit in the middle of the scene. Based on this assumption, we continuously update the entry/exit points by removing those inside the convex hull and adding those outside it. Figure 6 shows an example of the update process. The incremental learning algorithm for the entry/exit map is shown in Algorithm 2.

Figure 6. Estimation of entry/exit points. Each circle indicates an estimated entry/exit point; red polygons indicate the convex hulls of all estimated points; blue circles indicate points that are later removed; yellow circles indicate newly added ones.

Algorithm 2 Learning algorithm for the entry/exit map.
Input: confident tracklets obtained from the previous association level in the current sliding window, $\{T_1, T_2, \ldots, T_m\}$.
If this is the first sliding window, initialize the entry/exit point set $E = \emptyset$ and its convex hull $H = \emptyset$.
For $i = 1, \ldots, m$ do:

∙ If $p_i^{t_s^i}$ is outside $H$, $E = E \cup \{d_i^{t_s^i}\}$; if $p_i^{t_e^i}$ is outside $H$, $E = E \cup \{d_i^{t_e^i}\}$.

∙ Update the convex hull $H$ using the current $E$.

∙ $\forall d_i \in E$, if $p_i$ is inside $H$, $E = E - \{d_i\}$.

Output: the binary entry/exit map $I$, where $I(p) = 0$ if $\exists d_i \in E$ such that $\|p - p_i\| < s_i$, and $I(p) = 1$ otherwise.
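One way to realize Algorithm 2 is sketched below, using SciPy's convex-hull routines as stand-ins for the geometric tests; degenerate (e.g., collinear) point configurations are not handled, and the `dist` helper is from the earlier sketch.

```python
import numpy as np
from scipy.spatial import ConvexHull, Delaunay

def inside_hull(point, hull_points):
    """True if `point` lies inside the convex hull of `hull_points`."""
    if len(hull_points) < 3:
        return False  # too few points to form a hull: treat everything as outside
    return Delaunay(np.asarray(hull_points)).find_simplex(np.asarray(point)) >= 0

def update_entry_exit_points(confident_tracklets, entry_exit):
    """Incremental update of the entry/exit point set E (Algorithm 2).
    `entry_exit` is a list of (position, size) pairs carried across windows."""
    positions = [p for p, _ in entry_exit]
    for T in confident_tracklets:
        for frame in (T.start_frame, T.end_frame):
            d = T.responses[frame]
            if not inside_hull(d.position, positions):
                entry_exit.append((d.position, d.size))
                positions.append(d.position)
    # Keep only points on the convex hull of the current set; points strictly
    # inside the hull cannot be entry/exit points under the border assumption.
    if len(entry_exit) >= 3:
        hull = ConvexHull(np.asarray([p for p, _ in entry_exit]))
        entry_exit = [entry_exit[i] for i in hull.vertices]
    return entry_exit

def is_entry_exit(position, entry_exit):
    """Binary map lookup: I(p) = 0 (entry/exit) if p is within s_i of some point in E."""
    return any(dist(position, p) < s for p, s in entry_exit)
```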

4.2. Learning for appearance models

With the learned entry/exit map, we identify each tracklet as an entry tracklet, an exit tracklet, both, or neither.

Definition 1 An entry tracklet $T$ starts at an entry/exit point or at the beginning of the current sliding window; otherwise, it is a non-entry tracklet.

The definition of an exit tracklet is similar. Tracklets that are both entry and exit tracklets are called completed tracklets; otherwise, they are called uncompleted tracklets.

Ideally, any real track should be a completed tracklet. However, due to false alarms, appearance changes, and occlusions, there are usually many uncompleted tracklets. An uncompleted but confident tracklet, e.g., a non-exit tracklet, probably needs to be associated with other tracklets so that the linked tracklet becomes completed.

Figure 7 illustrates our collection of potential training pairs. For a non-exit confident tracklet $T_1$, we collect all tracklets whose motion affinities with $T_1$ are higher than a threshold $\gamma$, set to 0.2 in our experiments, as potential correct associations, e.g., $T_3$, $T_4$, and $T_5$. A positive bag $S$ is formed as $S = \{(d_1^i, d_3^j), (d_1^i, d_4^k), (d_1^i, d_5^m)\}$, where $d_1^i \in T_1$, $d_3^j \in T_3$, $d_4^k \in T_4$, and $d_5^m \in T_5$. At least one pair in a positive bag is a correct pair, so the bags are suitable for MIL. The positive bags provide more positive pairs from tracklets with long gaps, and make $T_1$ more likely to be associated with one of $T_3$, $T_4$, and $T_5$. Note that the unconfident tracklet $T_0$ ($T_5$ is similar) does not need to be associated with one of $T_3$, $T_4$, or $T_5$, as unconfident tracklets may be false alarms, but it may appear in the positive bags of other confident tracklets, e.g., $T_3$.
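Forming the potential positive bags might look like the sketch below; `motion_affinity` is the affinity from Section 3, and taking one tail/head response pair per candidate tracklet is a simplification, since the paper does not state how many pairs per tracklet enter a bag.

```python
def positive_bags(confident_non_exit, candidates, motion_affinity, gamma=0.2):
    """Build MIL positive bags: each bag pairs the tail response of a non-exit
    confident tracklet T1 with the head response of every candidate whose
    motion affinity to T1 exceeds gamma; at least one pair is assumed correct."""
    bags = []
    for T1 in confident_non_exit:
        partners = [T for T in candidates
                    if T is not T1 and motion_affinity(T1, T) > gamma]
        if not partners:
            continue
        d1 = T1.responses[T1.end_frame]            # tail response of T1
        bag = [(d1, T.responses[T.start_frame])    # head response of each partner
               for T in partners]
        bags.append((bag, 1))                      # bag label y_i = 1
    return bags
```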

Figure 7. Illustration of the training sample collection used in multiple instance learning. Colors of detection responses indicate their belonging tracklets. See text for details.

For negative training samples, we adopt the approach used in [6]: responses from tracklets having temporal overlaps are used to form negative training samples. The feature pool from [6] is used; it is based on color, shape, and texture, and features are extracted from pre-defined regions of human responses, e.g., the color histogram of the upper body.

We adopt the multiple instance learning framework used in [9], and the algorithm is shown in Algorithm 3, where $|x_i|$ denotes the number of elements in $x_i$ and $L$ is the log-likelihood of the bags to be maximized:

$$L(H) = \sum_i \big( y_i \log p_i + (1 - y_i) \log(1 - p_i) \big) \quad (4)$$

The learned classifier $H$ is used as the appearance model, and the appearance affinity scores are computed as in [6].

Algorithm 3 Learning algorithm for appearance models.
Input: training bags $B = \{(x_i = \{x_i^1, x_i^2, \ldots\}, y_i)\}$, where $x_i^j = \{d_p, d_q\}$ indicates a pair of responses and $y_i \in \{1, 0\}$; feature pool $F = \{h_1, h_2, \ldots, h_K\}$. Let the number of selected features be $T$ ($T < K$).
Initialize the classifier function $H = 0$.
For $t = 1, \ldots, T$ do:

∙ For $k = 1, \ldots, K$ do:
  – $p_{ij}^k = 1 / (1 + \exp(-(H(x_i^j) + h_k(x_i^j))))$
  – $p_i^k = 1 - \prod_j (1 - p_{ij}^k)^{1/|x_i|}$
  – $\omega_{ij}^k = (y_i - p_i^k)\, p_{ij}^k / (|x_i|\, p_i^k)$

∙ Find $k^* = \arg\max_k \sum_{ij} \omega_{ij}^k\, h_k(x_i^j)$; set $h_t = h_{k^*}$.

∙ Find $\alpha^* = \arg\max_\alpha L(H + \alpha h_t)$ by line search; set $\alpha_t = \alpha^*$.

∙ $H = H + \alpha_t h_t$

Output: $H = \sum_{t=1}^{T} \alpha_t h_t$
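The boosting loop of Algorithm 3 (following [9]) could be written roughly as below. Each weak classifier `h` maps a response pair to a real-valued score, and the line search over alpha is done on a fixed grid for simplicity; both choices are ours, not the paper's.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bag_probability(score_fn, pairs):
    # p_i = 1 - prod_j (1 - p_ij)^(1/|x_i|), with p_ij = sigmoid(score of pair j)
    p_ij = [sigmoid(score_fn(x)) for x in pairs]
    p_i = 1.0 - np.prod([1.0 - p for p in p_ij]) ** (1.0 / len(pairs))
    return min(max(p_i, 1e-6), 1.0 - 1e-6), p_ij

def bag_log_likelihood(score_fn, bags):
    # Equ. 4: L = sum_i y_i log p_i + (1 - y_i) log(1 - p_i)
    total = 0.0
    for pairs, y in bags:
        p_i, _ = bag_probability(score_fn, pairs)
        total += y * np.log(p_i) + (1 - y) * np.log(1.0 - p_i)
    return total

def mil_boost(bags, feature_pool, num_selected):
    """Greedily select weak classifiers and weights; returns [(alpha_t, h_t)]."""
    chosen = []
    H = lambda x: sum(a * h(x) for a, h in chosen)
    for _ in range(num_selected):
        best_h, best_score = None, -np.inf
        for h in feature_pool:
            score = 0.0
            for pairs, y in bags:
                p_i, p_ij = bag_probability(lambda x, h=h: H(x) + h(x), pairs)
                for x, pij in zip(pairs, p_ij):
                    w = (y - p_i) * pij / (len(pairs) * p_i)   # omega_ij^k
                    score += w * h(x)
            if score > best_score:
                best_h, best_score = h, score
        # Line search for alpha maximizing the bag log-likelihood (grid search here).
        alphas = np.linspace(0.0, 2.0, 21)
        lls = [bag_log_likelihood(lambda x, a=a: H(x) + a * best_h(x), bags)
               for a in alphas]
        chosen.append((float(alphas[int(np.argmax(lls))]), best_h))
    return chosen
```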

5. Track completion

After tracklet association, there are likely to be several uncompleted but confident tracklets. These are probably due to occlusion by targets in other tracklets or by cluttered background where the detector fails; therefore, it is difficult to observe them from detection responses. Fortunately, people often move in groups, and the group motion can provide priors for the motion of each member of the group.

Definition 2 A moving group is a group of people who move at similar speeds and in similar directions, and who keep close to each other.

Two tracklets $T_i$ and $T_j$ belong to the same moving group if they satisfy the following constraints (assuming $T_j$ is at least as long as $T_i$):

Figure 8. A moving group example (frames 2400, 2475, 2495). The occluded target is completed even though the detector fails after frame 2475.

$$t_s^i \geq t_s^j \ \text{and} \ t_e^i \leq t_e^j \ \text{and} \ t_e^i - t_s^i > \zeta$$
$$\forall d_i^t \in T_i: \ \|p_i^t - p_j^t\| < \xi \ \text{and} \ \sigma\big(\|p_i^t - p_j^t\|\big) < \tau \quad (5)$$

where $\zeta$ is a threshold for the minimum number of co-moving frames, $\xi$ is a threshold for the maximum distance, $\sigma$ is the standard deviation function, and $\tau$ is a threshold for the maximum standard deviation. In our experiments, we set $\zeta = 50$, $\xi = 0.5\min(s_i^t, s_j^t)$, and $\tau = 0.2\min(s_i^t, s_j^t)$. Equ. 5 ensures that the two tracklets are close enough for a certain time, indicating a possible moving group. If $T_i$ disappears before reaching the exit points but $T_j$ is still visible, we can complete $T_i$, i.e., make $T_i$ reach the entry/exit points, from the future positions of $T_j$ at frames $t > t_e^i$ as

$$p_i^t = p_j^t + \overline{p_i^k - p_j^k} \quad (6)$$

where $\overline{p_i^k - p_j^k}$ denotes the average difference vector between $p_i^k$ and $p_j^k$ over the last 20 co-moving frames. There may be multiple tracklets in a moving group, and the position is then taken as the average of the estimates from all of them. Note that there is a maximum completion length, set to 20 frames in our experiments. $T_i$ has to reach the entry/exit points after the completion process; otherwise, the completion is not applied, as moving groups may change over longer periods.
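The moving-group test of Equ. 5 and the completion rule of Equ. 6 could be sketched as follows, again on the `Tracklet` structure and `dist` helper used earlier. Taking the minimum size over all co-moving frames for the $\xi$ and $\tau$ thresholds is a simplification of the per-frame thresholds in the text.

```python
import numpy as np

def same_moving_group(Ti, Tj, zeta=50):
    """Equ. 5: Tj covers Ti in time, they co-move long enough, and their
    distance stays both small and stable (Tj assumed at least as long as Ti)."""
    if not (Ti.start_frame >= Tj.start_frame and Ti.end_frame <= Tj.end_frame
            and Ti.end_frame - Ti.start_frame > zeta):
        return False
    dists, sizes = [], []
    for t, di in Ti.responses.items():
        dj = Tj.responses[t]
        dists.append(dist(di.position, dj.position))
        sizes.append(min(di.size, dj.size))
    xi, tau = 0.5 * min(sizes), 0.2 * min(sizes)   # simplified per-frame thresholds
    return max(dists) < xi and float(np.std(dists)) < tau

def complete_tracklet(Ti, Tj, max_frames=20, window=20):
    """Equ. 6: extend Ti beyond its last frame using Tj's future positions plus
    the average offset between the two over the last `window` co-moving frames."""
    last = Ti.end_frame
    offsets = [np.subtract(Ti.responses[t].position, Tj.responses[t].position)
               for t in range(last - window + 1, last + 1)]
    mean_offset = np.mean(offsets, axis=0)
    completed = {}
    for t in range(last + 1, min(last + max_frames, Tj.end_frame) + 1):
        completed[t] = tuple(np.asarray(Tj.responses[t].position) + mean_offset)
    return completed
```

As stated above, the completed positions would be kept only if the extended tracklet actually reaches an entry/exit point.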

Though moving groups are also used in [10, 19], in those works group motions serve as constraints for each target while all members are clearly visible. In contrast, we use them to complete tracklets when no reliable observations are available.

Figure 8 shows an example of our tracklet completion approach. Persons 30 and 33 co-move for a long time. After frame 2475, the detector fails to find person 30 due to occlusion. However, based on the trajectory of person 33, we can estimate the missing trajectory of person 30, so that it reaches the exit point after completion.

6. Experiments

We applied our approach to three public data sets, PETS 2009, CAVIAR, and Trecvid 2008, which have been commonly used in previous multi-target tracking work. For fair evaluation, we use the same offline learned human detector responses as used in the compared approaches. Low level association is performed as in [5]: responses in neighboring frames with high similarity of position, size, and appearance are associated if they have very low similarity with all other responses.

For evaluation, we adopt the metrics commonly used in [20, 6]: recall and precision, showing detection performance after tracking; false alarms per frame (FAF); mostly tracked (MT) and mostly lost (ML), the ratio of tracks successfully tracked for more than 80% or less than 20% of their length, respectively; partially tracked (PT), equal to 1 − MT − ML; fragments (Frag), the number of times a ground truth trajectory is interrupted; and id switches (IDS), the number of times a tracked trajectory changes its matched id. All data used in our experiments are publicly available2.

6.1. Entry/exit map estimation

The estimated entry/exit maps for all five scenes used in our experiments are shown in Figure 9. We can see that, over time, our approach produces more and more precise estimates. Note that the maps are used for improving tracking, not for scene understanding; entry/exit points where no targets appear or disappear have no influence on tracking. Therefore, precision is more important than recall. Nearly all our estimated points are at correct positions. Some imprecise estimates occur in the top parts of the images in the second row, because humans are very small in the top parts of this scene and detectors fail there; the tracklets therefore often end before reaching the top parts. However, as there are almost no detection responses there, such estimates have little negative influence on tracking performance.

6.2. Performance on less crowded data sets

PETS 2009 and CAVIAR are two commonly used data sets for multi-target tracking. The scenes are not crowded, and people are sometimes occluded by other humans or other objects. The PETS 2009 data are the same as used in [1], but we modify the ground truth annotations so that people who are fully occluded for many frames but visible later are labeled with the same id. For CAVIAR, we test our approach on the same 20 video clips used in [5, 6].

The comparison results are shown in Tables 1 and 2. Our approach produces clear improvements: fragments are greatly reduced on both data sets, by over 50% and 35% respectively, while other scores remain competitive or improve. Some visual results are shown in Figure 10(a). Even for almost totally overlapping persons, our tracker does not confuse their identities and finds the correct associations.

6.3. Performance on the Trecvid 2008 data set

As the PETS 2009 and CAVIAR data sets are relatively easy, we show more results on the difficult Trecvid 2008 data set. There are frequent heavy occlusions, many non-linear motions, and interactions between people. The test videos cover three different scenes, with three 5000-frame video clips for each scene.

The quantitative comparison results are shown in Table 3.

2 http://iris.usc.edu/people/yangbo/downloads.html

Figure 9. Entry/exit map estimation results at different frames for each scene. The third column shows ground truths.

To show the effectiveness of each component of our approach, besides the final performance we also report three additional results in which only one component of our approach is activated. In addition, to assess the effectiveness of our entry/exit map estimation, we also report the performance using manually assigned maps.

From Table 3, we see that using any single component of our approach alone already outperforms the state-of-the-art approaches. Using all components, our approach reduces fragments by around 14.5% and id switches by 10.5% compared with the most recent performance reported in [6], while keeping other evaluation scores similar or improved. Using the manually assigned maps does not provide a large extra improvement, indicating the effectiveness of our entry/exit map learning; the few improvements mostly occur in early frames, when the estimated maps have not yet been well learned due to few confident tracklets. Though there are three video clips for each scene, the entry/exit maps are not shared and are learned separately for fair comparison.

Figure 10(b)-(d) show visual results of our approach. In Figure 10(b), a woman has non-linear motion and is almost fully occluded by person 170 around frame 4262.

Method                    Recall   Precision   FAF     GT   MT      PT      ML     Frag   IDS
Energy Minimization [1]   -        -           -       23   82.6%   17.4%   0.0%   21     15
PRIMPT [6]                89.5%    99.6%       0.020   19   78.9%   21.1%   0.0%   23     1
Our approach              91.8%    99.0%       0.053   19   89.5%   10.5%   0.0%   9      0

Table 1. Comparison of results on the PETS 2009 dataset. The PRIMPT results are provided courtesy of the authors of [6]. Our ground truth is more strict than that in [1].

Method         Recall   Precision   FAF     GT    MT      PT      ML     Frag   IDS
OLDAMs [5]     89.4%    96.9%       0.085   143   84.6%   14.7%   0.7%   18     11
PRIMPT [6]     88.1%    96.6%       0.082   143   86.0%   13.3%   0.7%   17     4
Our approach   90.2%    96.1%       0.095   143   89.1%   10.2%   0.7%   11     5

Table 2. Comparison of results on the CAVIAR dataset. The human detection results are the same as used in [5, 6], and are provided courtesy of the authors of [6].

Traditional linear motion assumptions cannot connect the tracklets before and after the occlusion; however, with the help of the online learned motion map, we successfully associate these tracklets into one. In Figure 10(c), person 41 is fragmented from frame 2075 to frame 2105 because of heavy occlusion, and his appearance changes from almost black to dark blue due to an illumination change. However, our MIL algorithm produces a high appearance similarity between the two tracklets and links them successfully. In Figure 10(d), our approach finds that persons 50, 51, and 54 form a moving group. After frame 1265, only person 51 is visible; however, we complete the trajectories of persons 50 and 54 according to the visible tracklet 51.

6.4. Computation speed

The speed is highly related to the number of targets in a video. Our approach is implemented in C++ on a PC with a 3.0GHz CPU and 8GB memory. The average speeds are 48 fps, 16 fps, and 6 fps for CAVIAR, PETS 2009, and Trecvid 2008, respectively. Compared with the 48 fps and 7 fps reported for CAVIAR and Trecvid 2008 in [6], our approach does not add much extra computational burden3. In our experiments, most of the computation is spent on feature extraction, followed by low-level association.

7. Conclusion

We described an online learning approach for multi-target tracking. We learn a non-linear motion map, describe a multiple instance learning algorithm based on estimated entry/exit points for better appearance models, and complete tracklets based on moving groups. Our approach improves performance considerably compared with state-of-the-art approaches, while adding little extra computation cost.

Acknowledgment

This paper is based upon work supported in part by the Office of Naval Research under grant number N00014-10-1-0517.

3 Detection time is not included in either our speed or that reported in [6].

References

[1] A. Andriyenko and K. Schindler. Multi-target tracking by continuous energy minimization. In CVPR, 2011.

[2] B. Babenko, M.-H. Yang, and S. Belongie. Visual tracking with online multiple instance learning. In CVPR, 2009.
[3] B. Benfold and I. Reid. Stable multi-target tracking in real-time surveillance video. In CVPR, 2011.
[4] L. Kratz and K. Nishino. Tracking with local spatio-temporal motion patterns in extremely crowded scenes. In CVPR, 2010.
[5] C.-H. Kuo, C. Huang, and R. Nevatia. Multi-target tracking by on-line learned discriminative appearance models. In CVPR, 2010.
[6] C.-H. Kuo and R. Nevatia. How does person identity recognition help multi-person tracking? In CVPR, 2011.
[7] Y. Li, H. Ai, T. Yamashita, S. Lao, and M. Kawade. Tracking in low frame rate video: A cascade particle filter with discriminative observers of different lifespans. In CVPR, 2007.
[8] W.-L. Lu, J.-A. Ting, K. P. Murphy, and J. J. Little. Identifying players in broadcast sports videos using conditional random fields. In CVPR, 2011.
[9] Z. Ni, S. Sunderrajan, A. Rahimi, and B. Manjunath. Particle filter tracking with online multiple instance learning. In ICPR, 2010.
[10] S. Pellegrini, A. Ess, and L. V. Gool. Improving data association by joint modeling of pedestrian trajectories and groupings. In ECCV, 2010.
[11] S. Pellegrini, A. Ess, K. Schindler, and L. V. Gool. You'll never walk alone: Modeling social behavior for multi-target tracking. In ICCV, 2009.
[12] A. G. A. Perera, C. Srinivas, A. Hoogs, G. Brooksby, and W. Hu. Multi-object tracking through simultaneous long occlusions and split-merge conditions. In CVPR, 2006.
[13] M. Rodriguez, I. Laptev, J. Sivic, and J.-Y. Audibert. Density-aware person detection and tracking in crowds. In ICCV, 2011.
[14] P. Scovanner and M. F. Tappen. Learning pedestrian dynamics from the real world. In ICCV, 2009.
[15] H. B. Shitrit, J. Berclaz, F. Fleuret, and P. Fua. Tracking multiple people under global appearance constraints. In ICCV, 2011.
[16] B. Song, T.-Y. Jeng, E. Staudt, and A. K. Roy-Chowdhury. A stochastic graph evolution framework for robust multi-target tracking. In ECCV, 2010.
[17] X. Song, X. Shao, H. Zhao, J. Cui, R. Shibasaki, and H. Zha. An online approach: Learning-semantic-scene-by-tracking and tracking-by-learning-semantic-scene. In CVPR, 2010.
[18] J. Xing, H. Ai, and S. Lao. Multi-object tracking through occlusions by local tracklets filtering and global tracklets association with detection responses. In CVPR, 2009.
[19] K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg. Who are you with and where are you going? In CVPR, 2011.
[20] B. Yang, C. Huang, and R. Nevatia. Learning affinities and dependencies for multi-target tracking using a CRF model. In CVPR, 2011.
[21] M. Yang, F. Lv, W. Xu, and Y. Gong. Detection driven adaptive multi-cue integration for multiple human tracking. In ICCV, 2009.
[22] X. Zhao and G. Medioni. Robust unsupervised motion pattern inference from video and applications. In ICCV, 2011.

Method                        Recall   Precision   FAF     GT    MT      PT      ML     Frag   IDS
CRF Tracking [20]             79.2%    85.8%       0.996   919   78.2%   16.9%   4.9%   319    253
OLDAMs [5]                    80.4%    86.1%       0.992   919   76.1%   19.3%   4.6%   322    224
PRIMPT [6]                    79.2%    86.8%       0.920   919   77.0%   17.7%   5.2%   283    171
Motion map only               79.5%    87.8%       0.853   919   75.1%   19.2%   5.7%   255    155
MIL only                      80.2%    87.8%       0.855   919   77.1%   17.7%   5.2%   250    167
Track completion only         80.1%    87.3%       0.895   919   76.7%   17.6%   5.7%   258    169
Our approach                  80.2%    87.5%       0.880   919   76.9%   17.6%   5.5%   242    153
Our approach w/ manual maps   79.5%    87.1%       0.903   919   77.0%   17.5%   5.5%   235    152

Table 3. Comparison of results on the TRECVID 2008 dataset. The human detection results are the same as used in [20, 5, 6], and are provided courtesy of the authors of [6].

Figure 10. Examples of tracking results of our approach on the PETS 2009, CAVIAR, and Trecvid 2008 data sets: (a) identities are not confused when targets are quite close; (b) non-linearly moving targets are tracked under heavy occlusions; (c) targets are tracked when appearances change considerably; (d) tracklets are completed to exit points using moving groups.
