Instance-Aware Representation Learning and Association for Online Multi-Person Tracking

Hefeng Wu^{a,b,f}, Yafei Hu^c, Keze Wang^d, Hanhui Li^e, Lin Nie^{a,*}, Hui Cheng^a

a Sun Yat-sen University, Guangzhou 510006, China
b Guangdong University of Foreign Studies, Guangzhou 510006, China
c Carnegie Mellon University, Pittsburgh, PA 15213, USA
d University of California, Los Angeles, CA 90024, USA
e Guilin University of Electronic Technology, Guilin 541004, China
f WINNER Technology, China

Abstract

Multi-Person Tracking (MPT) is often addressed within the detection-to-association paradigm. In such approaches, human detections are first extracted in every frame and person trajectories are then recovered by a procedure of data association (usually offline). However, their performance usually degenerates in the presence of detection errors, mutual interactions and occlusions. In this paper, we present a deep learning based MPT approach that learns instance-aware representations of tracked persons and robustly infers the states of the tracked persons online. Specifically, we design a multi-branch neural network (MBN), which predicts the classification confidences and locations of all targets by taking a batch of candidate regions as input. In our MBN architecture, each branch (instance-subnet) corresponds to an individual to be tracked, and new branches can be dynamically created to handle newly appearing persons. Then, based on the output of the MBN, we construct a joint association matrix that represents meaningful states of tracked persons (e.g., being tracked or disappearing from the scene) and solve it using the efficient Hungarian algorithm. Moreover, we allow the instance-subnets to be updated during tracking by online mining of hard examples, accounting for person appearance variations over time. We comprehensively evaluate our framework on a popular MPT benchmark, demonstrating its excellent performance in comparison with recent online MPT methods.

Keywords: Representation Learning, Online Tracking, Multi-Person Tracking, Data Association

1. Introduction

Multi-Person Tracking (MPT), as a key component of several intelligent applications such as autonomous driving and video surveillance, has attracted special attention beyond general object tracking. The goal of MPT is to estimate the states of multiple observed persons while preserving their identities under appearance variation over time. Existing MPT methods are mainly developed within the detection-to-association paradigm, where humans in each frame are usually detected by pre-trained classifiers and then associated to identify the trajectories of persons throughout video sequences. Recently proposed MPT methods have shown impressive performance improvements thanks to the development of object (pedestrian) detectors (e.g., deep learning based models). Nevertheless, the problem remains unsolved in complex scenes (see Fig. 1 for examples) due to the following reasons:

* Corresponding author: Lin Nie. Email addresses: [email protected] (Hefeng Wu), [email protected] (Lin Nie).

• Mutual interactions and occlusions of moving persons usually degrade the performance of human detectors, and the resulting false positive detections increase the difficulty of preserving person identities.

• It is quite difficult to handle ambiguities caused by person appearance and motion variations throughout sequences. Offline methods (i.e., those exploiting detections from a span of deferred observations) are often adopted, but they are not suitable for realistic applications (i.e., working with less observed data).

To address the aforementioned issues, in this work we propose to amend the traditional detection-to-association paradigm by learning instance-aware person representations. Unlike existing methods that usually employ generic (category-level) human detectors, our approach aims to assign each moving person a specific tracker to reduce ambiguities in complex scenes. Additionally, modern advances in deep feature representation learning [1, 2, 3] for object appearance have created new opportunities for MPT methods, which partially motivate us to learn instance-level object representations with deep neural nets.

Preprint submitted to Pattern Recognition, May 30, 2019.
arXiv:1905.12409v1 [cs.CV], 29 May 2019.


Figure 1: Ambiguities in multi-person tracking arise under complex scenarios such as unknown numbers of targets, mutual interactions, and occlusions over time.

Therefore, we develop a multi-branch neural network (MBN) that dynamically learns instance-level representations of tracked persons at a low cost, which facilitates robust online data association for multi-target tracking and thus gives rise to our INstance-Aware Representation Learning and Association (INARLA) framework.

The proposed MBN architecture consists of three main components: i) a shared backbone-subnet for extracting convolutional features of input regions, ii) a det-pruning-subnet for rejecting false regions among the human detection proposals, and iii) a variable number of instance-subnets for measuring the confidence of the remaining candidate regions with respect to the tracked targets. Each instance-subnet explicitly corresponds to an individual in the scene and can be updated online by mining hard examples. Moreover, new instance-subnets can be dynamically created to handle newly appearing targets. In this way, our MBN improves the trackers' robustness by adaptively capturing appearance variations of all the targets over time. It also helps to relieve the burden of the subsequent data association step. Traditional detection-to-association trackers usually rely on an expensive step for associating observed data with trajectories (identities) by establishing spatio-temporal coherence, especially the offline methods [4, 5]. In contrast, our INARLA framework handles this in a simple and efficient way, thanks to the MBN, which provides powerful instance-level affinity measures for the observed regions. Specifically, we construct a joint association matrix based on the outputs of the MBN. This matrix can be divided into four blocks that represent meaningful states of tracked persons (e.g., being tracked or disappearing from the scene), and it results in a standard assignment problem that can be solved efficiently by the Hungarian algorithm [6]. In sum, our approach handles the problem of online multi-person tracking with the following steps: i) initializing generic human detections in an input video frame; ii) pruning low-confidence human detections via the det-pruning-subnet; iii) predicting the location of each tracked individual via its corresponding instance-subnet; iv) inferring the states of all targets by constructing an association matrix with the results of steps ii) and iii); and v) updating the MBN network according to the inferred states of the targets.

The main contributions of this paper are summarized as follows. First, it presents a novel deep multi-branch neural network that enables dynamic instance-aware representation learning to address realistic challenges in multi-person tracking. Second, it presents a simple yet effective solver for data association based on the deep architecture, which is capable of inferring the states of tracked individuals in a frame-by-frame manner. Experimental results on a standard benchmark underline our method's favorable performance in comparison with existing multi-person tracking methods.

2. Related Work

In the literature, much effort has been dedicated to multi-object tracking (MOT); we review existing works according to their main technical components, i.e., object representation and data association.

2.1. Object representation

How to represent objects plays an important role in MOT for affinity computation or linking object detections across frames. Many different cues have been presented in the literature, e.g., appearance, location and motion.

Earlier MOT works mostly adopt hand-crafted features for object representation [7, 8, 9, 10]. Color histograms are commonly used to represent object appearance in multi-object tracking [7, 11], and histograms of oriented gradients (HOG) [8] are also a popular choice [12, 13]. In [9], optical flow, which reflects motion information, is incorporated for object representation. In addition, appropriate fusion of multiple cues can yield improved results [14, 15, 16]. Moreover, sophisticated machine learning techniques [11, 17] have been introduced to better describe object appearance. However, conventional object representation methods are often badly affected by challenging factors such as illumination variations, object deformation and background clutter, which limits their performance and generalization ability in complex scenarios.

Recently, researchers have actively learned object appearance features with deep learning based models due to their powerful representation learning ability, e.g., convolutional neural networks (CNNs) [18, 19] and recurrent neural networks (RNNs) [20, 21]. A fully convolutional neural network is adopted in [18] for object tracking, where features from top and lower layers, which characterize the target from different perspectives, are jointly used with a switch mechanism. In [20], a recurrently target-attending tracking method is presented, which attempts to identify and exploit reliable parts that are beneficial for the tracking process. However, these deep learning based methods mainly focus on single object tracking, with the object indicated in the first video frame. As for MOT, Leal-Taixe et al. [22] recently exploit a Siamese CNN for pairwise pedestrian similarity measurement in offline tracking,


Figure 2: Illustration of our INARLA framework for multi-person tracking. The left side is the architecture of the MBN network. The topmost branch (det-pruning-subnet) excludes false person detections, while the other branches (instance-subnets) predict their corresponding targets independently. Based on the outputs of the MBN, we propose an efficient algorithm to jointly infer the state of each person. Best viewed in colour.

while Gaidon and Vig [23] take advantage of convolutional features for online domain adaptation between instances and category in a Bayesian tracking framework. Different from these methods, in this paper we employ an MBN network for instance-aware object representations, in which a backbone-subnet is trained with a novel multi-task loss and instance-subnets are dynamically initialized from a det-pruning-subnet and trained discriminatively online.

2.2. Data association

To address the data association problem, existing MOT works can roughly be divided into two categories: offline methods [4, 5, 15] and online methods [14, 23, 24].

Most MOT methods belong to the first category and process the video in an offline way, where the data association is optimized over the whole video or a span of frames and requires future frames to determine objects' states in the current frame. Network flow-based MOT methods [25, 26] are typical of this category, and they generally solve the MOT problem using minimum-cost flow optimization. In [25], linking person hypotheses over time is formulated as a minimum cost lifted multicut problem. In order to track interacting objects well, Wang et al. [26] propose intertwined flows to handle this issue. Integer programming is also often used for formulating data association in MOT [27, 28]. In [27], a quadratic integer program is solved to local optimality by custom heuristics based on recursive search. Mixed integer programming is introduced to handle the interaction of multiple objects in [28]. In [29], a non-Markovian approach is proposed to impose global consistency by using behavioral patterns to guide the association. These offline methods generally yield better performance by incorporating future frames into the formulation and optimization, but this characteristic and the resulting high complexity also place great constraints on their application.

Online methods only use information up to the current frame and require no deferred processing, which makes them more practical for real-world applications. In [14], the data association between consecutive frames is formulated as bipartite matching and solved by structural support vector machines. Bae et al. [11] perform online multi-object tracking by a combination of local and global association based on tracklet confidence. Recently, more sophisticated learning methods have been introduced to handle this problem. In [30], the online association is modeled by a Markov Decision Process (MDP) with reinforcement learning. In [31], RNNs are employed to learn the data association from data for online multi-object tracking. While these recent works spend costly computation on online joint association, this paper introduces an efficient solver for the online association based on the outputs of the MBN network.

3. Instance-Aware Representation Learning

Our INARLA framework incorporates instance-aware representation learning into joint association for online multi-person tracking and can be combined with any human detector. As shown in Fig. 2, we train a multi-branch neural network (MBN) for instance-aware representation learning. In a new frame, our approach embeds the MBN network's outputs in an association matrix to jointly infer the objects' states, which are then fed back to the MBN network.

3.1. Multi-branch neural network

The architecture of our MBN network is illustrated in Fig. 2; it consists of three main components: a shared backbone-subnet, a det-pruning-subnet and a variable number of instance-subnets. The backbone-subnet is fully convolutional and can take an image of arbitrary size as input to extract convolutional features. Among the branch subnets, the det-pruning-subnet is designed to evaluate and reject the noisy person proposals from a public human detector and also to initialize instance-subnets, while each instance-subnet predicts the location of its tracked person and also outputs the confidence score of a candidate being the target.

We build the MBN network from the Fast R-CNN model [32] using CaffeNet [33]. We borrow the lower five layers from the Fast R-CNN architecture as our backbone-subnet, while the branch subnet structure is specially defined to accommodate our task. Different branch subnets have the same structure. In order to handle the online learning of tracked instances with few examples, we define a lightweight branch subnet architecture, which comprises a region-of-interest (RoI) pooling layer and three fully connected layers of sizes 256, 256 and 2, respectively.
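For concreteness, a minimal sketch of one branch subnet in PyTorch is given below. The original implementation is in Caffe, so this is only an illustration: the 6×6 pooled size and 1/16 spatial scale are assumptions inferred from the figure annotations and the CaffeNet conv5 stride, and all names are hypothetical.

```python
import torch
import torch.nn as nn
from torchvision.ops import RoIPool

class BranchSubnet(nn.Module):
    """Lightweight branch head: RoI pooling + three FC layers (256, 256, 2)."""

    def __init__(self, in_channels=256, pooled_size=6, spatial_scale=1.0 / 16):
        super().__init__()
        # pooled size and spatial scale are assumptions, not stated hyperparameters
        self.roi_pool = RoIPool((pooled_size, pooled_size), spatial_scale)
        self.fc = nn.Sequential(
            nn.Linear(in_channels * pooled_size * pooled_size, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 256), nn.ReLU(inplace=True),
            nn.Linear(256, 2),  # two-way target / non-target scores
        )

    def forward(self, feature_map, rois):
        # feature_map: (B, 256, H, W) backbone output; rois: (K, 5) = (batch_idx, x1, y1, x2, y2)
        pooled = self.roi_pool(feature_map, rois)
        return self.fc(pooled.flatten(1))

# toy usage: one image's feature map and two candidate boxes (image coordinates)
feat = torch.randn(1, 256, 38, 50)
rois = torch.tensor([[0.0, 10.0, 20.0, 90.0, 200.0], [0.0, 120.0, 30.0, 200.0, 220.0]])
scores = BranchSubnet()(feat, rois)   # shape (2, 2); a softmax gives p = (p0, p1)
print(scores.shape)
```

The det-pruning-subnet and every instance-subnet would share this head structure on top of the shared backbone features.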

3.2. Network learning

For concise description, we use F_bb to denote the backbone-subnet and F_i to denote the i-th branch subnet. The 0th branch is the det-pruning-subnet, and the i-th branch (i ≥ 1) is the i-th instance-subnet; the set of instance-subnets changes dynamically with the number of maintained persons. In addition, f_i denotes the network formed by the subnet F_bb and the i-th branch subnet F_i (i.e., f_i = F_bb + F_i).

The backbone-subnet F_bb is initialized from the Fast R-CNN model trained on the large-scale VOC datasets [32]. We initialize the det-pruning-subnet F_0 from zero-mean Gaussian distributions with standard deviation 0.01.

We train the network f_0 = F_bb + F_0 offline, and employ a multi-task loss L on each labeled RoI to jointly optimize for classification and distance metric embedding:

L = L_cls(p, u) + µ L(x⊥ | x+, X−)    (1)

where L_cls(p, u) = −log p_u is the log loss over two classes, p = (p_0, p_1) is computed by a softmax over the two outputs of the final fully connected layer, and u = 1 indicates the target and u = 0 otherwise.

Figure 3: Multi-task learning of the network f_0 = F_bb + F_0.

As illustrated in Fig. 3, we add an auxiliary subnet (in the dashed-line box), consisting of two fully connected layers with sizes 4096 and 1, respectively. A triplet-like loss is used:

L(x⊥ | x+, X−) = Σ_{x−∈X−} φ( D(H(x⊥), H(x−)) − D(H(x⊥), H(x+)) )

Here x⊥ and x+ are positive examples of the same human object (e.g., sampled nearby or at different frames), while X− denotes a set of negative examples. H(·) denotes the 4096-dimensional feature vector, and D(·, ·) is the squared L2 distance (i.e., D(H(x), H(y)) = ‖H(x) − H(y)‖²₂). The function φ(x) is defined as φ(x) = log₂(1 + 2^{−x}) [34].

This triplet-like loss drives similar (dissimilar) examples close to (apart from) each other in the feature space. Optimizing the multi-task loss of Eq. (1) makes the features extracted by the backbone-subnet suitable for discriminating both human/non-human objects and different humans, which is helpful for later instance-subnet training and prediction. To maintain the balance of positive and negative examples, we set the cardinality of X− to 2, so the batch size for optimization is a multiple of 4. The hyperparameter µ in Eq. (1) is set to 0.7 in our experiments.

During optimization, the gradients of the triplet-like loss L(x⊥ | x+, X−) with respect to the vectors H(·) can be calculated via the chain rule:

∂L/∂H(x⊥) = 2 Σ_{x−∈X−} ψ_c (H(x−) − H(x+))
∂L/∂H(x+) = 2 Σ_{x−∈X−} ψ_c (H(x⊥) − H(x+))
∂L/∂H(x−) = 2 Σ_{x−∈X−} ψ_c (H(x−) − H(x⊥))    (2)

where ψ_c = (1 + 2^{d_c})^{−1} and d_c = D(H(x⊥), H(x−)) − D(H(x⊥), H(x+)).
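As a concrete illustration, the following is a minimal NumPy sketch of the multi-task loss of Eq. (1): the two-class log loss plus µ times the triplet-like term, with φ(x) = log₂(1 + 2^{−x}). It assumes the 4096-d vectors H(·) and the two-way softmax p have already been computed; all names are illustrative, and in a deep-learning framework the gradients of Eq. (2) would be obtained by automatic differentiation.

```python
import numpy as np

def squared_l2(a, b):
    # D(H(x), H(y)) = ||H(x) - H(y)||_2^2
    d = a - b
    return float(np.dot(d, d))

def phi(x):
    # phi(x) = log2(1 + 2^(-x)), computed stably via logaddexp2
    return float(np.logaddexp2(0.0, -x))

def triplet_like_loss(h_anchor, h_pos, h_negs):
    """L(x_perp | x_+, X_-): sum over negatives of phi(D(anchor, neg) - D(anchor, pos))."""
    d_pos = squared_l2(h_anchor, h_pos)
    return sum(phi(squared_l2(h_anchor, h_neg) - d_pos) for h_neg in h_negs)

def multi_task_loss(p, u, h_anchor, h_pos, h_negs, mu=0.7):
    """Eq. (1): log loss over the two-class softmax p plus mu * triplet-like term."""
    l_cls = -np.log(p[u] + 1e-12)
    return l_cls + mu * triplet_like_loss(h_anchor, h_pos, h_negs)

# toy usage: |X-| = 2 negatives, as in the batch construction described above
rng = np.random.default_rng(0)
h_a, h_p = rng.normal(size=4096), rng.normal(size=4096)
negs = [rng.normal(size=4096) for _ in range(2)]
print(multi_task_loss(np.array([0.2, 0.8]), 1, h_a, h_p, negs))
```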

We train the network f_0 in a hard-example-mining scheme [35]. Specifically, we start with a dataset of positive examples and a random set of negative examples. The network f_0 is trained to convergence on this dataset and subsequently applied to a larger dataset to harvest false positives. The network is then trained again on the augmented training set with the false positives added. The auxiliary subnet is removed when training is finished.

In the test stage, the instance network f_i (i ≥ 1) is created dynamically by adding a new branch instance-subnet F_i, which is trained online when a person is newly detected. The new instance-subnet F_i is initialized from the subnet F_0 and further trained using only the classification loss L_cls(p, u), by setting µ = 0 in Eq. (1).

We collect N+ (= 500) positive samples and N− (= 256) negative samples. The intersection-over-union (IoU) overlap ratios of positive and negative samples with this target's detection bounding box are greater than θ_1 (= 0.5) and less than θ_0 (= 0.3), respectively. In addition, we collect N+ positive samples from every other object and use them as negative samples for this new target, to make its specific subnet more discriminative. When updating, we exploit hard negative examples for online training in the hard-example-mining scheme. Given a sample x, the score f_i(x) measures the similarity between the sample x and the person target i.
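A minimal sketch of the IoU-based sample selection described above, with the thresholds θ_1 = 0.5, θ_0 = 0.3 and the budgets N+ = 500, N− = 256, is given below; the candidate boxes are assumed to be jittered copies of the detection box, and all names are illustrative.

```python
import numpy as np

def iou(box_a, box_b):
    # boxes are (x1, y1, x2, y2); returns intersection-over-union
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def collect_samples(target_box, candidate_boxes, theta1=0.5, theta0=0.3,
                    n_pos=500, n_neg=256):
    """Split candidates into positives (IoU > theta1) and negatives (IoU < theta0)."""
    ious = [iou(target_box, b) for b in candidate_boxes]
    pos = [b for b, v in zip(candidate_boxes, ious) if v > theta1][:n_pos]
    neg = [b for b, v in zip(candidate_boxes, ious) if v < theta0][:n_neg]
    return pos, neg
```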

3.3. Instance prediction

In frame t, we apply the proposed MBN network for instance prediction. An instance-subnet independently predicts its target's location x_t, which consists of center coordinates (c_x, c_y), width l_w and height l_h. We sample Q candidates {s_t^k}_{k=1:Q}, varying in displacement and scale, for each target around its previous location x_{t−1}. Specifically, a candidate is denoted as s_t^k = (c_x + δ_x, c_y + δ_y, l_w · δ_l, l_h · δ_l), with (δ_x, δ_y, δ_l) drawn from a normal distribution whose mean is (0, 0, 1) and whose covariance is a diagonal matrix with diagonal vector σ_s. The candidates of target i are passed through the network f_i to obtain their scores {f_i(s_t^k)}. Most previous works select the candidate with the maximum score as the optimal location. However, this strategy yields unstable predictions, because our features are extracted from a downsampling layer: candidates with similar locations may be projected to the same region in the feature map and thus receive the same feature after RoI pooling. This instability is more drastic for small-sized objects. We use a simple and effective scheme to overcome this problem by averaging all the locations whose scores exceed α · max_{k=1:Q} f_i(s_t^k). The predicted location of target i is thus calculated as

x_t^i = mean({ s_t^k | f_i(s_t^k) > α · max_{k=1:Q} f_i(s_t^k) })    (3)
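A minimal NumPy sketch of the candidate sampling and the score-averaged prediction of Eq. (3) follows. The number of candidates Q and the reading of σ_s as the covariance diagonal (so standard deviations are its square roots) are assumptions; the scoring network f_i is passed in as a callable.

```python
import numpy as np

def sample_candidates(prev_box, q=256, sigma_s=(25.0, 25.0, 0.01), rng=None):
    """Draw Q candidates (cx, cy, w, h) around the previous location x_{t-1}."""
    rng = rng or np.random.default_rng()
    cx, cy, w, h = prev_box
    std = np.sqrt(sigma_s)                      # sigma_s holds the covariance diagonal
    dx = rng.normal(0.0, std[0], q)
    dy = rng.normal(0.0, std[1], q)
    dl = rng.normal(1.0, std[2], q)
    return np.stack([cx + dx, cy + dy, w * dl, h * dl], axis=1)

def predict_location(candidates, score_fn, alpha=0.75):
    """Eq. (3): average all candidates whose score exceeds alpha * max score."""
    scores = np.array([score_fn(c) for c in candidates])
    keep = scores > alpha * scores.max()
    return candidates[keep].mean(axis=0)
```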

4. Joint State Inference for Tracking

Figure 4: State transition of an individual (New → Tracked → Lost → Discarded).

Different states are employed to describe a person target in the video, and Fig. 4 shows the state transitions. A person in the "New" state has just been detected; a new identity is assigned to it (and a new instance-subnet initialized) before it transits to the "Tracked" state. When a "Tracked" person is considered not found in a frame, its state changes to "Lost". A "Lost" person is still maintained and continues to be looked for, and it transits back to the "Tracked" state if it is found again. However, if a "Lost" person stays in this state for a certain number of frames, it is changed to the "Discarded" state, and all its information (identity and instance-subnet) is removed. Based on the outputs of the MBN, we propose an efficient solver for the joint state inference.
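This bookkeeping can be summarized by a small per-target transition function; the following sketch is illustrative only (τ = 10 is the value used later in the experiments), and `matched` stands for the outcome of the joint inference described in the next subsections.

```python
from enum import Enum, auto

class State(Enum):
    NEW = auto()
    TRACKED = auto()
    LOST = auto()
    DISCARDED = auto()

def step(state, matched, lost_frames, tau=10):
    """One per-frame transition for a single target.

    matched:     whether the joint inference matched this target to an observation.
    lost_frames: consecutive frames spent in LOST so far.
    Returns (new_state, new_lost_frames).
    """
    if state in (State.NEW, State.TRACKED, State.LOST):
        if matched:
            return State.TRACKED, 0
        lost_frames += 1
        if lost_frames >= tau:                  # lost for too long: discard
            return State.DISCARDED, lost_frames
        return State.LOST, lost_frames
    return State.DISCARDED, lost_frames
```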

4.1. Joint association matrix construction

Assume that we maintain M tracked person targets and that there exist N new person observations in frame t after applying the proposed MBN network. Let {x_t^i}_{i=1:M} be the M targets' predictions and {z_t^j}_{j=1:N} (with f_0(z_t^j) ≥ 0.5) be the N person observations.

As shown in Fig. 5(a), a conventional association matrix can be constructed, with each element reflecting the pairwise relationship between a prediction and an observation. The association matrix is equivalent to a bipartite graph, with the predictions and observations as nodes and the matrix elements as edge weights. The association problem can thus be solved to obtain matching pairs with the lowest cost via graph optimization methods such as max-flow or the Hungarian algorithm. In our context, a prediction with a matched observation is considered successfully tracked, while a prediction (observation) with no match is considered lost (a new target). However, such an association matrix easily runs the risk of generating incorrect prediction-observation pairs.

Therefore, we propose to construct a novel joint association matrix C that bridges the joint association optimization with a standard assignment problem. In our formulation, as illustrated in Fig. 5(b), the rows and columns both comprise predictions and observations, so predictions (observations) can be assigned not only to their counterparts but also explicitly to themselves. In this way, the joint association matrix can be divided into 4 blocks, each having a meaningful interpretation when one of its elements is chosen (i.e., lost, tracked or new target).

To be specific, matrix C is defined as

C = [ Λ    Υ
      Υ^T  Γ ]    (4)

where C is an (M+N)×(M+N) square matrix, with row and column indices representing the M predictions and the N new observations. Matrix C is composed of 4 blocks, where an element chosen in the submatrix Λ (M×M), Υ (M×N) or Γ (N×N) implies that the corresponding target's state is judged as "Lost", "Tracked" or "New", respectively. Υ^T denotes the transpose of Υ. A function p_*(·, ·), * ∈ {Λ, Υ, Γ}, is introduced to measure the pairwise relationship; a larger value of p_*(·, ·) indicates a stronger correlation.

In block Λ, we define the elements as follows:

Λ_ij = p_Λ(x_t^i, x_t^j) if i = j,   and   Λ_ij = −∞ otherwise    (5)

Here, when a prediction is highly self-associated, we consider it to be lost. For two predictions of different person targets, we do not assign any coupling evidence and set the value to −∞.

In block Υ, we define the elements as follows:

Υ_ij = p_Υ(x_t^i, z_t^j)    (6)

where i ∈ {1, ..., M} and j ∈ {1, ..., N}. This definition indicates that a target is successfully tracked when it is highly coupled with a person observation.

In block Γ, we define the elements as follows:

Γ_ij = p_Γ(z_t^i, z_t^j) if i = j,   and   Γ_ij = −∞ otherwise    (7)

Similar to the definition of the elements in Λ, a person observation that highly associates with itself is considered a new target. We also do not assign any coupling evidence between any two person observations and set the corresponding value to −∞.

Figure 5: Illustration of the joint association matrix. (a) The conventional association matrix and its equivalent bipartite graph. (b) Our joint association matrix and its equivalent graph. {x_t^i}_{i=1:2} are instance predictions and {z_t^j}_{j=1:3} are person observations; they serve as nodes in the equivalent graph, and the matrix elements serve as edge weights. See text for explanations.

The essential issue is how to define the functions p_*(·, ·) so that the aforementioned requirements are satisfied. Many criteria based on multiple cues in the literature, such as appearance and motion, can be exploited. In this paper, we propose to use measurements tightly associated with our MBN network. We define p_*(·, ·) as the sum of two terms:

p_*(·, ·) = λ_* G_*(·, ·) + (1 − λ_*) B_*(·, ·),   * ∈ {Λ, Υ, Γ}    (8)

where G_* and B_* are related to the confidence and location outputs of the MBN network, respectively. The three parameters λ_*, * ∈ {Λ, Υ, Γ}, are preset constants.

In particular, we define

G_Υ(x_t^i, z_t^j) = f_i(z_t^j),   B_Υ(x_t^i, z_t^j) = IoU(x_t^i, z_t^j)    (9)

where f_i(z_t^j) denotes the confidence output when feeding observation z_t^j into the i-th instance network, and IoU(x_t^i, z_t^j) is an intersection-over-union function that returns the area ratio of the intersection to the union of the bounding boxes of x_t^i and z_t^j.

The terms G_Λ, B_Λ, G_Γ and B_Γ are then defined as follows:

G_Λ(x_t^i, x_t^i) = 1 − f_i(x_t^i),   B_Λ(x_t^i, x_t^i) = 1 − max_{k=1:N} IoU(x_t^i, z_t^k)    (10)

G_Γ(z_t^j, z_t^j) = 1 − max_{k=1:M} f_k(z_t^j),   B_Γ(z_t^j, z_t^j) = 1 − max_{k=1:M} IoU(x_t^k, z_t^j)    (11)

Specifically, Eq. (10) indicates that a target is considered self-associated (i.e., lost) when its own instance-subnet outputs low confidence and its predicted location is weakly coupled with the observations. Likewise, as implied by Eq. (11), a person observation is considered self-associated (i.e., a new object) when it retrieves low evidence from all available instance-subnets and their predicted locations. We note that the terms G_* and B_*, * ∈ {Λ, Υ, Γ}, all lie in the range [0, 1].
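A minimal NumPy sketch of assembling C from Eqs. (8)-(11) is given below. The array names and the dictionary of λ values are illustrative, and the −∞ entries are approximated by a large negative constant so that an assignment solver never selects them when an alternative exists.

```python
import numpy as np

NEG_INF = -1e9  # stands in for the -inf entries of Eqs. (5) and (7)

def build_joint_matrix(conf_pred, conf_obs_all, iou_po, lambdas):
    """Assemble the (M+N)x(M+N) joint association matrix C of Eq. (4).

    conf_pred:    length-M array, f_i(x_t^i) for each tracked target i.
    conf_obs_all: MxN array, f_i(z_t^j) for every target i and observation j.
    iou_po:       MxN array, IoU(x_t^i, z_t^j).
    lambdas:      dict with keys 'L', 'U', 'G' for lambda_Lambda, lambda_Upsilon, lambda_Gamma.
    """
    M, N = iou_po.shape
    # Upsilon block (tracked): Eq. (8) with G, B from Eq. (9)
    U = lambdas['U'] * conf_obs_all + (1 - lambdas['U']) * iou_po

    # Lambda block (lost): diagonal only, Eq. (10)
    g_l = 1.0 - np.asarray(conf_pred)
    b_l = 1.0 - iou_po.max(axis=1) if N > 0 else np.ones(M)
    L = np.full((M, M), NEG_INF)
    np.fill_diagonal(L, lambdas['L'] * g_l + (1 - lambdas['L']) * b_l)

    # Gamma block (new): diagonal only, Eq. (11)
    g_g = 1.0 - conf_obs_all.max(axis=0) if M > 0 else np.ones(N)
    b_g = 1.0 - iou_po.max(axis=0) if M > 0 else np.ones(N)
    G = np.full((N, N), NEG_INF)
    np.fill_diagonal(G, lambdas['G'] * g_g + (1 - lambdas['G']) * b_g)

    return np.block([[L, U], [U.T, G]])
```

With the values used in the experiments, `lambdas` would be `{'L': 0.2, 'U': 0.85, 'G': 0.4}`.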

4.2. Joint state inference

By constructing the joint association matrix, the joint tracking inference over all targets can be converted to an assignment problem: finding an optimal permutation vector y of {1, 2, ..., M+N}. The energy function is formulated as

y* = argmax_y Σ_{k=1:M+N} C(k, y_k)    (12)

where y_k ∈ {1, 2, ..., M+N} is the k-th element of y and C(k, y_k) denotes the matrix element in row k and column y_k of C. Let c_m be the maximum element of C, and replace each element C(i, j) with c_m − C(i, j) to obtain the matrix C'. Then Eq. (12) is equivalent to

y* = argmin_y Σ_{k=1:M+N} C'(k, y_k)    (13)

We solve this energy function efficiently via the Hungarian algorithm [6].
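As a sketch of this step, the assignment can be solved with SciPy's Hungarian-algorithm implementation and the chosen elements decoded into states; the function below is illustrative and assumes the matrix C produced as above.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def infer_states(C, M, N):
    """Solve Eqs. (12)/(13) on the joint matrix C and decode target/observation states.

    Returns (tracked_pairs, lost_targets, new_observations), where tracked_pairs
    holds (target index i, observation index j) matches.
    """
    # Convert the maximization of Eq. (12) into the minimization of Eq. (13)
    C_prime = C.max() - C
    rows, cols = linear_sum_assignment(C_prime)  # Hungarian algorithm

    tracked, lost, new = [], [], []
    for r, c in zip(rows, cols):
        if r < M and c >= M:            # element chosen in Upsilon: tracked
            tracked.append((r, c - M))
        elif r < M and c < M:           # diagonal of Lambda: lost
            lost.append(r)
        elif r >= M and c >= M:         # diagonal of Gamma: new target
            new.append(r - M)
        # elements in Upsilon^T mirror those in Upsilon and need no extra handling
    return tracked, lost, new
```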

We update the instance-subnet F_i when target i is in the "Tracked" state but f_i(x_t^i) < γ. For a person observation inferred as "New", a corresponding branch subnet is initialized. For a target i judged to be in the "Lost" state, if it has remained in this state for τ consecutive frames, it is transferred to the "Discarded" state; otherwise it continues to be predicted and participates in the joint inference in the next frame.

Algorithm 1 depicts the procedure of the proposed INARLA framework.

4.3. Assumption validation

There is a key assumption regarding the selection of elements in C. That is, we have to ensure that once elements in Υ are chosen, the symmetric elements in Υ^T must be chosen as well, because we incorporate both predictions and observations in the rows and columns, and thus a matched pair should take two symmetric elements simultaneously. Fortunately, due to the special structure of C, this assumption can be validated.

Let us take the joint matrix in Fig. 5(b) for explanation.


Algorithm 1: The overall procedure of our INARLA framework

Input: a video sequence V; the initial MBN: backbone-subnet F_bb and det-pruning-subnet F_0
Output: trajectories of targets T

1: Initialization: T ← ∅
2: for each frame t in V do
3:   Take the person proposals {z_j} from a public human detector
4:   Use F_0 to reject false detections from {z_j}
5:   for each maintained person i in T do
6:     F_i produces the predicted score and location (refer to Sec. 3.3)
7:   end for
8:   Construct the association matrix and infer the state of each target (refer to Sec. 4)
9:   Perform trajectory update of "Tracked" targets and initialization of "New" targets
10:  Update the MBN according to the state of each target
11: end for

It can be observed that the elements marked in red form a potential optimal solution, with each element occupying a distinct row and column and the chosen elements being symmetric. However, the two elements marked in green on the left together with the three elements marked in red on the right also seem to form a plausible optimal solution. We now show that this cannot happen in our formulation. Assume such an asymmetric solution were optimal. Let A_Υ be the sum of the elements chosen in Υ and A'_Υ be the sum of the elements chosen in Υ^T. If A_Υ > A'_Υ, we can obviously choose the elements in Υ^T that are symmetric to those chosen in Υ to obtain a better solution, which contradicts the optimality assumption; the case A_Υ < A'_Υ is analogous. The case A_Υ = A'_Υ is almost impossible because the matrix elements are floating-point numbers; in this extreme situation the problem has multiple optimal solutions, some of which are not expressible in our joint matrix. In practice, extensive experimental results show that the optimal solution is symmetric.

5. Experiments

5.1. Experimental settings

Dataset. The proposed method is evaluated on the 2D MOT 2015 benchmark dataset [36], which contains 11 sequences for training and 11 sequences for testing, filmed by both static and moving cameras in unconstrained environments. The MOT benchmark releases ground truth for the training sequences. The human detection results provided by the benchmark dataset, which were generated by the ACF detector [37], are used in our evaluation so as to provide a fair comparison with other MPT methods.

Evaluation metrics. Multiple metrics are used to evaluate tracking performance, as suggested by the MOT research community [38, 39], including Multiple Object Tracking Accuracy (MOTA, taking FN, FP and IDS into account), ID F1 Score (IDF1, the ratio of correctly identified detections over the average number of ground-truth and computed detections), Mostly Tracked targets (MT, the ratio of ground-truth trajectories that are covered by a track hypothesis for at least 80% of their respective life span), Mostly Lost targets (ML, the ratio of ground-truth trajectories that are covered by a track hypothesis for at most 20% of their respective life span), the total number of False Positives (FP), the total number of False Negatives (FN), the total number of ID Switches (IDS), the total number of times a trajectory is Fragmented (Frag), and processing speed (Hz, in frames per second, excluding the detector).

MBN architecture. As mentioned in Sect. 3.1, the structure of the backbone-subnet is the same as the lower five layers of CaffeNet used in Fast R-CNN [32]. Specifically, the five convolutional layers have 96 kernels of size 11×11, 256 kernels of size 5×5, 384 kernels of size 3×3, 384 kernels of size 3×3 and 256 kernels of size 3×3, respectively. The output feature maps of the first two convolutional layers are max-pooled (3×3 kernel) and normalized before being fed into the next layer. Moreover, the outputs of all five layers are immediately passed through a rectified linear unit (ReLU) before any pooling or normalization operation. Branch subnets, including the det-pruning-subnet and the instance-subnets, have the same structure, consisting of an RoI layer and three fully connected layers of sizes 256, 256 and 2, respectively.

Implementation details. Our algorithm is implemented in Python on the Caffe platform. The network f_0 (backbone-subnet with det-pruning-subnet) is trained on the training set from [36] for 40K SGD iterations, and the learning rate is lowered by 0.1× in the last 10K iterations. We double the learning rate when training an instance network for fast adaptation and run for 50 iterations. Images in both the training and testing phases are rescaled so that their shorter side is 600 pixels. We set λ_Λ = 0.2, λ_Υ = 0.85, λ_Γ = 0.4, σ_s = (25, 25, 0.01), α = 0.75, γ = 0.5 and τ = 10 in the experiments by empirical study. We further discuss important parameter settings in the ablation study (Sect. 5.3). Our algorithm runs on a PC with an 8-core 3.70 GHz CPU and a Tesla K40 GPU.

5.2. Benchmark evaluation

We compare our INARLA tracker with nine recent online MPT methods that have published their results on the 2D MOT 2015 benchmark, including TSDA OAL [43], RNN LSTM [31], OMT DFH [44], EAMTTpub [45], oICF [46], SCEA [24], MDP [30], DCCRF [47] and AM [48]. Among them, RNN LSTM, DCCRF and AM are deep learning-based methods. We also include three recent deep learning-based offline MPT methods (i.e., SiameseCNN [40], CNNTCM [41] and QuadMOT [42]) for comparison.


Table 1: Quantitative evaluation results on the 2D MOT 2015 benchmark.

Algorithm              MOTA(%)↑  IDF1(%)↑  MT(%)↑  ML(%)↓  FP↓    FN↓    IDS↓  Frag↓  Hz↑
SiameseCNN (2016)[40]† 29.0      34.3      8.5     48.4    5160   37798  639   1316   52.8
CNNTCM (2016)[41]†     29.6      36.8      11.2    44.0    7786   34733  712   943    1.7
QuadMOT (2017)[42]†    33.8      40.4      12.9    36.9    7898   32061  703   1430   3.7
TSDA OAL (2017)[43]    18.6      36.1      9.4     42.3    16350  32853  806   1544   19.7
RNN LSTM (2016)[31]    19.0      17.1      5.5     45.6    11578  36706  1490  2081   165.2
OMT DFH (2017)[44]     21.2      37.3      7.1     46.5    13218  34657  563   1255   28.6
EAMTTpub (2016)[45]    22.3      32.8      5.4     52.7    7924   38982  833   1485   12.2
oICF (2016)[46]        27.1      40.5      6.4     48.7    7594   36757  454   1660   1.4
SCEA (2016)[24]        29.1      37.2      8.9     47.3    6060   36912  604   1182   6.8
MDP (2015)[30]         30.3      44.7      13.0    38.4    9717   32422  680   1500   1.1
DCCRF (2018)[47]       33.6      39.1      10.4    37.6    5917   34002  866   1566   0.1
AM (2017)[48]          34.3      48.3      11.4    43.4    5154   34848  348   1463   0.5
INARLA (Ours)          34.7      42.1      12.5    30.0    9855   29158  1112  2848   2.6

† denotes offline methods.

Table 2: Object density (OPF) and tracking efficiency (FPS) of each sequence on the test set.

Sequence          Density  Speed    Sequence      Density  Speed
ETH-Crossing      4.6      6.1      ETH-Jelmoli   5.8      4.6
ETH-Linthescher   7.5      5.4      KITTI-19      5        4.2
TUD-Crossing      5.5      4.8      KITTI-16      8.1      3.2
ADL-Rundle-3      16.3     2.0      Venice-1      10.1     3.3
ADL-Rundle-1      18.6     1.3      PETS09-S2L2   22.1     1.0
AVG-TownCentre    15.9     1.0

Table 1 summarizes the quantitative comparison results, and the best result in each metric is marked in bold font. The up-arrow next to a metric indicates that higher values are better, while the down-arrow indicates that lower values are better.

Among these metrics, MOTA is an integrated metric that summarizes multiple aspects of tracking performance and is used by the MOT benchmark for ranking trackers. Our method achieves the highest MOTA among these recent methods, including the deep learning-based ones. Moreover, our method also achieves the best performance in terms of ML and FN, since our network remains robust in the presence of missing detections. This outstanding performance demonstrates the advantages of our MBN network and joint state inference solver. However, working in a frame-by-frame way, our method re-acquires targets judged as "Lost" many times, resulting in a high Frag value. This could be further addressed by introducing a proper post-processing strategy. Fig. 1 and Fig. 6 illustrate our tracking results on the test set of the MOT benchmark in static and dynamic scenes, respectively.

Our algorithm runs at around 2.6 frames per second without code optimization. Note that the number of tracked objects affects the running speed. Therefore, we show in Table 2 the relationship between the density (objects per frame, OPF) and the processing speed (frames per second, FPS) on each sequence of the test set. It can be inferred from Table 2 that the speed of a single instance tracker roughly ranges from 20 to 30 fps. Due to the properties of our MBN, we are confident that improved processing efficiency can be achieved by a parallel implementation of the branch subnets.

5.3. Ablation study

The contributions of the different components of our method are assessed on the 2D MOT 2015 benchmark. The ablation study is conducted on the training set because the annotations of the test set are not released and the benchmark webpage limits evaluation submissions (a user can only post a submission every three days and submit no more than 3 times in total). The 11 training sequences are partitioned into training and validation subsets to analyze the proposed algorithm, with 5 sequences (TUD-Stadtmitte, ETH-Bahnhof, ADL-Rundle-8, PETS09-S2L1, KITTI-13) for training and the rest for validation.

Table 3 reports the quantitative evaluation results of different versions of our MPT method in the ablation study. The results of the full version of our method, which contains all the proposed components, are shown in the last row of the table. Below we evaluate and analyze each component of the proposed MPT method in detail.

1) MBN network: The offline training of our MBN network is augmented with an auxiliary subnet in a multi-task optimization scheme, as described in Sect. 3.2, which aims to make the MBN network more discriminative for our MPT task. To evaluate its effectiveness, we remove the auxiliary subnet and set µ = 0 in Eq. (1) for offline model training; this version of our method is termed "no aux loss". From Table 3, we can observe that its MOTA performance drops by about 1%, with most of the other metrics also degraded. The increase in FP reveals that it admits more false human detections. These results demonstrate the positive role of the auxiliary subnet.


Figure 6: Our tracking results on representative MOTChallenge dynamic scenes, including ETH-Crossing, ETH-Jelmoli, ETH-Linthescher, KITTI-19 and ADL-Rundle-1, from top to bottom.

Table 3: Quantitative comparison of different versions of our method in the ablation study.

Version           MOTA(%)↑  IDF1(%)↑  MT(%)↑  ML(%)↓  FP↓   FN↓    IDS↓  Frag↓
no aux loss       40.2      48.3      20.9    33.9    3286  9233   223   510
no pruning        32.2      29.9      17.0    35.2    3549  10286  605   779
no update         38.7      45.5      18.3    34.8    3451  9404   216   488
only IoU          38.5      44.1      19.6    36.5    3164  9711   237   458
only confidence   36.4      43.1      16.5    39.1    3220  10066  267   464
balance learned   39.7      47.6      20.4    35.7    3370  9258   219   467
greedy            25.4      31.8      17.4    37.8    5579  9726   597   753
with vgg16        39.0      46.7      19.1    36.1    3058  9708   242   489
with vgg m        40.6      47.3      19.1    36.5    3031  9397   237   487
full              41.1      48.7      21.7    35.7    3097  9248   201   461

The "no pruning" version of our method denotes that our framework does not include the det-pruning-subnet, which aims to filter out false human detections. As can be observed, its MOTA drops dramatically to 32.2%, a decrease of 8.9%, and the FP metric increases from 3097 to 3549. A sharp performance degradation can be seen in most of the metrics, which demonstrates the significant effectiveness of the det-pruning-subnet.

The instance-subnets of our MBN network are dynamically added and trained online. They are also updated during tracking so as to adapt to appearance changes of the corresponding human instances. The "no update" version denotes that an instance-subnet will not be updated after it is trained. As shown in Table 3, the deterioration in all the metrics except ML reveals the importance of online updating.

2) Association matrix: The second group of rows in Table 3 evaluates the effectiveness of our data association component, which builds upon the constructed association matrix. As depicted by Eq. (8), the elements of the association matrix involve two terms (i.e., output confidence and IoU) and three parameters (i.e., λ_Λ, λ_Υ and λ_Γ). We carry out experiments to evaluate their influence on our method's performance.


The "only confidence" and "only IoU" versions of our method denote that Eq. (8) contains only the confidence- or IoU-related term, corresponding to setting λ_* = 1 and λ_* = 0 (* ∈ {Λ, Υ, Γ}), respectively. Performance degradation in all the metrics is witnessed in Table 3 for both of these versions. We can also infer that the IoU-related term has a larger impact on our method's performance, because "only IoU" performs better than "only confidence" on the evaluation metrics.

We further discuss the problem of balancing the two terms in Eq. (8), i.e., choosing the best values for the parameters λ_* (* ∈ {Λ, Υ, Γ}). A balance-learning scheme was tried to find the optimal parameter setting, designed as follows. Given an initial setting of λ_*, the proposed algorithm is run on the training set. We then check the ground truth for each pair in the function p_*(·, ·) in every frame, and the expected output of p_*(·, ·) is set to 1 if the pair is matched and 0 otherwise. We learn λ_* by minimizing the sum of squared errors between the actual and expected outputs. The process is executed for several iterations, with the learned value of λ_* as the new initial setting. The best results of this balance-learning scheme are shown in Table 3 as "balance learned". As can be seen, this scheme does not work very well: it performs worse than the "full" version, in which λ_* are set manually by empirical study. In future work, we will try new schemes to handle this problem.
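One way to realize the least-squares step of this balance-learning scheme is a closed-form fit per block; the sketch below is illustrative and assumes the G, B values and match labels of the observed pairs have been logged.

```python
import numpy as np

def fit_lambda(g, b, y):
    """Closed-form least-squares estimate of lambda for p = lambda*G + (1-lambda)*B.

    g, b: arrays of the confidence- and IoU-related terms for observed pairs.
    y:    expected outputs (1 for matched pairs, 0 otherwise).
    Minimizes sum((lambda*g + (1-lambda)*b - y)^2) over lambda, then clips to [0, 1].
    """
    g, b, y = map(np.asarray, (g, b, y))
    d = g - b
    denom = np.dot(d, d)
    if denom < 1e-12:            # G and B (nearly) identical: lambda has no effect
        return 0.5
    lam = np.dot(d, y - b) / denom
    return float(np.clip(lam, 0.0, 1.0))
```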

To further analyze the contribution of our association component, we replace it with a simple greedy association algorithm: in the association stage, a new person observation is assigned to the tracked target with which it has the largest bounding-box IoU. This version of our method is termed "greedy". As exhibited in Table 3, its performance worsens sharply in all the metrics, which in turn reveals the significant role of the proposed association component.

3) Choices of backbone-subnet: As described in Sect. 5.1, the backbone-subnet of our MBN network is CaffeNet, a small-scale neural network. Here we try other choices for the backbone-subnet to evaluate their impact on performance. Specifically, we use the vgg_cnn_m_1024 [49] and vgg16 [50] network models as the backbone-subnet. The vgg_cnn_m_1024 model is as deep as CaffeNet but wider, and the vgg16 model is very deep, with 16 layers. With these two models, the corresponding versions of our method are termed "with vgg m" and "with vgg16" in Table 3. It can be observed that "with vgg m" has almost the same performance as "full", with a 0.5% decrease in MOTA. However, "with vgg16" shows a larger degradation, with MOTA decreased by 2.1%. We visualized the tracking results and performed an in-depth analysis, and found that the "with vgg16" version did not work well on small-sized persons. This may be attributed to the fact that a small image region contains fewer appearance details, which are important for discriminating instances of the same category (e.g., human), and the feature extracted by the deep vgg16 model is less reflective of those details, since the vgg16 architecture induces a stronger reduction of subtle features (e.g., with more max-pooling layers than CaffeNet), as also reported in previous work [18, 19]. It is also worth noting that the "full", "with vgg m" and "with vgg16" versions run at about 2.7, 2.2 and 1.6 FPS on average on the validation set, respectively. The foregoing comparison reveals that the "full" version performs the best among the three versions in both accuracy and efficiency.

Figure 7: Analysis of the update threshold on the validation set (MOTA, IDF1, MT and ML, in percentage, plotted against the update threshold).

4) Update threshold: To evaluate the influence of the update threshold γ on our method's performance, we vary its value while fixing the values of the other parameters. The results are plotted in Fig. 7 for four metrics (MOTA, IDF1, MT and ML), expressed in percentage. As can be observed, the performance is best when the update threshold is around 0.5, but it does not exhibit a sharp change as the threshold varies.

6. Conclusion

In this paper, we have introduced a novel deep learning based online multi-person tracking approach that emphasizes instance-aware representation learning with the MBN network. While the backbone-subnet provides robust deeply-learned image features, the instance-subnets perform instance-level appearance discrimination to reduce ambiguities between different targets and relieve the burden of the subsequent data association. We construct an association matrix based on the outputs of the MBN network for joint state inference of the targets, where a simple yet effective solver is developed thanks to the powerful support from the MBN. The effectiveness of our approach is verified through extensive experimental comparison with recent MPT methods.

There are several directions in which we can improve the proposed INARLA framework in the future. First, the backbone-subnet of our MBN network will be enhanced to make its extracted features more robust and discriminative; our approach could handle small-sized objects better by making the feature extraction process adapt to different object sizes. Second, a more efficient model should be designed for the instance-subnet, because we found in experiments that online training and updating of instance-subnets often occupy more than half of the total processing time, even though the instance-subnet in our MBN network has a lightweight structure. Recent works show that correlation filter models can achieve good accuracy at high running speed in single object tracking; we will make in-depth attempts to incorporate such models into our MBN network, since they also involve convolution. Third, more effort will be devoted to the state inference procedure. We will investigate more effective terms for composing the elements of the association matrix and exploit new data association algorithms for the online MPT task. Moreover, we intend to extend our work to incorporate full category detection and form a unified framework.

Acknowledgement

This work was supported by the National Natural Science Foundation of China (61876045, U1811463), the Zhujiang Science and Technology New Star Project of Guangzhou (201906010057), the Major Program of the Science and Technology Planning Project of Guangdong Province (2017B010116003), and the Guangdong Natural Science Foundation (2016A030313285). The authors would like to thank Shiyi Hu and Xu Cai, who partly joined this work when they were graduate students at Sun Yat-sen University.

References

[1] L. Lin, G. Wang, W. Zuo, X. Feng, L. Zhang, Cross-domain visual matching via generalized similarity measure and feature learning, TPAMI 39 (6) (2016) 1089-1102.
[2] J. Xie, G. Dai, F. Zhu, E. K. Wong, Y. Fang, DeepShape: Deep-learned shape descriptor for 3D shape retrieval, TPAMI 39 (7) (2017) 1335-1345.
[3] Y. Wu, F. Yin, C. Liu, Improving handwritten Chinese text recognition using neural network language models and convolutional neural network shape models, Pattern Recognition 65 (2017) 251-264.
[4] A. Milan, L. Leal-Taixe, K. Schindler, I. Reid, Joint tracking and segmentation of multiple targets, in: CVPR, 2015, pp. 5397-5406.
[5] S. Wang, C. Fowlkes, Learning optimal parameters for multi-target tracking, in: BMVC, 2015.
[6] J. Munkres, Algorithms for the assignment and transportation problems, Journal of the Society for Industrial and Applied Mathematics 5 (1) (1957) 32-38.
[7] W. Choi, S. Savarese, Multiple target tracking in world coordinate with single, minimally calibrated camera, in: ECCV, 2010, pp. 553-567.
[8] N. Dalal, B. Triggs, Histograms of oriented gradients for human detection, in: CVPR, 2005, pp. 886-893.
[9] A. Andriyenko, K. Schindler, Multi-target tracking by continuous energy minimization, in: CVPR, 2011, pp. 1265-1272.
[10] J. Qian, J. Yang, G. Gao, Discriminative histograms of local dominant orientation (D-HLDO) for biometric image feature extraction, Pattern Recognition 46 (10) (2013) 2724-2739.
[11] S. H. Bae, K. J. Yoon, Robust online multi-object tracking based on tracklet confidence and online discriminative appearance learning, in: CVPR, 2014, pp. 1218-1225.
[12] B. Benfold, I. D. Reid, Stable multi-target tracking in real-time surveillance video, in: CVPR, 2011, pp. 3457-3464.
[13] H. Li, H. Wu, H. Zhang, S. Lin, X. Luo, R. Wang, Distortion-aware correlation tracking, IEEE TIP 26 (11) (2017) 5421-5434.
[14] S. Kim, S. Kwak, J. Feyereisl, B. Han, Online multi-target tracking by large margin structured learning, in: ACCV, 2012, pp. 98-111.
[15] P. Lenz, A. Geiger, R. Urtasun, et al., FollowMe: Efficient online min-cost flow tracking with bounded memory and computation, in: ICCV, 2015, pp. 4364-4372.
[16] H. Wu, C. Gao, Y. Cui, R. Wang, Multipoint infrared laser-based detection and tracking for people counting, Neural Computing and Applications 29 (5) (2018) 1405-1416.
[17] P. F. Felzenszwalb, R. B. Girshick, D. A. McAllester, D. Ramanan, Object detection with discriminatively trained part-based models, TPAMI 32 (9) (2010) 1627-1645.
[18] L. Wang, W. Ouyang, X. Wang, H. Lu, Visual tracking with fully convolutional networks, in: ICCV, 2015, pp. 3119-3127.
[19] H. Li, H. Wu, S. Lin, X. Luo, Coupling deep correlation filter and online discriminative learning for visual object tracking, Journal of Computational and Applied Mathematics 329 (2018) 191-201.
[20] Z. Cui, S. Xiao, J. Feng, S. Yan, Recurrently target-attending tracking, in: CVPR, 2016.
[21] P. Ondruska, I. Posner, Deep tracking: Seeing beyond seeing using recurrent neural networks, in: AAAI, 2016, pp. 3361-3368.
[22] L. Leal-Taixe, C. Canton-Ferrer, K. Schindler, Learning by tracking: Siamese CNN for robust target association, in: CVPR Workshops, 2016.
[23] A. Gaidon, E. Vig, Online domain adaptation for multi-object tracking, in: BMVC, 2015, pp. 1-13.
[24] J. H. Yoon, C.-R. Lee, M.-H. Yang, K.-J. Yoon, Online multi-object tracking via structural constraint event aggregation, in: CVPR, 2016.
[25] S. Tang, M. Andriluka, B. Andres, B. Schiele, Multiple people tracking by lifted multicut and person re-identification, in: CVPR, 2017.
[26] X. Wang, E. Turetken, F. Fleuret, P. Fua, Tracking interacting objects using intertwined flows, TPAMI 38 (11) (2016) 2312-2326.
[27] B. Leibe, K. Schindler, L. J. V. Gool, Coupled detection and trajectory estimation for multi-object tracking, in: ICCV, 2007, pp. 1-8.
[28] X. Wang, E. Turetken, F. Fleuret, P. Fua, Tracking interacting objects optimally using integer programming, in: ECCV, 2014, pp. 17-32.
[29] A. Maksai, X. Wang, F. Fleuret, P. Fua, Non-Markovian globally consistent multi-object tracking, in: ICCV, 2017, pp. 2563-2573.
[30] Y. Xiang, A. Alahi, S. Savarese, et al., Learning to track: Online multi-object tracking by decision making, in: ICCV, 2015, pp. 4705-4713.
[31] A. Milan, S. H. Rezatofighi, A. R. Dick, K. Schindler, I. D. Reid, Online multi-target tracking using recurrent neural networks, arXiv (2016) abs/1604.03635.
[32] R. Girshick, Fast R-CNN, in: ICCV, 2015, pp. 1440-1448.
[33] A. Krizhevsky, I. Sutskever, G. E. Hinton, ImageNet classification with deep convolutional neural networks, in: NIPS, 2012, pp. 1097-1105.
[34] H. Yun, P. Raman, S. V. N. Vishwanathan, Ranking via robust binary classification, in: NIPS, 2014, pp. 2582-2590.
[35] A. Shrivastava, A. Gupta, R. Girshick, Training region-based object detectors with online hard example mining, in: CVPR, 2016.
[36] L. Leal-Taixe, A. Milan, I. Reid, S. Roth, K. Schindler, MOTChallenge 2015: Towards a benchmark for multi-target tracking, arXiv (2015) abs/1504.01942.
[37] P. Dollar, R. Appel, S. Belongie, P. Perona, Fast feature pyramids for object detection, TPAMI 36 (8) (2014) 1532-1545.
[38] K. Bernardin, R. Stiefelhagen, Evaluating multiple object tracking performance: the CLEAR MOT metrics, EURASIP Journal on Image and Video Processing (2008) 1-10.
[39] E. Ristani, F. Solera, R. S. Zou, R. Cucchiara, C. Tomasi, Performance measures and a data set for multi-target, multi-camera tracking, in: ECCV Workshops, 2016, pp. 17-35.
[40] L. Leal-Taixe, C. Canton-Ferrer, K. Schindler, Learning by tracking: Siamese CNN for robust target association, in: CVPR Workshops, 2016.
[41] B. Wang, L. Wang, B. Shuai, Z. Zuo, T. Liu, K. L. Chan, G. Wang, Joint learning of convolutional neural networks and temporally constrained metrics for tracklet association, in: CVPR Workshops, 2016.
[42] J. Son, M. Baek, M. Cho, B. Han, Multi-object tracking with quadruplet convolutional neural networks, in: CVPR, 2017, pp. 3786-3795.
[43] J. Ju, D. Kim, B. Ku, D. K. Han, H. Ko, Online multi-person tracking with two-stage data association and online appearance model learning, IET Computer Vision 11 (2017) 87-95.
[44] J. Ju, D. Kim, B. Ku, D. K. Han, H. Ko, Online multi-object tracking with efficient track drift and fragmentation handling, J. Opt. Soc. Am. A Opt. Image Sci. Vis. 34 (2) (2017) 280-293.
[45] R. Sanchez-Matilla, F. Poiesi, A. Cavallaro, Online multi-target tracking with strong and weak detections, in: ECCV Workshops, 2016, pp. 84-99.
[46] H. Kieritz, S. Becker, W. Hubner, M. Arens, Online multi-person tracking using integral channel features, in: AVSS, 2016, pp. 122-130.
[47] H. Zhou, W. Ouyang, J. Cheng, X. Wang, H. Li, Deep continuous conditional random fields with asymmetric inter-object constraints for online multi-object tracking, IEEE TCSVT (2018) 1-12, online available.
[48] Q. Chu, W. Ouyang, H. Li, X. Wang, B. Liu, N. Yu, Online multi-object tracking using CNN-based single object tracker with spatial-temporal attention mechanism, in: ICCV, 2017, pp. 4846-4855.
[49] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman, Return of the devil in the details: Delving deep into convolutional nets, in: BMVC, 2014.
[50] K. Simonyan, A. Zisserman, Very deep convolutional networks for large-scale image recognition, arXiv (2014) abs/1409.1556.
