
Identifying People in Camera Networks using Wearable Accelerometers

Thiago Teixeira, Deokwoo Jung, Gershon Dublon and Andreas Savvides
Yale University

10 Hillhouse Ave
New Haven, CT

[email protected]

ABSTRACT

We propose a system to identify people in a sensor network. The system fuses motion information measured from wearable accelerometer nodes with motion traces of each person detected by a camera node. This allows people to be uniquely identified with the IDs of the accelerometer nodes that they wear, while their positions are measured using the cameras. The system can run in real time, with high precision and recall results. A prototype implementation using iMote2s with camera boards and wearable TI EZ430 nodes with accelerometer sensorboards is also described.

Categories and Subject Descriptors

C.3 [Special-Purpose and Application-Based Systems]: Real-time and embedded systems; I.4 [Image Processing and Computer Vision]: Scene Analysis—Sensor fusion, Tracking

General Terms

Measurement, Design, Experimentation

Keywords

Unique identification, Consistent labelling, Association problem

1. INTRODUCTION

A large obstacle to the deployment of assisted-living systems in multiple-person or family homes is the problem of differentiating between people and uniquely identifying them in order to properly attend to their individual needs. For this reason, much of the current assisted-living technology focuses on single-person scenarios, and often breaks as soon as visitors are invited into the home. Additionally, multiple-person homes present complex privacy requirements for assistive technologies, in the sense that only those who voluntarily choose to utilize the system should have any

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PETRA'09, June 9-13, 2009, Corfu, Greece. Copyright 2009 ACM ISBN 978-1-60558-409-6 ...$5.00.

Figure 1: Sensor nodes used in the prototype system. Left: TI EZ430-RF2480 ZigBee node with accelerometer sensorboard. Right: Intel Mote2 with custom camera board.

private information captured and stored. Meanwhile, cameras are becoming increasingly popular sensors in assistive environments, given their long sensing range and ability to measure distinct information modalities (such as location, pose, motion path and ambient lighting). However, the problem of associating detected people across multiple image frames as well as robustly identifying them solely with visual features is still a topic of much research in computer vision.

In this paper we present a real-time system that identifies people in camera networks with high accuracy through the use of wearable accelerometer nodes with known IDs (Figure 1). We bypass the computer vision correspondence problem by matching local motion signatures from wearable accelerometers with those observed from infrastructure cameras. This way, we are able to obtain the location of each person using the camera detections and to estimate their identities by matching their motion characteristics. The advantage of this approach is that it provides reliable operation at reduced cost for assistive applications. Instead of relying on expensive video analytics to identify people, we make use of very limited information from the cameras and construct a unique modality pair by coupling cameras and accelerometers through wireless links. An overview of this process is shown in Figure 2, and more detail is provided in Sections 3 and 4. As described in the evaluation section (Section 5), the system runs in real time with high precision and recall metrics. We start the paper with a discussion of related work (Section 2), followed by a description of the problem in Section 3.

2. RELATED WORK

Accelerometers and cameras are often combined to track the camera motion, generally for use in robot navigation [1],


Figure 2: To associate IDs with each detected person, we find the best match between measured body acceleration and each person's motion in the image plane. To obtain this motion information, tracks must be formed from combinations of detected people in the image sequence. Since the number of tracks grows exponentially, the main issue that must be solved is how to keep the track count down.

and virtual or augmented reality [2]. In such cases, the accelerometer is placed on the same rigid object as the camera, which moves in relation to its environment. This contrasts with the setup described in this paper, where the camera is stationary and accelerometers are placed on the moving people in the camera's field-of-view (FOV) in order to identify them.

In the literature, people in sensor networks have been localized and identified using wearable sensors such as ID badges [3][4], the combination of ultrasound with radios [5], radio signal properties [6], and inertial sensing units [7]. Ultrasound-based approaches require bulky nodes that consume relatively large quantities of energy. Although ID badges present low spatial resolution when used by themselves, motion models [3] or additional sensing modalities can be used to improve spatial accuracy [4]. Radio signal strength localization is subject to many random factors in uncontrolled environments, such as antenna orientation [8]. Radio Doppler-shift has been used to localize targets [6], but requires a large number of infrastructure nodes. In [7], accelerometers and magnetometers are used along with ID sensors. This approach requires knowledge of the building map to constrain the location of the particles in their particle filters. One similarity among many of the multi-person solutions is that they require an exact association between people detected in one frame and people detected in the next. This problem is known to be NP-hard [9]. One of the seminal works in this area is the multiple-hypothesis tracking algorithm [10]. When only the position of each detected person is used to perform this association, this is called the motion correspondence problem and is the subject of much research [11]. Other times, additional image features (such as size, color, shape or motion gradient) [12][13] or motion models are used to offer additional clues regarding frame-to-frame associations, but usually with mixed results in uncontrolled environments. In contrast, the algorithm presented here is lightweight, does not make assumptions about motion models, and does not require an exact solution of the association problem as input.

3. PROBLEM DESCRIPTION

The problem we solve in this paper is the matching of the locations of people detected with a camera network to their accelerometer signals, in order to obtain location-ID pairs. The core of the identification problem can be described as finding the matching between accelerometers and detected locations that maximizes a similarity measure. Therefore,

if Z_k is the set of all accelerometer measurements at time k, and X_k is the set of all detected locations, then at each time k we must find the match matrix M_k according to the expression below:

\arg\max_{M_k} \sum_{i=1}^{|Z_k|} \sum_{j=1}^{|X_k|} f(z_k^i, x_k^j)\, M_k^{ij} \quad (1)

where z_k^i is the i-th accelerometer measurement at time k, and x_k^j is the j-th detected position at that time. Note that the index i of the accelerometer measurements is the ID of the node that transmitted them, while the j's are random internal IDs of each detected person without actual physical relevance. The match matrix M_k is a matrix of size |Z_k| × |X_k| describing the associations between accelerometers and detected people in the image frame. Since the same person cannot be wearing two accelerometers, and the same accelerometer cannot be in more than one place at a time, M_k must follow a few constraints:

M_k^{ij} = \begin{cases} 1 & \Rightarrow\ z_k^i,\, x_k^j \text{ are associated} \\ 0 & \Rightarrow\ \text{no association} \end{cases} \quad (2)

M_k^{ij} = 1 \;\Rightarrow\; \begin{cases} M_k^{i\ell} = 0 \quad \forall\, \ell \in [1, |X_k|],\ \ell \neq j \\ M_k^{\ell j} = 0 \quad \forall\, \ell \in [1, |Z_k|],\ \ell \neq i \end{cases} \quad (3)

Despite the brief definition, the problem that is targeted in this paper cannot be solved by directly associating accelerometers to detected people as described in Equation 1. Instead, the two types of measurements (accelerations and positions) must be brought to a common representation in order to be compared, which, as will be described later, leads to an exponentially complex problem. As shown in Figure 2, our solution is divided into two parts:

Identification — In order to match the motion data from the wearable accelerometers with detections from the camera nodes, we transform each into a signal that is proportional to the person's floor-plane acceleration. We then measure their similarity by computing their correlation coefficient. This is described in Section 3.1. To obtain these acceleration measurements from the detected locations, however, we must first obtain a time series of locations for each person in the scene (tracks). This is called the multi-dimensional association problem, and it is known to be NP-hard [9].

Association — Rather than solving the association problem, we make use of the fact that correct tracks must belong to real people in the scene, and therefore must correlate with some accelerometer (exactly one, in fact). We use this to


Figure 3: Base case described in Section 3.1, where a single person is in the camera's field of view, while two accelerometers are within communication range. Signals proportional to the person's speed are compared, and the best matching accelerometer is found to have ID = 2.

Figure 4: Superimposition of aligned signals from accelerometer and camera, showing the approximate proportionality between them.

approximate the multi-dimensional association problem as a one-dimensional association with polynomial complexity. This is described in Section 3.2.

In the following subsections we describe two base-case scenarios from which our solutions to the identification and association stages are derived.

3.1 Base-Case for Identification Problem: 1 Person in FOV, 2 Accelerometers

Consider the scenario where there is a single person in the camera FOV while two accelerometer nodes can be heard through the wireless channel (Figure 3). If it is known a priori that there is only one person in the FOV, then it is simple to create a time history of the person's locations, as there are no frame-to-frame association ambiguities: the person detected in image frame I_k at time k is always associated to the person detected in frame I_{k+1}. This reduces the association problem to a trivial step and allows us to focus solely on the identification. This section describes the process by which we compare and match signals in order to identify people in the FOV.

Position measurements from the camera contain instantaneous information with an absolute frame of reference in space, and with no association among previous measurements (no frame of reference in time). Meanwhile, acceleration measurements obtained from the body-mounted accelerometer nodes have no spatial frame of reference, but have a clearly defined temporal frame of reference. To find the similarity between these two signals, we must first convert them into a common representation, or intermediary format. This process is known as data alignment [14]. We align the two signals to the same temporal frame of reference by associating all position measurements that belong to the same person into a time series. From this time series, the person's acceleration in the image plane can be easily extracted by double differentiation, as shown in Figure 3. We also align the accelerometer measurements into a signal proportional to the overall body acceleration by calculating the magnitude of the 3D acceleration vector and finding the envelope of the signal to remove noise caused by the stepping motion and by accelerometer-bouncing artifacts [15][16]. Figure 4 shows the similarity between two matching signals that were processed in this manner. These two signals are proportional to the person's floor-plane acceleration, and, therefore, also proportional to one another.

Figure 5: (a) Correlation between 5 tracks and 2 accelerometer signals, showing the large difference between correct and incorrect matches. (b) Histogram with sampling distribution of correlation coefficient. A clear threshold separates correct matches from incorrect ones.
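The alignment step described above (double differentiation of image-plane positions on the camera side; magnitude plus envelope on the accelerometer side) can be sketched in numpy. This is a minimal illustration, not the paper's implementation: the function names, the frame rate, the mean-subtraction used to remove the gravity offset, and the sliding-maximum envelope window are our own illustrative choices.

```python
import numpy as np

def beta_align(track_xy, fps=15):
    """Camera-side alignment: double-differentiate a track of image-plane
    positions (n x 2 array) to get a signal proportional to floor-plane
    acceleration."""
    vel = np.diff(track_xy, axis=0) * fps   # pixels/s
    acc = np.diff(vel, axis=0) * fps        # pixels/s^2
    return np.linalg.norm(acc, axis=1)      # acceleration magnitude per frame

def alpha_align(acc_xyz, win=8):
    """Accelerometer-side alignment: magnitude of the 3D acceleration
    vector, then a sliding-maximum envelope to suppress stepping and
    sensor-bouncing artifacts."""
    mag = np.linalg.norm(acc_xyz, axis=1)
    mag = np.abs(mag - mag.mean())          # crude removal of the gravity offset
    pad = np.pad(mag, (win // 2, win - win // 2 - 1), mode='edge')
    return np.array([pad[i:i + win].max() for i in range(len(mag))])
```

Once trimmed to a common length, both outputs are (approximately) proportional to the person's floor-plane acceleration and can be compared with Pearson's correlation coefficient.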

If α and β are the functions which align accelerometer and camera signals into the same common representation, then the similarity g(·, ·) between the two signals can be calculated by detecting whether the two signals are proportional using Pearson's correlation coefficient r:

g(z_k^i, \theta_k^\ell) = r(\alpha(z_k^i), \beta(\theta_k^\ell)) \quad (4)

where θ_k^ℓ is a track containing a time series of consecutive person detections θ_k^ℓ = (x_{k−n}^{ℓ_0}, ..., x_k^{ℓ_n}) with n ∈ ℕ and 0 < n < k. Figure 5(a) shows the experimental value of the correlation coefficient between 5 tracks and 2 accelerometer signals. The correct matches can be easily seen by their strong correlations.
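The base-case identification can be illustrated with synthetic data: correlate each candidate track signal against each accelerometer signal and keep matches whose correlation clears a threshold. The signals below are made up; correct matches are simulated as noisy scaled copies (mimicking the proportionality between aligned signals), and the threshold 0.55 is the τ_r value the paper reports in Section 4.2.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 120                                    # samples per aligned signal

# Two "accelerometer" signals and five candidate "track" signals;
# tracks 0 and 3 are noisy scaled copies of the accelerometer signals.
accel = rng.random((2, n))
tracks = rng.random((5, n))
tracks[0] = 3.0 * accel[1] + 0.1 * rng.random(n)
tracks[3] = 2.0 * accel[0] + 0.1 * rng.random(n)

# Pearson correlation between every (accelerometer, track) pair.
r = np.array([[np.corrcoef(a, t)[0, 1] for t in tracks] for a in accel])

tau_r = 0.55                               # threshold from Section 4.2
matches = {i: int(np.argmax(r[i])) for i in range(len(accel))
           if r[i].max() > tau_r}
print(matches)                             # accelerometer ID -> track index
```

Uncorrelated pairs produce |r| on the order of 1/sqrt(n) here, so the correct matches stand out clearly above the threshold, as in Figure 5(a).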

Note, however, that g : Z_k × Θ_k → ℝ from Equation 4 has a different domain than the similarity function f from Equation 1. To assign IDs to detected people using g, the maximization problem in Equation 1 must be modified to use tracks rather than person detections:

\arg\max_{\Omega_k} \sum_{i=1}^{|Z_k|} \sum_{\ell=1}^{|\Theta_k|} g(z_k^i, \theta_k^\ell)\, \Omega_k^{i\ell} \quad (5)

where Θ_k = {θ_k^ℓ} is the set of all tracks at instant k, and Ω_k is a match matrix associating accelerometer signals to tracks. The matrix Ω_k follows similar rules as M_k (Equation 3), but it additionally does not allow the same detected person to be assigned to multiple tracks at any time instant. So Ω_k must follow the additional rule that for any two elements equal to 1, the corresponding tracks must have an empty intersection:

\Omega_k^{i\ell_1} = \Omega_k^{i\ell_2} = 1,\ \ell_1 \neq \ell_2 \;\Longrightarrow\; \theta_k^{\ell_1} \cap \theta_k^{\ell_2} = \emptyset \quad (6)


Figure 6: Limiting the number of track hypotheses through gating. When a single detection is within the gate, identification can be done directly (Section 3.1). Otherwise, association must take place (Section 3.2).

We call this the "strong no-intersection" property, which will be relaxed in Section 4 in order to approximate the solution for real-time operation.

3.2 Base-Case for Association Problem: 2 People in FOV with Accelerometers

In the previous section we outlined a signal-comparison method to identify people given a trivial base scenario. In the same vein, in this section we will employ a base-case scenario to describe a method by which accelerometer measurements can be used to influence and simplify association decisions. To understand the underlying problem, consider the situation where it is known that exactly two people are in the FOV, and they are within a large distance of one another. Then it is easy to infer their position histories (association) by creating tracks connecting each detection at frame k to the only detection at k + 1 that is within a physically plausible speed threshold — a process known as gating.

However, if the two people approach one another (Figure 6), tracking ambiguities arise, giving rise to multiple competing track hypotheses. If all possible track hypotheses are considered by the tracking algorithm, then due to combinatorial explosion the complexity of the problem quickly becomes unmanageable. This is shown in Figure 7, where two people meet for 6 time instants (at 15 frames per second this corresponds to 0.4 s), generating more than 64 hypotheses. If it is known that there are exactly N people in the FOV, then the number of hypotheses after K ambiguous frames is (N!)^K. If people are allowed to enter/leave, and a realistic detector is assumed (with the possibility of false positive detections), then the number is even larger.
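The (N!)^K growth claimed above is easy to make concrete. The sketch below is only a back-of-envelope illustration under the idealized assumption that every frame-to-frame assignment of all N people is ambiguous for K consecutive frames.

```python
from math import factorial

def hypotheses(n_people, k_ambiguous_frames):
    """Number of track hypotheses when every frame-to-frame assignment
    of N people is ambiguous for K frames: (N!)^K."""
    return factorial(n_people) ** k_ambiguous_frames

# Two people meeting for 6 frames (0.4 s at 15 fps), as in Figure 7:
print(hypotheses(2, 6))   # (2!)^6 = 64
# Three people for the same 6 frames:
print(hypotheses(3, 6))   # (3!)^6 = 46656
```

Even one additional person turns a fraction of a second of ambiguity into tens of thousands of hypotheses, which is why the paper bounds hypothesis growth rather than enumerating it.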

This association problem can be described as selecting the set of tracks that, at each time instant k, globally minimizes some distance metric h:

\arg\min_{\Phi_k} \sum_{\ell=1}^{|\Theta_{k-1}|} \sum_{j=1}^{|X_k|} h(\theta_{k-1}^\ell, x_k^j)\, \Phi_k^{\ell j} \quad (7)

Figure 7: Base-case for Section 3.2. After this 6-sample-long ambiguity, the number of track hypotheses grows as an exponential of a factorial.

where Φ_k is a match matrix that follows the same rules as M in Equation 3 (which causes the constructed tracks to naturally follow the strong no-intersection rule from Equation 6). From Φ_k, the set of tracks Θ_k^* which solves Equation 7 for time instant k is directly obtained. The simplest similarity metric for track-to-location association is the Euclidean distance between the track's latest location x_{k-1}^{\ell_n} and the detection x_k^j:

h(\theta_{k-1}^\ell, x_k^j) = \mathrm{dist}(x_{k-1}^{\ell_n}, x_k^j) \quad (8)

where n is the track length. This is often called nearest-neighbor association.
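Gated nearest-neighbor extension (Equation 8 plus the h(·, ·) < τ_θ gate) can be sketched in a few lines. This is an illustrative fragment, not the paper's tracker: the gate radius, the array layouts, and the policy of leaving a track unextended when no detection is in the gate are our own simplifying choices.

```python
import numpy as np

def extend_tracks(tracks, detections, gate=30.0):
    """Extend each track with its nearest detection (Equation 8),
    but only if that detection falls inside the gate radius."""
    extended = []
    for tr in tracks:
        d = np.linalg.norm(detections - tr[-1], axis=1)  # distances to last point
        j = int(np.argmin(d))
        if d[j] < gate:                  # gating: h(., .) < tau_theta
            extended.append(np.vstack([tr, detections[j]]))
        else:
            extended.append(tr)          # no physically plausible continuation
    return extended

tracks = [np.array([[0.0, 0.0], [5.0, 0.0]])]
dets = np.array([[9.0, 1.0], [80.0, 80.0]])
out = extend_tracks(tracks, dets)
print(out[0][-1])                        # nearest in-gate detection appended
```

When two or more detections fall inside the gate, this greedy rule is exactly where the ambiguities of Figure 6 arise, which the paper handles by keeping multiple hypotheses instead of committing to one.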

As described before, it is not tractable to exactly solve Equation 7 due to the expansive number of tracks. Luckily, to identify the people in the scene it is not necessary to solve this complex association problem, since the no-intersection property is handled later in the process by the maximization in Equation 5. So we bypass this problem by generating several conflicting track hypotheses (Θ_k), rather than finding the best non-conflicting solution (Θ_k^*). The set Θ_k of track hypotheses is defined to contain all tracks that pass a goodness criterion:

\Theta_k = \left\{ \theta_k^\ell \in \Theta_{k-1} \times X_k \;:\; h(\theta_{k-1}^\ell, x_k^j) < \tau_\theta,\ \exists\, z_k^i \in Z_k \mid g(z_k^i, \theta_k^\ell) > \tau_r \right\} \quad (9)

where τ_θ and τ_r are thresholds that filter out bad hypotheses. Thus, only tracks within the gate are considered (h(·, ·) < τ_θ). Here, the similarity measure g is used in a manner analogous to the use of additional image attributes (size, color, shape) and motion models that are usually employed in multiple-target trackers. In this case, we keep only the tracks that can be explained to some degree by at least one of the accelerometer signals. This is described in greater detail in Section 4.2. Of course, image and motion attributes from the literature can be used in addition to the accelerometer signal, for increased robustness if necessary.

4. PERSON-IDENTIFICATION ALGORITHM

As our person-identification formulation is composed of two interconnected parts (association and identification), we design our algorithm as a cycle consisting of two blocks: a tracker and a comparator.

The tracker generates a set Θ_k of tracks from sequences of person detections, filtering them according to the parameters that rely solely on track properties (i.e. the τ_θ filter from Equation 9). The comparator is in charge of performing the maximization in Equation 5, taking Θ_k as input, and pruning tracks that do not pass the τ_r filter.

The comparator then passes the set Θ′_k of filtered tracks back into the tracker as input. The output of the algorithm is found by using the Hungarian method [17] to solve the one-dimensional association problem from Equation 5 with complexity O(max(|X_k|, |Θ_k|)^3).

Figure 8: Overview of the entire algorithm showing the interactions of its two main logical blocks: a tracker generates a small set of track hypotheses from the large pool of possible associations; and a comparator solves the association problem described in Equation 5, assigning IDs to each detected person.
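The final one-dimensional assignment can be sketched with SciPy's implementation of this class of algorithm, `scipy.optimize.linear_sum_assignment`. The similarity values below are made up for illustration; since the routine minimizes cost while Equation 5 maximizes similarity, the matrix is negated.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Illustrative similarity g(z_k^i, theta_k^l) for 2 accelerometers x 3 tracks
# (higher means better correlation; values are invented for this example).
g = np.array([[0.10, 0.92, 0.20],
              [0.88, 0.15, 0.30]])

# linear_sum_assignment minimizes total cost, so negate to maximize similarity.
rows, cols = linear_sum_assignment(-g)
assignment = dict(zip(rows.tolist(), cols.tolist()))
print(assignment)   # accelerometer ID -> best track index
```

Here the optimal assignment pairs accelerometer 0 with track 1 and accelerometer 1 with track 0, the combination with the largest total similarity.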

This cycle is the core of our method, and is summarizedin Figure 8.

Although in the problem description we discussed the similarity between each track and each accelerometer, g(z_k^i, θ_k^ℓ), this is not exactly how things take place in the identification algorithm. Instead, each track is marked as belonging to a single accelerometer, which is the only one it will ever be compared to. The reason for this is that the correlation coefficient requires the two input signals to be of the same length. When each track is created, we must bootstrap the sufficient statistics to compute its correlation with each specific accelerometer. This also allows us to keep the complexity low, since tracks that have historically not correlated well with a given accelerometer can be pruned and never compared to that accelerometer again.

Other than this, the algorithm takes two main approaches to allow real-time operation: (1) it simplifies the accelerometer-to-track assignment problem in Equation 5 by weakening the track no-intersection property; (2) it restrains the number of track hypotheses to a minimum through several means.

4.1 Relaxing the No-Intersection Property

Since our algorithm aims to provide the best immediate results without the intention to reconstruct past traces, we relax the strong no-intersection constraint of Equation 6 to require only that the newest position measurements in each matched track do not intersect. That is, the following weak no-intersection constraint is used instead:

\Omega_k^{i\ell_1} = \Omega_k^{i\ell_2} = 1,\ \ell_1 \neq \ell_2 \;\Longrightarrow\; x_k^{j_1} \in \theta_k^{\ell_1} \neq x_k^{j_2} \in \theta_k^{\ell_2} \quad (10)

Although this relaxes the strong no-intersection property of Equation 6, the similarity measure g used in the identification (Equation 5) guarantees that tracks correlate well with their matched accelerometer. So, as long as the motion of the people in the scene is not too similar and synchronized with one another, most tracks selected by Equation 10 will still be strongly non-intersecting. In the case that their motion is correlated, then it is not possible to identify them based on motion characteristics alone, whether strong no-intersection is enforced or not. Hence, this simplification has little negative effect on the quality of the tracks, while greatly limiting the problem's complexity.

4.2 Adjustments to Control the Number of Hypotheses

Combinatorial Contention — When there are ambiguous situations, such as in Figure 7, the number of tracks grows exponentially. In order to contain this growth, we only resolve ambiguities after the people move apart. For this, the algorithm keeps track of the number of people inside each track's gate (a circle of radius R). If the number is greater than one, then the track is marked as being ambiguous. Otherwise, it is marked as unambiguous. Each ambiguous track θ_{k−1}^ℓ gets extended into time k as θ_k^ℓ by assigning it the closest detection x_k, rather than forking into one track for each within-gate detection. When a track transitions from ambiguous to non-ambiguous, however, it is forked for each detection inside a gate with radius 2R. If N_{2R} is the number of people in the 2R gate, then, instead of ending up with (N!)^K tracks as before, each track splits into just N_{2R} alternatives, most of which are pruned within a few seconds by a track-pruning process.

Pruning Tracks and Allowing "Leaving" — If a track correlates badly with all accelerometer signals, then it cannot belong to an accelerometer-wearing person, and should be pruned. Figure 5(b) shows a histogram of the correlation values of correct and incorrect accelerometer-to-track assignments. It is clear from the plot that the two can be easily distinguished, and that a threshold value τ_r ≈ 0.55 can be used for this purpose. However, as shown in Figure 9(a), the correlation r between an accelerometer and a track takes a few seconds to converge. Oftentimes the correct accelerometer-track association has a poor correlation (< τ_r) for the first few seconds, which can cause correct tracks to be prematurely pruned.

For this reason, we compute the estimated correlation error as a function of track age by using confidence intervals. But since Pearson's correlation coefficient does not have a Gaussian sampling distribution, we must first convert it with Fisher's z′ transformation, for which confidence intervals can be calculated:

z'(r) = \tfrac{1}{2} \ln\left[ (1 + r)/(1 - r) \right] \quad (11)

The standard error of z′ is known to be SE = 1/\sqrt{n-3}, where n is the number of samples used in the computation of the correlation. With this, we compute the 90% confidence interval of z′ as ranging from z′_low to z′_high:

z'_{low}(r) = z'(r) - \frac{1.645}{\sqrt{n-3}}, \qquad z'_{high}(r) = z'(r) + \frac{1.645}{\sqrt{n-3}} \quad (12)

where the number 1.645 comes from the 90% confidence interval of a Normally distributed random variable (i.e. 90% of the density is within 1.645 standard deviations from the mean). Equation 9 is, then, modified to apply the τ_r threshold on z′_high instead. That is, g(·, ·) > τ_r becomes z′_high(·) > z′(τ_r). This way, the only tracks that get pruned are those where there is a 95% confidence that the track does not correlate above τ_r (95% because the threshold acts on a single-sided confidence interval). For comparison, Figure 9(b) shows the z′ and confidence intervals for the signals from Figure 9(a).
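The pruning rule built from Equations 11 and 12 can be sketched as follows. The function names are ours; the constants (1.645, τ_r = 0.55, the minimum track size of 4 samples) come from the text above.

```python
import math

def fisher_z(r):
    """Fisher z' transform of a Pearson correlation (Equation 11)."""
    return 0.5 * math.log((1 + r) / (1 - r))

def should_prune(r, n, tau_r=0.55):
    """Prune a track only when even the upper edge of the 90% CI of z'
    falls below z'(tau_r), i.e. 95% one-sided confidence (Equation 12)."""
    if n <= 4:                  # standard error needs n - 3 > 0 samples
        return False
    z_high = fisher_z(r) + 1.645 / math.sqrt(n - 3)
    return z_high < fisher_z(tau_r)

print(should_prune(0.30, 10))   # short track, wide CI -> kept (False)
print(should_prune(0.30, 400))  # long track, tight CI -> pruned (True)
```

The same weak correlation of 0.30 is tolerated on a young track but condemns an old one, which is exactly the convergence-aware behavior Figure 9 illustrates.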

Since correlations of longer signals have a smaller standard error, they are inherently more trustworthy. We, therefore, prioritize longer tracks by using z′_low instead of r in Equation 4. So if two tracks have the same r (and, hence, the same z′) the older track will be given a higher weight in g, since z′_low will be higher for the older track. With this change, Equation 4 becomes:

g(z_k^i, \theta_k^\ell) = z'\left[\, r(\alpha(z_k^i), \beta(\theta_k^\ell)) \,\right] - \frac{1.645}{\sqrt{n-3}} \quad (13)

Figure 9: Top: Since the correlation takes time to converge, the use of a fixed threshold for track-pruning can result in the correct track being prematurely deleted. Bottom: We guard against premature pruning by applying the threshold on the top margin of the 90% confidence interval, resulting in 95% confidence pruning.

Note that, as the standard error cannot be computed for tracks smaller than 4 samples, we only allow a track to be pruned if its size is greater than 4. Given that new tracks are created at the end of each ambiguous period, this causes the number of tracks to depend on the number of ambiguities.

Faster Error Recovery and Allowing "Entering" — When a new person enters the camera FOV, a new track must be created for comparison with each accelerometer. Similarly, when a new accelerometer is detected, it must be included for comparison with each existing track. For this reason, the algorithm always keeps at least one track for each accelerometer–location combination; if one does not exist, it is created. This can happen either because a new person or accelerometer has been detected ("entering") or because an existing track has been pruned. The end result is that tracks that may or may not represent a correct ground-truth trace are constantly created (and constantly pruned, if they do not pass the τ_r threshold). This ensures that there is always one alternative for each accelerometer–location assignment, which allows for quick recovery in case a correct track becomes associated with the wrong detection due to tracking errors. This puts a lower bound of |Z_k| × |X_k| on the number |Θ_k| of track hypotheses at any time k. When there are no ambiguities, the track-pruning process ensures that this lower bound is reached. Hence, for most real-world cases, the average number of track hypotheses is expected to stay close to |Z_k| × |X_k|.
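The bookkeeping described above can be sketched as follows. The data structure is an assumption for illustration: tracks are keyed by (accelerometer, location) pairs, whereas real tracks would also carry their correlation state.

```python
def replenish_tracks(tracks, accel_ids, location_ids):
    """Ensure at least one track hypothesis exists for every
    (accelerometer, location) pair, creating missing ones. A pair can be
    missing because a new person or accelerometer appeared ("entering")
    or because its track was pruned earlier."""
    for a in accel_ids:
        for l in location_ids:
            if (a, l) not in tracks:
                tracks.add((a, l))
    return tracks
```

Running this every frame maintains exactly the |Z_k| × |X_k| lower bound on hypotheses when no ambiguities are pending.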

count           2       3       4       5
precision       0.951   0.875   0.790   0.627
recall          0.956   0.931   0.887   0.821
proc. time (s)  2.57    6.29    11.02   16.74
ambig./pers.    38.25   68.57   93.90   116.6
avg. tracks     3.92    8.81    15.63   24.39
max. tracks     8       15      24      30

Table 1: Experimental results for the algorithm when 2, 3, 4 or 5 people are in the FOV at the same time.

5. EVALUATION
We first performed a set of experiments in which data was gathered with a wide-angle USB camera and an off-the-shelf inertial measurement unit. These were used to verify the correctness of the algorithm independently of implementation-dependent effects, such as the performance of the person detector or of the network layer. A second set of experiments was performed using the iMote2 sensor node with our custom camera board [18], as well as TI EZ430-RF2480 nodes equipped with a SparkFun IMU 5DOF board containing an Analog Devices ADXL330 accelerometer (Figure 1). The purpose of these is to demonstrate the viability of the system in actual multiple-person deployments. For all of these experiments, the cameras were mounted on a 3 m high ceiling, facing down. This gives a total area of 3 m × 2 m in which people are entirely contained in the FOV; this is the area within which the people were asked to stay. The accelerometer nodes were placed on the front of each person's belt. The orientation of the accelerometer is unimportant, given that it is the magnitude of the 3D acceleration vector that is used in the similarity metric.
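The orientation independence follows from using the norm of the 3D acceleration vector rather than its individual axes; a trivial sketch:

```python
import math

def accel_magnitude(ax, ay, az):
    """Orientation-independent magnitude of a 3-D acceleration sample.
    Rotating the sensor permutes or mixes (ax, ay, az) but leaves the
    Euclidean norm unchanged."""
    return math.sqrt(ax * ax + ay * ay + az * az)
```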

We captured five separate videos, with corresponding accelerometer traces, of a single person walking in a room for approximately 1 minute. The person detector used in this experiment located the person in the scene by comparing each frame to an image of the empty room (background subtraction). Since the traces were captured separately in a static, controlled environment, we were able to obtain high-precision image-plane coordinates for each person by calculating the center of mass (centroid) of the foreground pixels. The accelerometers were sampled at 100 Hz, and the camera at 15 Hz. Time was roughly synchronized by hand, by visually matching the features of an acceleration-magnitude plot for each accelerometer to a plot of the corresponding centroid's speed.
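A sketch of the background-subtraction centroid described above, assuming grayscale frames as NumPy arrays; the intensity threshold of 30 is an assumption, not a value from the paper:

```python
import numpy as np

def foreground_centroid(frame, background, threshold=30):
    """Centroid (center of mass) of foreground pixels obtained by
    subtracting a static background image from the current frame.
    `frame` and `background` are 2-D uint8 grayscale arrays."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    ys, xs = np.nonzero(diff > threshold)
    if len(xs) == 0:
        return None  # empty room: no person detected
    return xs.mean(), ys.mean()  # image-plane (x, y) coordinates
```

This suffices for the controlled single-person captures; the iMote2 prototype described later instead segments blobs with connected-component analysis.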

[Figure 10 plots: for each of the four accelerometers, three panels over 0–60 s — "Track Assignments" (ground-truth vs. assigned centroid ID), "Ambiguity resolution" (incorrect / unambiguous / ambiguous / correct), and "Hypothesis correlation (z′)".]

Figure 10: Output of the algorithm for a 4-person trace. The x-axis is time in seconds. The ambiguity resolution and hypothesis correlation plots for accelerometers 2, 3 and 4 are omitted to conserve space.

We ran the algorithm on all 2-person, 3-person, 4-person and 5-person combinations of the five traces. The centroid traces were overlaid onto the same image plane, and the centroids' internal indices were randomly shuffled for each frame. We additionally simulated people entering and leaving the field of view at random times while still being in range for accelerometer sensing. This was done by randomly cropping the beginning and end of the centroid traces while leaving the accelerometer traces intact. For all of these, the ground-truth frame-by-frame associations and absolute person IDs were known, given that the traces were acquired separately. Using the ground-truth data, we calculated the following metrics. Precision answers the question: when the system identifies a person, how often is the ID assignment correct? The precision is calculated as TP/(TP + FP), where TP is the number of true positives (correct assignments) and FP is the number of false positives (incorrect assignments). Recall answers the question: when a given person is in the scene, how often does the system correctly identify him or her? The recall is calculated as TP/(TP + FN), where FN is the number of false negatives, that is, the number of times a person was deemed absent when actually present.
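The two metrics follow directly from the counts; a minimal sketch:

```python
def precision_recall(tp, fp, fn):
    """Precision = TP/(TP+FP): when an ID is assigned, how often it is
    correct. Recall = TP/(TP+FN): when a person is present, how often
    he or she is correctly identified. Guards against empty counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```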

The averaged experimental values of these two metrics are shown in Table 1. The algorithm shows strong performance for 2, 3 and 4 people. For 5 people, the precision falls under 0.7, but the recall stays high throughout. The measured processing time is, as expected, proportional to the average number of tracks. Since the number of tracks stays within the predicted value of |Z_k| × |X_k|, the processing time remains nearly constant over time. Also note that the system always executed many times faster than real time.

The number of ambiguous frames per person (ambig./pers.) is also reported. Were the tree-pruning process not present, the expected number of tracks would be on the order of |X_k|^(ambig./pers.). This is at least 2^38.25 = 3.26 × 10^11 for the 2-person case, and much more for the others. The correctness of the algorithm depends more strongly on the ambig./pers. rate than on the number of people in the FOV per se.

Figure 10 shows the output of the algorithm for an example trace in which data for 4 people were overlaid. The track-assignment plots show the ground-truth ID of the centroid that was associated by the algorithm with each accelerometer. For an accelerometer with ID = A (where A is some integer), it is desirable for the plot to be a constant line at y = A. This is often the case after enough time has passed for the correlations to converge, as seen in the plots. Meanwhile, the ambiguity-resolution plot (shown only for person 1, due to space restrictions in this paper) shows how often ambiguities occur (usually spanning multiple frames at a time), and whether the algorithm resolves them correctly or incorrectly. On average, ambiguities were correctly resolved 80.72% of the time. For the remaining 19.28%, where ambiguity resolution failed, the algorithm eventually found the correct assignment through the correlation metric; that is, the algorithm is able to automatically recover from incorrect hypotheses. Finally, the third type of plot in the figure, the hypothesis-correlation plot, shows the z′ metric of the selected hypothesis (thick blue line) compared to that of the losing hypotheses for the same accelerometer (light blue); the hypotheses for other accelerometers are shown in light gray. Note how, after ambiguous periods, small tracks fork from the correct one. They are quickly pruned by the combinatorial contention process described in Section 4.2. Videos of these experiments are available at http://enaweb.eng.yale.edu/drupal/InertialIdentification.

To assess the viability of the system as an online sensor-network service, we also tested a prototype implementation consisting of an iMote2 camera node mounted on the ceiling and two people carrying wearable EZ430 sensor nodes with accelerometers. The centroid of each foreground blob was extracted by segmenting the blobs through 8-neighbor connected-component analysis. As expected, this often resulted in the typical blob-merging and splitting artifacts that are a product of small occlusions and visual similarity with the background scene. Detections were collected into packets containing pairs of centroids and timestamps, and transmitted wirelessly to a base node. The whole process ran at a rate of around 15 Hz on the sensor node, fluctuating based on the number of people in the FOV. The wearable nodes used in the experiment were programmed to sample the accelerometer at a rate of 50 Hz, calculate the signal envelope locally, and transmit it to the base through their ZigBee radios. The collected data was then parsed on a nearby computer, resulting in precision and recall measurements comparable to those in Table 1. This prototype system demonstrates that it is possible to identify people using acceleration and camera measurements under non-ideal real-world sensing conditions (including false positives, false negatives and other types of misdetections), as well as under the constraints of limited local processing and networking bandwidth.
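The on-node envelope computation could look like the following sketch. Both the envelope method (per-sample magnitude followed by a sliding maximum) and the window length are assumptions; the paper only states that the envelope is computed locally at 50 Hz before transmission.

```python
import math

def magnitude_envelope(samples, window=5):
    """Sliding-window envelope of 3-axis accelerometer samples:
    compute each sample's magnitude, then take a moving maximum over
    the last `window` samples. `samples` is a list of (x, y, z) tuples;
    the window length is an illustrative assumption."""
    mags = [math.sqrt(x * x + y * y + z * z) for (x, y, z) in samples]
    return [max(mags[max(0, i - window + 1):i + 1]) for i in range(len(mags))]
```

Transmitting only the envelope (rather than raw 50 Hz samples) is one way to respect the limited networking bandwidth mentioned above.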

6. CONCLUSION
We have presented a system that uses infrastructure camera nodes and wearable accelerometers to identify people in a sensor network, achieving good precision and recall. Other than memory and processing requirements, there is no limit on the number of tag-wearing people or the number of people in the FOV. We have also described a set of approximations that allow for real-time execution. Although these approximations increase the number of incorrect matches immediately following ambiguous periods, experimental results show the algorithm is able to quickly recover.

Possible improvements include utilizing additional image features for increased robustness against ambiguities. By coupling this system with color histograms, for example, better detection rates should be easily achieved. Future work includes expanding the algorithm to make use of multiple cameras as a single seamless sensor, as well as considering deployments where there are large gaps in camera coverage. Before the system can be used in a long-term deployment, power consumption and network utilization must be properly analyzed. To this end, it may be possible to devise an adaptive sampling and transmission scheme that preprocesses accelerometer samples and transmits them only when they can significantly impact the correlation metric.

Acknowledgments
This work was partially funded by the National Science Foundation under projects CNS 0448082 and CNS 0725706. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation (NSF).

7. REFERENCES
[1] K. Park, D. Chung, H. Chung, and J. G. Lee, "Dead reckoning navigation of a mobile robot using an indirect Kalman filter," in IEEE/SICE/RSJ International Conference on Multisensor Fusion and Integration for Intelligent Systems, Dec. 1996, pp. 132–138.

[2] M. S. Keir, C. E. Hann, J. G. Chase, and X. Q. Chen, "A new approach to accelerometer-based head tracking for augmented reality & other applications," Sept. 2007, pp. 603–608.

[3] N. Shrivastava, R. Mudumbai, U. Madhow, and S. Suri, "Target tracking with binary proximity sensors: fundamental limits, minimal descriptions, and algorithms," in SenSys '06: Proceedings of the 4th International Conference on Embedded Networked Sensor Systems, New York, NY, USA, 2006, pp. 251–264, ACM Press.

[4] D. Schulz, D. Fox, and J. Hightower, "People tracking with anonymous and id-sensors using Rao-Blackwellised particle filters," in Proceedings of the International Joint Conference on Artificial Intelligence, 2003.

[5] A. Savvides, C. C. Han, and M. B. Srivastava, "Dynamic fine-grained localization in ad-hoc sensor networks," in Proceedings of the Fifth International Conference on Mobile Computing and Networking (MobiCom 2001), Rome, Italy, July 2001, pp. 166–179.

[6] B. Kusy, A. Ledeczi, and X. Koutsoukos, "Tracking mobile nodes using RF Doppler shifts," in SenSys '07: Proceedings of the 5th International Conference on Embedded Networked Sensor Systems, New York, NY, USA, 2007, pp. 29–42, ACM.

[7] L. Klingbeil and T. Wark, "A wireless sensor network for real-time indoor localisation and motion monitoring," in IPSN '08: Proceedings of the 2008 International Conference on Information Processing in Sensor Networks, Washington, DC, USA, 2008, pp. 39–50, IEEE Computer Society.

[8] D. Lymberopoulos, Q. Lindsey, and A. Savvides, "An empirical analysis of radio signal strength variability in IEEE 802.15.4 networks using monopole antennas," under submission, April 2005.

[9] L. D. Stone, C. A. Barlow, and T. L. Corwin, Bayesian Multiple Target Tracking, Artech House Publishers, 1999.

[10] D. B. Reid, "An algorithm for tracking multiple targets," IEEE Transactions on Automatic Control, vol. 24, pp. 843–854, December 1979.

[11] C. J. Veenman, M. Reinders, and E. Backer, "Resolving motion correspondence for densely moving points," IEEE Transactions on Pattern Analysis and Machine Intelligence, March.

[12] O. Javed, Z. Rasheed, K. Shafique, and M. Shah, "Tracking across multiple cameras with disjoint views," in ICCV '03: Proceedings of the Ninth IEEE International Conference on Computer Vision, Washington, DC, USA, 2003, p. 952, IEEE Computer Society.

[13] M. Taj, E. Maggio, and A. Cavallaro, "Multi-feature graph-based object tracking," in CLEAR, 2006, pp. 190–199.

[14] L. Wald, "Definitions and terms of reference in data fusion," in International Archives of Photogrammetry and Remote Sensing, 1999.

[15] A. Forner-Cordero, M. Mateu-Arce, I. Forner-Cordero, E. Alcantara, J. C. Moreno, and J. L. Pons, "Study of the motion artefacts of skin-mounted inertial sensors under different attachment conditions," Physiological Measurement, vol. 29, pp. 21–31, April 2008.

[16] Y. K. Thong, M. S. Woolfson, J. A. Crowe, B. R. Hayes-Gill, and R. E. Challis, "Dependence of inertial measurements of distance on accelerometer noise," Measurement Science and Technology, vol. 13, pp. 1163–1172, 2002.

[17] H. W. Kuhn, "The Hungarian method for the assignment problem," Naval Research Logistics Quarterly, 1955, vol. 52.

[18] T. Teixeira and A. Savvides, "Lightweight people counting and localizing in indoor spaces using camera sensor nodes," in ACM/IEEE International Conference on Distributed Smart Cameras, September 2007.

