
Copyright (c) 2010 IEEE. Personal use is permitted. For any other purposes, Permission must be obtained from the IEEE by emailing [email protected].

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.


Audio Visual Fusion and Tracking With Multilevel Iterative Decoding:

Framework and Experimental Evaluation

Shankar T. Shivappa, Bhaskar D. Rao, and Mohan M. Trivedi

Abstract—Speech is a natural interface for human communication. However, building human computer interfaces in unconstrained intelligent spaces still remains a challenging task. Incorporating video information is shown to improve the performance of many audio applications. Similarly, information from the microphones is useful in computer vision tasks. One of the first steps in enabling natural human computer interaction is person tracking. In this paper we present a new approach to person tracking using both audio and visual information. We develop a multilevel framework to combine the audio and visual cues to track multiple persons in a meeting room equipped with cameras and microphone arrays. We discuss in detail the multilevel iterative decoding based audio-visual person tracker (MID-AVT). Extensive experimental evaluation of the MID-AVT and comparison to other audio-visual tracking techniques is also presented. The dataset consists of real meeting recordings with sensor configurations similar to those used in the CLEAR 2006 and CLEAR 2007 evaluation workshops. The overall accuracy of the tracker was 75%. The MID-AVT framework performed slightly better than the particle filter based tracker when accurate camera and microphone calibration was available. However, the MID-AVT is also shown to be robust to sensor calibration errors while the particle filtering framework fails. In addition to the audio-visual person tracking results, we also track the active speaker at every instant of time and present the results.

Index Terms—human computer interaction, hierarchical frameworks, human activity analysis, person tracking, iterative decoding

I. INTRODUCTION

Speech is an important modality in human-human and human-computer interactions. Speech signals provide valuable information required to understand human activities and interactions. They are also a natural mode of communication for humans. Thus speech based interfaces are of primary importance in intelligent spaces. Similarly, visual cues such as gestures, eye gaze and affect are also natural modes of expression for humans. However, building natural human-computer interfaces in an unconstrained intelligent space is a very challenging task [1][2]. The challenges include problems such as clean speech acquisition from far-field microphones, where robustness to environmental noise is critical. Audio-visual systems are gaining popularity to facilitate such robust natural interfaces, especially with far-field microphones and cameras.

Copyright (c) 2008 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to [email protected].

The authors are with the Department of Electrical and Computer Engineering, University of California, San Diego, USA (e-mail: [email protected], [email protected], [email protected]).

In general, in the context of an intelligent space, this involves the extraction of various kinds of audio and visual cues at different levels of semantic abstraction. Human activity in a scene is usually monitored using arrays of audio and visual sensors. Tasks such as person localization and tracking, speaker ID, focus of attention detection, speech recognition and affective state recognition are performed. The first step in this sequence is that of person localization and tracking. In this paper we present an iterative decoding based hierarchical framework for tracking multiple persons in a meeting room equipped with cameras and microphone arrays. The framework fuses audio and visual cues to locate and track persons. The rest of the paper is organized as follows. In section II, we present a brief outline of the overall goal of audio-visual fusion schemes in the context of intelligent spaces. We observe that person tracking is a fundamental step in designing intelligent spaces. In section IV, we present a detailed description of the multilevel iterative decoding based audio-visual person tracking (MID-AVT) framework, first introduced in [3]. This is followed by an extensive experimental evaluation of the person tracking scheme in our testbed. A quantitative comparison of the MID-AVT framework with existing audio-visual person tracking schemes is presented, along with an analysis of the relative merits and demerits of the various algorithms.

II. OVERALL GOAL OF AUDIO-VISUAL FUSION IN INTELLIGENT SPACES

An intelligent space ultimately aims to mimic the abilities of a human being - to interpret real-world situations, including interaction with other humans. While little is known about how humans understand and interpret the complex world, the consensus is that an integration of information at different levels of the semantic hierarchy has to come together for this task. Early work in equipping intelligent spaces with active sensors was started at the CVRR lab at UCSD [2][1]. In [4], the authors propose a hierarchical HMM framework for modeling human activity. More recent hierarchical fusion strategies include [5][6][7][8]. In [8], the authors develop a probabilistic integration framework for fusion of audio-visual cues at the track and identity levels. This is an example of fusion at multiple levels of abstraction. Similarly, in [9], the utility of head pose estimation and tracking for speech recognition from distant microphones is explored. In [10], the authors use video localization to enhance the performance of the beamformer for better speech reconstruction from far-field microphones.


The utility of hierarchical fusion to develop robust human activity analysis algorithms is quite evident from these existing examples. An in-depth analysis of the hierarchical fusion possibilities is not the focus of the present paper. However, we note that person tracking is a fundamental step in many hierarchical audio-visual fusion schemes, as seen in Table I.

In Figure 1, we present a flow diagram of the fusion of multimodal cues as developed in [11]. The audio and video signals provide the person location information, and this is fused in the audio-visual tracking step to come up with robust estimates of the 3D co-ordinates of the subjects. The tracking information is augmented with the speaker ID when available, and this improves the re-identification of the tracks in ambiguous cases. The location and head pose estimates are fused for effective beamforming. The reconstructed clean speech from the beamformer is used by the speaker ID module, which identifies the active speaker. The speech recognizer uses both the speaker ID and the reconstructed speech to recognize full speech or spot keywords in the utterance. Thus, when the various blocks for audio-visual human activity analysis are put together, there is a whole range of fusion possibilities to make the system more robust and effective. The fundamental step in this promising scheme is the audio-visual person tracking algorithm. The rest of this paper is an in-depth analysis of the iterative decoding based audio-visual person tracking framework, first described in [3].

III. PERSON TRACKING USING AUDIO-VISUAL CUES

Robust person tracking is the first step in facilitating detection and analysis of human activity in a monitored space. It is also an integral component of intelligent spaces, facilitating seamless interaction between humans and computers. Tracking humans using audio-visual cues can provide robustness to background noise and visual clutter. Tracking based on visual sensors has been widely researched [12]. Microphone array based trackers that track sound sources have also been studied by some researchers [13]. In this section we use the iterative decoding algorithm to formulate a general fusion framework for multimodal person tracking and apply it to track people in an indoor environment with multiple cameras and microphone arrays. We also present details of the experimental evaluation of the framework in our laboratory testbed. The evaluation is carefully designed to bring forth the true strengths of our framework and its weaknesses. In section III-A, we present a survey of related research and the comparative advantages of the MID-AVT framework. In section IV, we present the mathematical formulation of the hidden Markov model based MID-AVT framework. We describe the modeling, training and testing of such a system. In section V, we present the laboratory testbed with multiple cameras and microphone arrays which was used for extensive experimentation and evaluation studies.

A. Existing Audio-visual person tracking schemes

In this section we present a brief survey of related research activities in the field of multimodal person tracking. We also develop the motivation behind using the iterative decoding framework to solve the tracking problem.

Person tracking has been a computer vision problem that has received considerable attention [12]. A good review of multi-camera trackers can be found in [14]. Audio source localization is also a well-researched field [15][16]. Localizing and tracking individuals using audio-visual information has recently received much attention.

Early efforts in tracking speakers using both audio and video cues involved camera epipolar constraints and audio cross-correlation. In [17] one camera and two microphones were used and a single person was tracked. Spatial probability maps were used in [18] to track a single speaker using two cameras and three microphones. [19] used a particle filter to track one subject using one camera and two microphones. [20] uses auditory epipolar geometry and face localization to track multiple people in the camera view using four microphones. A Bayesian network based feature concatenation scheme was explored in [21] using one camera and two microphones. Audio-visual synchrony and correlation have been exploited to locate speakers in [22][23][24]. These early efforts were constrained by the number of sensors used (usually one or two cameras and two to three microphones) and the scene complexity (usually one speaker was tracked).

Subsequent researchers have used Bayesian networks with particle filtering based inference techniques in audio-visual tracking [19][25][26][21][27][28][29][30][31][32]. Approximate inference in the dynamic Bayesian network framework, necessitated by the complexity and non-Gaussianity of the joint models, is performed by the use of particle filters [31],[28]. In the recent past, the CLEAR 2006 and CLEAR 2007 evaluation workshops [33][34] have been a significant research effort in evaluating audio-visual person tracking in meeting and lecture scenes. A wide variety of frameworks were developed and evaluated in these workshops on datasets collected under the initiative of the European CHIL (Computers in the Human Interaction Loop) consortium. Among the techniques presented in CLEAR 2006 and CLEAR 2007, [32][35][36] are the closest matching schemes to the MID-AVT framework.

[36] describes an audio-visual 3-D person tracker that uses face detectors as the visual front-end and fuses detections from multiple views to obtain the 3-D location of the person's head. If a speaker is active, the audio localization results are matched to the closest video track, which continues to be tracked. If there is no match with the video tracks, the audio track is tracked separately. The results indicate that though the video face detection yields consistent results, the fusion of audio localization information does not perform well. In fact, with the addition of audio information the results are worse than the video-only results.

[35] describes an elaborate 3-D voxel based video tracker augmented by audio localization information. Views from multiple cameras are combined to construct a 3-D voxel representation of the subjects, and this 3-D object is then tracked over time. One problem with such an approach is that it relies heavily on the calibration of the cameras to obtain the 3-D co-ordinates of object pixels. This sensitivity is a recurring feature in other schemes too.


TABLE I
SUMMARY OF HIERARCHICAL FUSION STRATEGIES IN AUDIO-VISUAL HUMAN ACTIVITY ANALYSIS

Audio-visual tasks involved | Publication and year
Active camera networks | Trivedi et al. [2][1], 2000
Human activity recognition | Oliver et al. [4], 2004
Group and individual activity recognition | Zhang et al. [5], 2004
Speech reconstruction - person tracking, beamforming, speech recognition | Maganti et al. [10], 2007
Assistive meeting - person tracking, hand tracking, speaker orientation and head pose | Dai and Xu [7], 2008
Identity tracking - person tracking, face recognition, speaker ID | Bernardin et al. [8], 2008
Scene understanding - person tracking, head pose, beamforming, speaker ID and keyword spotting | Shivappa et al. [11], 2009

[Fig. 1 block diagram: audio and video sensors feed audio source localization and foreground object detection; these drive audio-visual person tracking and head pose estimation, which in turn feed beamforming, speaker ID and speech recognition.]

Fig. 1. Flowchart summarizing the exchange of audio and visual cues at multiple levels of semantic abstraction.

Another shortcoming of [35] is that the audio localization information is associated with the video detections using data association techniques. Details of the data association technique used are not provided, and one can assume that proximity based data association is one possible solution. This could lead to many false detections because the audio detections are quite noisy. The results do indicate that the audio-visual tracker performs only as well as the video-only tracker. In section V-D, we explore this in more detail.

[32] presents a state-space based fusion strategy for associating audio localization information with the video tracks. 3-D tracks are maintained using a particle filter based tracker. If audio detections are close to video tracks, they are associated with each other. If not, new tracks are created to explain the audio detections till a matching video track is found. The 3-D video tracker described here has the same sensitivity to camera calibration mentioned above. In addition, a separate particle filter is used for each person and hence an estimate of the number of people in the scene is necessary. Also, even when the number of subjects is known accurately, if some subjects are not detected, the tracker tends to initialize false tracks to explain the given number of subjects.

[31] presents an interesting particle filtering framework which incorporates the audio and visual detections into the particle filtering framework. However, the tracking framework presented in [31] does not correspond to a 3-D tracker. The camera views are stitched to obtain a panoramic view of the room in which subjects are tracked. An advantage of this system is that the cameras need not be accurately calibrated. However, this setup places restrictions on the positions that the subjects can occupy and is difficult to generalize to new scenes, especially when larger numbers of people participate in meetings and lectures.

[28] uses particle filtering to fuse audio and video detections. To our knowledge, this is the closest approach to the MID-AVT framework. In [28], two overlapping camera views are used along with a microphone array to localize and track subjects. Occlusions are handled by multi-view and audio localizations. However, the evaluation is limited to a simple scene and does not give us insight into the strengths and weaknesses of the framework. Also, the 3-D tracking relies on accurate calibration of the cameras.

B. Proposed Framework - MID-AVT

We present an alternative approach to fusion of audio-visual cues based on iterative decoding. We are interested in tracking multiple people in a space instrumented with multiple sensors - cameras and microphone arrays. One important requirement of our scheme is that the overlapping fields of view should provide robustness to occlusions. We also aim to overcome two major disadvantages of some of the existing schemes outlined above, namely sensitivity to accurate sensor calibration and the necessity to know the number of subjects in the scene. Our system is based on a rough calibration step similar to [31], but unlike [31] we do not constrain the scene complexity, and the tracking process actually incorporates multiple overlapping views, which allows us to track successfully through occlusions in some views. Unlike [28], our framework is robust to sensor calibration errors, and we also evaluate our system on a real-world dataset of meeting recordings. In Section V-D, we compare the performance of the MID-AVT framework with that of the particle filter framework suggested in [28] on the same dataset. We demonstrate the robustness to sensor calibration and also compare the performance under different scene and sensor configurations.

The MID-AVT framework is based on iterative decoding. The iterative decoding scheme as described in [37] is not applicable to tracking as we need to solve the data association problem [38] before using iterative decoding. In the next section we present a hidden Markov model (HMM) based tracking framework which specifies the tracking problem in a hierarchical manner, allowing the local sensors (camera/microphone array) to maintain track hypotheses and the global tracker to fuse the local tracks from the various sensors to generate a robust estimate using iterative decoding. The same framework is also applicable to situations where multiple sensors are used to monitor disjoint spaces. In this case, one cannot expect robustness to sensor limitations as one would in the overlapping-field-of-view case.

The calibration of multimodal sensors is an important issue in tracking. In the MID-AVT framework, the system only requires a rough calibration step. After this initial calibration, the system can continue tracking even if the sensors are disturbed, because we are tracking in the sensor co-ordinate system and not in the 3-D world co-ordinate system. If the calibration is accurate, we can, in addition, infer the 3-D co-ordinates of the subjects. This 3-D location information is not necessary for our tracking algorithm to work. In our experimental evaluation we provide results that support this claim. This is an advantage over the particle filter based tracking schemes, because the particle filters track in the 3-D world co-ordinates and a mismatch in the calibration of the sensors is not tolerable.

Also, the MID-AVT framework does not need to know the number of people in the scene. Every individual who presents a signature on any of the sensors is detected and tracked. This is yet another advantage over [31], which assumes that the number of subjects in the scene is known, and [32], which assumes that the maximum number of subjects to be tracked is three.

In our present work, we use the iterative decoding principle on an HMM based framework for audio-visual tracking of multiple persons through multiple cameras and a network of multiple microphone arrays. The framework is modular and hence easy to expand to a larger number of cameras and microphone arrays, or any other sensors that can localize persons. It is also applicable to sensors with overlapping and non-overlapping fields of 'view'. Since the placement of the sensors is assumed to be arbitrary but fixed, we only need a rough calibration scheme to establish the correspondence between sensors. The unimodal models considered in this paper are simple and intuitive. The goal of the paper is to demonstrate the fusion algorithm and its applicability to the tracking scenario.

IV. MULTILEVEL ITERATIVE DECODING BASED AUDIO-VISUAL TRACKING (MID-AVT) FRAMEWORK

We are interested in tracking multiple targets (people) in a space instrumented with multiple cameras and microphone arrays. Each sensor detects the subjects in its field of view and maintains an exhaustive list of possible track hypotheses. For example, if one tracked object occludes another, two tracks converge and they may diverge again at a later stage. However, when two tracks converge and diverge there are four possible track hypotheses, as shown in the first column of Figure 2. Human motion in an indoor environment is highly non-linear and hence at the sensor level there is not enough information to reject the false hypotheses. Once the information from other sensors is also available, a composite tracker can evaluate the likelihood of each hypothesis, incorporating the information from the other sensor, and the hypotheses with high likelihood are selected and tracked in the subsequent time frames. This process is graphically depicted in the second and third columns of Figure 2. Here, there are two distinct tracks in sensor 2, because there is no occlusion in its view. When the four hypotheses from sensor 1 are evaluated with the two distinct tracks from sensor 2, only two hypotheses survive with high likelihood. These surviving tracks are tracked in subsequent time frames. The hypothesis selection process described in Figure 2 is intuitive, and the iterative decoding algorithm gives us a statistical framework to implement it. In Figure 3 we present a high level flowchart of the MID-AVT framework.

A. Feature extraction for the cameras

In our experiment we choose a simple foreground object detection scheme. The foreground pixels in a frame are detected by background subtraction. They are then fused into reliable blobs by morphological operations. We fit a bounding rectangle to each distinct blob. The pixel co-ordinates of the center of the $i$th rectangle, $(x_{i,t}, y_{i,t})$, and the area of the rectangle, $a_{i,t}$, are the components of the observation vector $o_{i,t} = [x_{i,t}\; y_{i,t}\; a_{i,t}]^T$. For every frame at time $t$, for the $j$th camera, we maintain a list of the $M_j$ detected foreground objects $o^j_{i,t}$, $1 \le i \le M_j$.
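As an illustration of this kind of front-end, the sketch below extracts per-blob observation vectors with OpenCV; the function name, threshold and minimum blob area are our own choices for illustration, not values from the paper.

```python
import cv2
import numpy as np

def camera_observations(frame, background, thresh=30, min_area=500):
    """Per-blob observation vectors [x, y, area] for one camera frame (a sketch):
    background subtraction, morphological cleanup, bounding rectangles."""
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    bg = cv2.cvtColor(background, cv2.COLOR_BGR2GRAY)
    fg = cv2.absdiff(gray, bg)
    _, mask = cv2.threshold(fg, thresh, 255, cv2.THRESH_BINARY)
    kernel = np.ones((5, 5), np.uint8)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)  # OpenCV 4.x signature
    observations = []
    for c in contours:
        x, y, w, h = cv2.boundingRect(c)
        if w * h >= min_area:
            # centre of the bounding rectangle plus its area
            observations.append(np.array([x + w / 2.0, y + h / 2.0, float(w * h)]))
    return observations
```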

B. Feature extraction for the microphone arrays

We use the time delay of arrival (TDOA) estimates between pairs of microphones in an array to estimate the location of the sound source. We use the generalized cross correlation based phase transform (GCC-PHAT) framework [39][40] to locate sound sources if present. This technique has been the preferred method of TDOA estimation in established literature [31][41] as it has been shown to be robust to reverberations. For simplicity, the TDOA estimates are computed on time windows of audio samples corresponding to the interval between the camera frames. We have a vector of TDOA values between each microphone $i$ and the reference microphone $r$, given by $\vec{\tau} = (\tau_{1r}, \tau_{2r}, \ldots, \tau_{mr})$. The TDOA estimates form the observation vector $o_{1,t}$ corresponding to the microphone array. Thus we reduce a microphone network to a 3-D localizer similar to a camera. Note that the use of the SRP-PHAT technique [16] would allow us to detect multiple sound sources simultaneously. We could then have $M_t$ detected sources at each time instance and a list of observations $o_{i,t}$, $1 \le i \le M_t$. In the current paper, we limit ourselves to finding only one sound source at a time. Our audio setup, however, differs from [41] in the arrangement of microphones. Traditional microphone arrays (linear/planar/spherical) have only angular resolution because the total span of the array is small compared to the distance to the source. Our microphone arrays have a much wider total span and provide us with better resolution in the TDOA space. However, we need larger audio frames to accurately estimate the TDOA in arrays with wider span.
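A minimal GCC-PHAT sketch for one microphone pair is shown below; the function name and the windowing and interpolation details are simplifications for illustration, not the authors' implementation.

```python
import numpy as np

def gcc_phat_tdoa(sig, ref, fs, max_tau=None):
    """Estimate the TDOA between two microphone signals with GCC-PHAT weighting."""
    n = sig.shape[0] + ref.shape[0]
    SIG = np.fft.rfft(sig, n=n)
    REF = np.fft.rfft(ref, n=n)
    R = SIG * np.conj(REF)
    R /= np.abs(R) + 1e-12                 # PHAT weighting: keep only the phase
    cc = np.fft.irfft(R, n=n)
    max_shift = n // 2
    if max_tau is not None:
        max_shift = min(int(fs * max_tau), max_shift)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    shift = np.argmax(np.abs(cc)) - max_shift
    return shift / float(fs)               # delay in seconds

# TDOA observation vector for one array with reference microphone 0:
# tau = [gcc_phat_tdoa(x[i], x[0], fs) for i in range(1, num_mics)]
```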

After the observations are extracted for each frame of audio, the cameras and the microphone arrays are treated equivalently, as in [41]. In the next section we refer to the camera or the microphone array in general as a sensor.

C. Multiple hypotheses generation - local tracking

The object detection module associated with each sensor detects the foreground objects (or sound sources) in each frame. In the presence of multiple objects of interest, all distinguishable objects are detected by each camera. False positive errors could occur in the presence of background noise or clutter. False negative errors could occur due to occlusions. The tracking framework will address both these issues.

Consider frames from time $t = 1 \ldots T$. We start with a list of features of detected objects $o_{i,t}$, $1 \le i \le M_t$ at time $t$, where $M_t$ is the number of detected objects at time $t$. In our current setup, the $o_{i,t}$ are image coordinates of the detected objects. More elaborate features such as size and color can also be added under the same framework. We start with a set of initial track values $l_{j,0}$ for track $j$. At each time step, the tracks are updated according to the rule $l_{j,t} = \{o_{i,t} \mid d(l_{j,t-1}, o_{i,t}) \le r\}$, where $d(x, y)$ is the Euclidean distance between $x$ and $y$. If more than one observation lies within Euclidean distance $r$ from $l_{j,t-1}$, the old track is split to account for each such observation. If no observation lies within radius $r$, we assign the past value $l_{j,t-1}$ to the track. This corresponds to occlusions or the object leaving the field of 'view' of the sensor. We can see that this is a very simple data association framework and would result in a lot of false positives, as it maintains tracks corresponding to all the possibilities in case of any occlusions or merging and diverging of tracks. Only those possibilities are discarded where the data association can be completed without ambiguity based on nearest neighbors. In the next step, using the information from other tracks, we reject the hypotheses that are unlikely under a probabilistic joint model.
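The gating-and-splitting rule above can be sketched as follows; the data structures and function name are illustrative assumptions, not the authors' code.

```python
import numpy as np

def update_local_tracks(tracks, observations, r):
    """One step of the local multiple-hypothesis update (a sketch).

    tracks       : list of track histories, each a list of past observation vectors
    observations : list of observation vectors o_{i,t} for the current frame
    r            : gating radius in the sensor's own co-ordinates
    """
    new_tracks = []
    for track in tracks:
        last = np.asarray(track[-1])
        matches = [o for o in observations
                   if np.linalg.norm(np.asarray(o) - last) <= r]
        if not matches:
            new_tracks.append(track + [track[-1]])   # hold the old value (occlusion)
        else:
            for o in matches:                        # split per matching observation
                new_tracks.append(track + [list(o)])
    return new_tracks
```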

D. Multiple hypotheses selection and filtering - global tracking

Consider the set of all hypotheses $h_k = \{l_j \mid 1 \le j \le N_k\}$ from sensor $k$, which has $N_k$ hypotheses. In the global tracking step, we consider all possible combinations of these hypotheses, one from each sensor. There are $\prod_k N_k$ such combinations. We evaluate the likelihood of each combination $C = (l^1_{j_1}, l^2_{j_2}, \ldots, l^N_{j_N})$ under the iterative decoding framework with HMM $\lambda_k$ for sensor $k$. Spurious tracks have a low likelihood and are discarded. The remaining tracks are then passed down to the local trackers to use as initial tracks for the next time window.
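A sketch of this combination step, assuming a scoring function supplied by the iterative decoding machinery described next; the names are hypothetical.

```python
from itertools import product

def select_global_hypotheses(hypotheses_per_sensor, joint_log_likelihood, threshold):
    """Enumerate one-track-per-sensor combinations and keep the likely ones.

    hypotheses_per_sensor : list over sensors; each entry is a list of track hypotheses
    joint_log_likelihood  : callable scoring a combination under the per-sensor HMMs
                            (assumed to be provided by the iterative decoder)
    threshold             : log-likelihood cut-off below which a combination is discarded
    """
    surviving = []
    for combo in product(*hypotheses_per_sensor):   # prod_k N_k combinations
        score = joint_log_likelihood(combo)
        if score >= threshold:
            surviving.append((score, combo))
    return surviving
```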

E. Iterative decoding algorithm

Consider a hidden Markov model $\Lambda_k$ for sensor $k$ with $N$ hidden states (see Figure 4). For clarity, we drop the sensor index $k$. $\Lambda$ has a parametric transition density. The hidden state $q_t$ corresponds to the true location of the object at time $t$ in the same feature space as the observation vectors of sensor $k$. Thus the hidden states are, in a Bayesian sense, the temporally smoothed observations. The conditional distribution of the observation $o_t$ when the hidden state is $q_t$ is assumed to be Gaussian. Now, the decoding problem is to estimate the optimal state sequence $Q_1^T = \{q_1, q_2, \ldots, q_T\}$ of the HMM based on the sequence of observations $O_1^T = \{o_1, o_2, \ldots, o_T\}$.


Fig. 2. The disambiguation of confusable hypotheses using the iterative decoding scheme is illustrated here. The first graph shows the tracks as seen in one of the sensors. The next four images in the first column present the possible hypotheses that are plausible according to the first sensor alone. The second and third columns have two tracks in the field of view of sensor 2. Note that both the second and third columns correspond to the same sensor. The extrinsic information that these tracks provide to sensor 1 is shown in the next eight images, superimposed with the four hypotheses from sensor 1. The two surviving hypotheses are marked in red.

The maximum aposteriori probability (MAP) state at time $t$ is calculated using the BCJR (Bahl, Cocke, Jelinek and Raviv) algorithm [42], which is also referred to as the forward-backward sum-product algorithm in the graphical models community. Note that any other inference technique can also be used. The MAP estimate for the hidden state at time $t$ is given by $\hat{q}_t = \arg\max_{q_t} P(q_t, O_1^T)$. The BCJR algorithm computes this using the forward and backward recursions.

The forward recursion variable $\alpha_t(m)$, the backward recursion variable $\beta_t(m)$, the joint likelihood of the hidden state and the observation sequence $\lambda_t(m)$ and the recursion variable $\gamma_t(m',m)$ are defined as follows,

$\lambda_t(m) = P(q_t = m, O_1^T)$  (1)

$\alpha_t(m) = P(q_t = m, O_1^t)$  (2)

$\beta_t(m) = P(O_{t+1}^T \mid q_t = m)$  (3)

$\gamma_t(m',m) = P(q_t = m, o_t \mid q_{t-1} = m')$  (4)

where $m = 1, 2, \ldots, N$ and $m' = 1, 2, \ldots, N$.

Then we establish the recursions,

$\alpha_t(m) = \sum_{m'} \alpha_{t-1}(m') \cdot \gamma_t(m', m)$  (5)

$\beta_t(m) = \sum_{m'} \beta_{t+1}(m') \cdot \gamma_{t+1}(m, m')$  (6)

$\lambda_t(m) = \alpha_t(m) \cdot \beta_t(m)$  (7)
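For concreteness, a log-domain sketch of these recursions is given below; the array layout and function name are our own, and working in the log domain stands in for the usual numerical scaling.

```python
import numpy as np

def forward_backward(log_alpha1, log_gamma):
    """BCJR forward-backward recursions for one sensor HMM, in the log domain.

    log_alpha1 : shape (N,), log P(q_1 = m, o_1), the initial forward term
    log_gamma  : shape (T-1, N, N), log_gamma[t-2][m', m] = log P(q_t = m, o_t | q_{t-1} = m')
                 for t = 2..T
    Returns log lambda_t(m) = log P(q_t = m, O_1^T) as a (T, N) array.
    """
    T = log_gamma.shape[0] + 1
    N = log_alpha1.shape[0]
    log_alpha = np.full((T, N), -np.inf)
    log_beta = np.zeros((T, N))            # beta_T(m) = 1 -> log 0.0
    log_alpha[0] = log_alpha1
    for t in range(1, T):
        # alpha_t(m) = sum_{m'} alpha_{t-1}(m') * gamma_t(m', m)
        log_alpha[t] = np.logaddexp.reduce(
            log_alpha[t - 1][:, None] + log_gamma[t - 1], axis=0)
    for t in range(T - 2, -1, -1):
        # beta_t(m) = sum_{m'} beta_{t+1}(m') * gamma_{t+1}(m, m')
        log_beta[t] = np.logaddexp.reduce(
            log_beta[t + 1][None, :] + log_gamma[t], axis=1)
    return log_alpha + log_beta            # log lambda_t(m); MAP state: argmax over m
```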

At the first sensor HMM, we decode the hidden states using the observations from the first sensor. We obtain the aposteriori probabilities $\lambda^{(1)}_t(m) = P(q_t = m, O_1^T)$.

In the second sensor HMM, these aposteriori probabilities $\lambda^{(1)}_t(m)$ are utilized as extrinsic information in decoding the hidden states from the observations of the second sensor. Thus the aposteriori probabilities in the second stage of decoding are given by $\lambda^{(2)}_t(m) = P(q_t = m, O_1^T, Z_1^{(1)T})$, where $Z^{(1)}_t = \lambda^{(1)}_t$ is the extrinsic information from the first sensor.


[Fig. 3 block diagram: for each camera, foreground object detection yields bounding box co-ordinates, which drive data association by multiple hypothesis generation and the local tracks; for the microphone network, generalized cross-correlation yields a TDOA vector, handled in the same way; hypothesis selection using iterative decoding produces the global tracks, while marker detection and triangulation provide 3D co-ordinates for comparison and performance evaluation.]

Fig. 3. The MID-AVT framework involving the local and global track hierarchies along with the groundtruth estimation procedure.

Fig. 4. The HMM for smoothing the observations from sensor k. Note that the hidden states $q_t$ are described in the same feature space as the observations $o_t$ and hence we refer to them as the temporally smoothed observations.

$\lambda^{(2)}_t(m) = P(q_t = m, O_1^T, Z_1^{(1)T})$  (8)

$\alpha^{(2)}_t(m) = P(q_t = m, O_1^t, Z_1^{(1)t})$  (9)

$\beta^{(2)}_t(m) = P(O_{t+1}^T, Z_{t+1}^{(1)T} \mid q_t = m)$  (10)

$\gamma^{(2)}_t(m',m) = P(q_t = m, o_t, Z^{(1)}_t \mid q_{t-1} = m')$  (11)

In order to distinguish the hidden states of sensor 1 from those of sensor 2 at time $t$, we denote them as $q_{1,t}$ and $q_{2,t}$ respectively. Similarly, the observations are denoted by $o_{1,t}$ and $o_{2,t}$. Then the recursions do not change, except for the computation of $\gamma^{(2)}_t(m',m)$. Since the extrinsic information is independent of the observations from the second modality,

$\gamma^{(2)}_t(m',m) = P(q_{2,t} = m, o_{2,t}, Z^{(1)}_t \mid q_{2,t-1} = m')$

$\gamma^{(2)}_t(m',m) = P(q_{2,t} = m \mid q_{2,t-1} = m') \cdot P(o_{2,t} \mid q_{2,t} = m) \cdot P(Z^{(1)}_t \mid q_{2,t} = m)$

$\gamma^{(2)}_t(m',m) = P(q_{2,t} = m \mid q_{2,t-1} = m') \cdot P(o_{2,t} \mid q_{2,t} = m) \cdot \sum_n \{P(Z^{(1)}_t \mid q_{1,t} = n)\, P(q_{1,t} = n \mid q_{2,t} = m)\}$

where $q_{2,t}$ and $o_{2,t}$ correspond to the hidden state and observation at time $t$ for modality 2.

Assuming that $P(Z^{(1)}_t \mid q_{1,t}) = 1$ if $q_{1,t} = \arg\max_n Z^{(1)}_{t,n}$ and 0 otherwise, where $Z^{(1)}_{t,n}$ is the $n$th component of the vector $Z^{(1)}_t$, which corresponds to a hard decision rule, we are now left with the evaluation of $P(q_{1,t} = n \mid q_{2,t} = m)$. In section IV-F, we describe the process of sensor calibration by which we obtain this distribution.

Alternatively, we can visualize the iterative decoding as follows. Consider the HMM based tracker for each sensor $k$. The observation model for this HMM, which defines the conditional distribution of the observation $o_{k,t}$ when the hidden state is $q_{k,t}$, is assumed to be Gaussian. Now, the iterative decoding algorithm involves incorporating extrinsic information from sensor $k-1$ while decoding the hidden states of sensor $k$. In order to do so we augment the observation model to include the extrinsic information as well. Let us denote by $Z^{(k-1)}_t$ the extrinsic information from sensor $k-1$. The augmented observation model is now represented as

$P(o_{k,t}, Z^{(k-1)}_t \mid q_{k,t}) = P(o_{k,t} \mid q_{k,t}) \cdot P(Z^{(k-1)}_t \mid q_{k,t})$

$P(o_{k,t}, Z^{(k-1)}_t \mid q_{k,t}) = P(o_{k,t} \mid q_{k,t}) \cdot P(Z^{(k-1)}_t \mid q_{k-1,t}) \cdot P(q_{k-1,t} \mid q_{k,t})$

Assuming that $P(Z^{(k-1)}_t \mid q_{k-1,t}) = 1$ if $q_{k-1,t} = \arg\max_n Z^{(k-1)}_{t,n}$ and 0 otherwise, where $Z^{(k-1)}_{t,n}$ is the $n$th component of the vector $Z^{(k-1)}_t$, which corresponds to a hard decision rule, we are now left with the evaluation of $P(q_{k-1,t} = n \mid q_{k,t} = m)$. In section IV-F, we describe the process of sensor calibration by which we obtain this distribution.

We proceed to sensor $k+1$ with the extrinsic information $Z^{(k)}$ from sensor $k$. We proceed likewise till we decode the hidden states of the last sensor from the extrinsic information of the previous sensor. In the next iteration, we use the extrinsic information of the last sensor to decode the hidden states of the first sensor. Then the second iteration proceeds as the first, with updated state sequences. Finally, we threshold the overall log-likelihood of the track combinations to select the surviving tracks in each sensor 'view'.
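A sketch of this sensor-to-sensor cycle under the hard-decision rule above; the `decode` interface and the layout of the calibration matrices are assumptions for illustration, not the authors' code.

```python
import numpy as np

def iterative_decode(sensors, cross_cond, num_iterations=2):
    """Cycle extrinsic information through the sensor HMMs (a sketch).

    sensors    : list of objects with decode(extrinsic) -> (T, N_k) array of aposteriori
                 log-probabilities; extrinsic is None on the first pass, and decode is
                 assumed to fold the extrinsic term into gamma_t as in Eq. (11).
    cross_cond : cross_cond[k] is a matrix P(q_prev = n | q_k = m) relating sensor k to
                 the sensor decoded just before it in the cycle (from calibration).
    """
    extrinsic = None
    posteriors = [None] * len(sensors)
    for _ in range(num_iterations):
        for k, sensor in enumerate(sensors):
            if extrinsic is not None:
                # hard decision on the previous sensor, mapped through the
                # calibration distribution P(q_prev | q_k)
                hard = np.argmax(extrinsic, axis=1)              # (T,)
                mapped = np.log(cross_cond[k][hard] + 1e-12)     # (T, N_k)
            else:
                mapped = None
            posteriors[k] = sensor.decode(mapped)
            extrinsic = posteriors[k]
    return posteriors
```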

F. Sensor Calibration

The camera and microphone locations are assumed to be arbitrary but fixed. Hence we need a rough calibration step to establish a relationship between the state spaces of different sensors. In the iterative decoding algorithm presented in section IV-E, we are left with the problem of estimating $P(q_{1,t} = n \mid q_{2,t} = m)$ for the sensor pair (1, 2). There are efficient ways of learning and storing this distribution by using decision trees, piecewise linear approximations and kernel based density estimation techniques [43]. In our experiments we use a simple kernel density estimation scheme to estimate the conditional distribution $P(q_{1,t} = n \mid q_{2,t} = m)$, by first estimating the joint distribution $P(q_{1,t} = n, q_{2,t} = m)$ from a set of training points collected during the calibration step. In order to collect training points, we have an initial calibration step where a single person carrying a sound source walks around the space monitored by the sensors. Tracking is now trivial as there is only one object. The observations from several frames are used to estimate the joint distribution $P(q_{1,t} = n, q_{2,t} = m)$ using a Gaussian kernel of appropriate bandwidth for smoothing.

During the initial calibration phase, a person carrying a sound source walks around the room. From the audio signals, the TDOA vector corresponding to the sound source is computed, and from the video frames, the (x, y) pixel co-ordinate of the foreground object is obtained. Note that our calibration step establishes correspondences between sensors in the sensor co-ordinate system. We do consider the calibration of the cameras and microphone arrays to the world co-ordinate system. However, this is only to measure the ground truth for evaluating the accuracy of the tracker and to compare it with other tracking schemes.
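One possible realization of this kernel density estimate is sketched below; the discretized state centres and the bandwidth handling are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def calibrate_cross_distribution(samples_1, samples_2, states_1, states_2, bandwidth):
    """Estimate P(q_1 = n | q_2 = m) from paired calibration-walk samples (a sketch).

    samples_1, samples_2 : paired observations from sensors 1 and 2, shapes (S, d1), (S, d2)
    states_1, states_2   : discretized state centres of the two HMMs, (N1, d1), (N2, d2)
    bandwidth            : Gaussian kernel bandwidth used for smoothing
    """
    def kernel(x, centres):
        d2 = np.sum((x[:, None, :] - centres[None, :, :]) ** 2, axis=-1)
        return np.exp(-0.5 * d2 / bandwidth ** 2)                  # (S, N)

    w1 = kernel(np.asarray(samples_1, float), np.asarray(states_1, float))
    w2 = kernel(np.asarray(samples_2, float), np.asarray(states_2, float))
    joint = w1.T @ w2                                              # unnormalized P(q1 = n, q2 = m)
    joint /= joint.sum() + 1e-12
    cond = joint / (joint.sum(axis=0, keepdims=True) + 1e-12)      # P(q1 = n | q2 = m)
    return cond
```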

V. EXPERIMENTAL EVALUATION

A. Test bed details

In this section we present the details of our laboratory testbed with multiple cameras and microphone arrays. The testbed is located in the Smartspaces lab at CALIT2 at the University of California, San Diego. The testbed is equipped with 24 microphones and 4 cameras. The layout is shown in Figure 6. The cameras have significantly overlapping fields of view and different perspectives. The cameras have a resolution of 640x480 pixels and capture frames synchronously with each other and the microphones, at 7.5 fps. The audio signal is sampled at 44.1 kHz. There are four microphone arrays with 4 microphones each, arranged in the form of a cross with dimensions 40 cm x 40 cm. In addition, there is a circular array with 6 microphones in the center of the table and two microphones at the ends of the table.

B. Ground truth estimation

In order to obtain the ground truth, we use standard chessboard pattern based camera calibration techniques to calibrate the cameras with respect to the world co-ordinates. The microphones are manually located in the camera view and their location is estimated by triangulation. A sound source with a bright source of light is moved around the monitored space. By triangulation, the position of the light is accurately determined at each frame. The positions of the microphones are then optimized to match the TDOA values obtained at each frame with those computed from the sound source co-ordinates. This calibration allows us to obtain visual-marker based ground-truth estimates for comparison of our results. The location estimate from the triangulation procedure was compared with the actual location measurement. The standard deviation of the error was 2.4 cm on a test set that involved 100 different spots distributed in the room.
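A standard linear (DLT) triangulation sketch of the kind such a marker-based ground-truth step could use is shown below; the formulation is generic, not the authors' specific implementation.

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear (DLT) triangulation of a marker seen in two calibrated cameras (a sketch).

    P1, P2 : 3x4 camera projection matrices from the chessboard calibration
    x1, x2 : (u, v) pixel co-ordinates of the bright marker in each view
    Returns the 3-D world co-ordinates of the marker.
    """
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]      # dehomogenize
```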


[Fig. 5 plot: X and Y axes in millimeters; legend: pathways marked on the floor, groundtruth from markers, global track estimate.]

Fig. 5. A track and its associated ground truth in the world co-ordinates.

C. Datasets

Meetings among the lab members were recorded for the evaluation of the MID-AVT framework. The meetings consist of 4 to 6 subjects. There are clips where the subjects are either involved in a discussion or one person is giving a presentation. In collecting this dataset (MID-AVT-UCSD-1), we have tried to keep the sensor configurations comparable to the CHIL meeting rooms [44] which were used in the CLEAR 2006 and CLEAR 2007 evaluation workshops. During presentations and meetings, there is not much movement among the participants and usually only one speaker is active at a particular time, which is true for a majority of the time in many meetings. The individual segments range from 5 to 15 minutes in duration. Some meeting segments were annotated by manually marking the position of the subjects' heads once every second, for a total of 1200 seconds. This corresponds to 9000 frames and these frames were used in our evaluation.

In addition we also have a separate dataset (MID-AVT-UCSD-2) of scenes involving 1-4 subjects that involves a lot more movement of the subjects. This dataset involves multiple subjects who are involved in a continuous conversation, with mostly one active speaker at any time, moving around in the room. This dataset has a significant number of occlusions, and tracks converge and diverge frequently. It has shorter clips ranging from 1 to 5 minutes, and the evaluation is presented on a total of 3000 frames which involve about 30 occlusions which were manually detected and marked for evaluation. There are only two cameras and a total of 8 microphones in this dataset.

D. Evaluation Results

We implemented the MID-AVT framework and the particle filter based tracker from [28] on the MID-AVT-UCSD-1 dataset. For the HMMs in the MID-AVT framework, we used 500 hidden states per sensor, and we used 100 particles for each subject in the particle filter. We used all four cameras and the four cross shaped microphone arrays. Neither algorithm was implemented in real time; however, the iterative decoding algorithm was 2.5 times slower than the particle filtering approach. Moreover, the iterative decoding was carried out on blocks of length 5 seconds and hence there is a minimum delay of 5 seconds in generating the global tracks. However, there are applications such as automatic meeting summarization where such a delay is tolerable.

The tracker is evaluated by counting the number of frames in which a subject is tracked correctly (the tracker output matches the ground truth location to within 500 mm). The MID-AVT scheme had an average accuracy of 76% on the MID-AVT-UCSD-1 dataset while tracking all the subjects in the meeting scene. The errors were mostly missed detections involving subjects who blended in with the background due to dark clothing and remained silent for most of the meetings. In Figure 7 we show the different views of the meeting scene. Note that one of the subjects is completely missing in the tracker output. Also, we tracked the active speaker using the audio detections alone, associating each detection with the corresponding global track during the course of the meetings, and found that the active speaker was accurately found in 85% of the total frames. In Figure 8 we show snapshots from the global tracking process as seen from one of the camera views. Note that the active-speaker tracking tracks the different active speakers as they take turns in the conversation. However, in the meeting scenes there is only one dominant speaker and hence the audio observations do not improve the localization accuracy of the tracker. Also, there is not much movement of the seated participants, which is not a very challenging tracking scenario. The average root mean-squared error of the speaker location was 21 cm. The particle filter based tracker was evaluated and was found to perform with an accuracy of 74%. We do not see an appreciable difference between the performance of the two trackers.


Hence we also present results on the MID-AVT-UCSD-2 dataset. In Figure 9 we show a snapshot of the scene. In Table II, we show the fraction of times the global tracker successfully resolves the ambiguity during occlusions and noisy detections based on the information from the other sensors. In Figure 5, we show one of the tracks from a clip and the associated groundtruth. The root mean squared error between the track and the ground truth is 11 cm. We see that, again, the performance of the particle filter scheme is very similar to that of the MID-AVT framework.
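The frame-counting accuracy rule described above can be written, under our own assumptions about how estimates and ground truth are paired per frame, as:

```python
import numpy as np

def tracking_accuracy(estimates, ground_truth, radius_mm=500.0):
    """Fraction of annotated frames in which a subject's track estimate lies within
    radius_mm of the ground-truth head position (a sketch of the evaluation rule;
    missed frames, i.e. None estimates, count as errors)."""
    correct = 0
    total = 0
    for est, gt in zip(estimates, ground_truth):     # per annotated frame
        total += 1
        if est is not None and np.linalg.norm(np.asarray(est) - np.asarray(gt)) <= radius_mm:
            correct += 1
    return correct / float(total) if total else 0.0
```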

E. Sensitivity to sensor calibration

In order to demonstrate the robustness of the iterative decoding scheme, we simulated the calibration mismatch by applying a small fixed random rotation transformation to each camera view. This corresponds approximately to the case where the camera calibration is inaccurate. In this new configuration we conducted the experiments on the MID-AVT-UCSD-2 dataset and the results are presented in Table III. Five random rotation transformations (and in each case different cameras were perturbed by different angles) were applied to the videos. Each rotation was selected randomly to lie between -10 and 10 degrees around the camera axis. The average results are shown in Table III. We see that the performance of the particle filter tracker degrades considerably while the proposed framework maintains the tracking accuracy. The particle filter maintains the tracks in the 3-D co-ordinates and hence the mismatched calibration affects the tracking process. However, in our MID-AVT framework, local tracking occurs in the image co-ordinates and is robust to calibration errors. Note that here we have only applied a small perturbation to the sensor configuration. Further research is necessary to fully exploit this robustness to achieve automatically re-calibrating trackers.
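A sketch of such a perturbation applied to a camera's pixel co-ordinates; rotating about the image centre is our assumption of how "rotating the view" is realized.

```python
import numpy as np

def perturb_camera_view(points_px, image_size, max_angle_deg=10.0, rng=None):
    """Simulate a calibration mismatch by rotating a camera's pixel co-ordinates
    by a small random angle about the image centre (a sketch of the perturbation)."""
    rng = np.random.default_rng() if rng is None else rng
    angle = np.deg2rad(rng.uniform(-max_angle_deg, max_angle_deg))
    c, s = np.cos(angle), np.sin(angle)
    R = np.array([[c, -s], [s, c]])
    centre = np.array(image_size, dtype=float) / 2.0
    return (np.asarray(points_px, dtype=float) - centre) @ R.T + centre
```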

VI. CONCLUDING REMARKS

We have proposed a novel audio-visual person tracking scheme. The scheme performs very close to the popular particle filtering based tracking scheme. In addition, it is robust to calibration errors. This shows a lot of promise, and in the future we would like to explore the possibility of developing an auto-calibrating camera and microphone network. Also, the development of a person tracking scheme is the first step in realizing a more comprehensive fusion scheme which involves other tasks such as person identification, speech recognition and affective state recognition under unconstrained situations, finally facilitating natural human-computer interaction. The datasets developed for the evaluation of the tracking scheme will be useful in comparing existing frameworks and building better tracking systems for real-world meetings.

VII. ACKNOWLEDGEMENTS

Work described in this paper was partly funded by the RESCUE project at UCSD, NSF award #0331690. We also thank CALIT2 at UCSD for the assistance with the Smartspace lab testbed. We acknowledge the assistance and cooperation of our colleagues from the CVRR Laboratory. We would also like to thank the reviewers whose insightful comments during both rounds of reviews have helped us re-organize and greatly improve the clarity of the paper.

REFERENCES

[1] M. M. Trivedi, K. S. Huang, and I. Mikic, "Dynamic context capture and distributed video arrays for intelligent spaces," IEEE Transactions on Systems, Man and Cybernetics, Part A, 2005.

[2] M. M. Trivedi, I. Mikic, and S. Bhonsle, "Active camera networks and semantic event databases for intelligent environments," IEEE CVPR Workshop on Human Modeling, Analysis and Synthesis, 2000.

[3] S. T. Shivappa, M. M. Trivedi, and B. D. Rao, "Person tracking with audio-visual cues using the iterative decoding framework," in 5th IEEE International Conference on Advanced Video and Signal Based Surveillance, 2008 [Best Paper Award].

[4] N. Oliver, A. Garg, and E. Horvitz, "Layered representations for learning and inferring office activity from multiple sensory channels," Computer Vision and Image Understanding, 2004.

[5] D. Zhang, D. Gatica-Perez, S. Bengio, I. McCowan, and G. Lathoud, "Modeling individual and group actions in meetings: A two-layer HMM framework," IEEE International Conference on Computer Vision and Pattern Recognition, June 2004.

[6] N. M. Oliver, B. Rosario, and A. Pentland, "A Bayesian computer vision system for modeling human interactions," IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.

[7] P. Dai and G. Xu, "Context-aware computing for assistive meeting system," in Proceedings of the 1st International Conference on PErvasive Technologies Related to Assistive Environments, 2008.

[8] K. Bernardin, R. Stiefelhagen, and A. Waibel, "Probabilistic integration of sparse audio-visual cues for identity tracking," in Proceedings of the 16th ACM International Conference on Multimedia, 2008.

[9] S. T. Shivappa, B. D. Rao, and M. M. Trivedi, "Role of head pose estimation in speech acquisition from distant microphones," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2009.

[10] H. K. Maganti, D. Gatica-Perez, and I. McCowan, "Speech enhancement and recognition in meetings with an audio-visual sensor array," IEEE Trans. on Audio, Speech, and Language Processing, Nov. 2007.

[11] S. T. Shivappa, M. Trivedi, and B. D. Rao, "Hierarchical audio-visual cue integration framework for activity analysis in intelligent meeting rooms," in IEEE CVPR Workshop: ViSU'09, 2009.

[12] A. Yilmaz, O. Javed, and M. Shah, "Object tracking: A survey," ACM Comput. Surv., 2006.

[13] M. Brandstein and D. Ward, Microphone Arrays. Springer, 2001.

[14] K. S. Huang and M. M. Trivedi, "Video arrays for real-time tracking of person, head, and face in an intelligent room," Machine Vision and Applications, 2003.

[15] T. Gustafsson, B. D. Rao, and M. M. Trivedi, "Source localization in reverberant environments: Modeling and statistical analysis," IEEE Transactions on Speech and Audio Processing, Nov. 2003.

[16] J. H. DiBiase, H. F. Silverman, and M. S. Brandstein, "Robust localization in reverberant rooms," Microphone Arrays: Signal Processing Techniques and Applications, 2001.

[17] G. Pingali, G. Tunali, and I. Carlbom, "Audio-visual tracking for natural interactivity," in Proceedings of the Seventh ACM International Conference on Multimedia (Part 1), 1999.

[18] P. Aarabi and S. G. Zaky, "Robust sound localization using multi-source audiovisual information fusion," Information Fusion, 2001.

[19] J. Vermaak, M. Gangnet, A. Blake, and P. Perez, "Sequential Monte Carlo fusion of sound and vision for speaker tracking," in Eighth IEEE International Conference on Computer Vision, 2001.

[20] K. Nakadai, K. Hidai, H. Mizoguchi, H. G. Okuno, and H. Kitano, "Real-time auditory and visual multiple-object tracking for humanoids," in IJCAI, 2001.

[21] M. Beal, N. Jojic, and H. Attias, "A graphical model for audiovisual object tracking," IEEE Trans. on Pattern Analysis and Machine Intelligence, 2003.

[22] R. Cutler and L. S. Davis, "Look who's talking: Speaker detection using video and audio correlation," in IEEE International Conference on Multimedia and Expo (III), 2000.

[23] J. W. Fisher, T. Darrell, W. T. Freeman, and P. A. Viola, "Learning joint statistical models for audio-visual fusion and segregation," in NIPS, 2000.

Fig. 6. The configuration of the meeting room for data set 1. The 4 cameras and 24 microphones are shown with their approximate fields of view. The dimensions of the room are approximately 360 cm x 800 cm.

[24] J. Hershey and J. Movellan, "Audio vision: Using audiovisual synchrony to locate sounds," in NIPS, 2000.

[25] D. N. Zotkin, R. Duraiswami, and L. S. Davis, "Joint audio-visual tracking using particle filters," EURASIP J. Appl. Signal Process., 2002.

[26] D. Gatica-Perez, G. Lathoud, I. McCowan, J. M. Odobez, and D. Moore, "Audio-visual speaker tracking with importance particle filters," 2003.

[27] Y. Chen and Y. Rui, "Real-time speaker tracking using particle filter sensor fusion," Proceedings of the IEEE, 2004.

[28] N. Checka, K. W. Wilson, M. R. Siracusa, and T. Darrell, "Multiple person and speaker activity tracking with a particle filter," in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, 2004.

[29] K. Nickel, T. Gehrig, R. Stiefelhagen, and J. McDonough, "A joint particle filter for audio-visual speaker tracking," in Proceedings of the 7th International Conference on Multimodal Interfaces, 2005.

[30] T. Gehrig, K. Nickel, H. K. Ekenel, U. Klee, and J. McDonough, "Kalman filters for audio-video source localization," IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 2005.

[31] D. Gatica-Perez, G. Lathoud, J. Odobez, and I. McCowan, "Audiovisual probabilistic tracking of multiple speakers in meetings," IEEE Transactions on Audio, Speech and Language Processing, 2007.

[32] K. Bernardin and R. Stiefelhagen, "Audio-visual multi-person tracking and identification for smart environments," in Proceedings of ACM International Conference on Multimedia, 2007.

[33] R. Stiefelhagen, H. K. Ekenel, C. Fugen, P. Gieselmann, H. Holzapfel, F. Kraft, K. Nickel, M. Voit, and A. Waibel, "Enabling multimodal human-robot interaction for the Karlsruhe humanoid robot," IEEE Transactions on Robotics, Oct. 2007.

[34] R. Stiefelhagen, R. Bowers, and J. Fiscus, Multimodal Technologies for Perception of Humans: International Evaluation Workshops CLEAR 2007 and RT 2007 (Lecture Notes in Computer Science).

[35] A. Abad, C. Canton-Ferrer, C. Segura, J. L. Landabaso, D. Macho, J. R. Casas, J. Hernando, M. Pardas, and C. Nadeu, "UPC audio, video and multimodal person tracking systems in the CLEAR evaluation campaign," Proceedings of the First International CLEAR Evaluation Workshop - Multimodal Technologies for Perception of Humans, 2007.

[36] N. Katsarakis, F. Talantzis, A. Pnevmatikakis, and L. Polymenakos, "The AIT 3D audio/visual person tracker for CLEAR 2007," Proceedings of the First International CLEAR Evaluation Workshop - Multimodal Technologies for Perception of Humans, 2007.

[37] S. T. Shivappa, B. D. Rao, and M. M. Trivedi, "An iterative decoding algorithm for fusion of multi-modal information," EURASIP Journal on Advances in Signal Processing - Special Issue on Human-Activity Analysis in Multimedia Data, 2008.

[38] Y. Bar-Shalom and T. E. Fortmann, Tracking and Data Association. Academic Press, 1988.

[39] C. H. Knapp and G. C. Carter, "The generalized correlation method for estimation of time delay," IEEE Transactions on Acoustics, Speech and Signal Processing, 1976.

[40] M. Brandstein and H. Silverman, "A robust method for speech signal time-delay estimation in reverberant rooms," 1997.

[41] A. O'Donovan and R. Duraiswami, "Microphone arrays as generalized cameras for integrated audio visual processing," in Proceedings of IEEE International Conference on Computer Vision and Pattern Recognition, 2007.

[42] L. Bahl, J. Cocke, F. Jelinek, and J. Raviv, "Optimal decoding of linear codes for minimizing symbol error rate," IEEE Transactions on Information Theory, Mar. 1974.

[43] S. Dasgupta and Y. Freund, "Random projection trees for vector quantization," IEEE Transactions on Information Theory, 2009.

[44] D. Mostefa, N. Moreau, K. Choukri, G. Potamianos, S. M. Chu, J. R. Casas, J. Turmo, L. Cristoferetti, F. Tobia, A. Pnevmatikakis, V. Mylonakis, F. Talantzis, S. Burger, R. Stiefelhagen, K. Bernardin, and C. Rochet, "The CHIL audiovisual corpus for lecture and meeting analysis inside smart rooms," Journal on Language Resources and Evaluation, Dec. 2007.

Fig. 7. A snapshot showing the different views from the tracker. Note that at the moment the snapshot was taken, one subject was missed by the tracker due to lack of contrast with the background. He also remained silent during the meeting and was not picked up by the audio localizer either.

Shankar T. Shivappa received his B.Tech. and M.Tech. degrees in Electrical Engineering from the Indian Institute of Technology, Madras, India, in 2004. He is currently a Ph.D. candidate in the Department of Electrical and Computer Engineering at UCSD. His research interests lie in the areas of multimodal signal processing, machine learning, speech and audio processing, and computer vision. He interned at AT&T Labs during the summer of 2005 and at Microsoft Research during the summer of 2009. His paper, co-authored with Mohan Trivedi and Bhaskar Rao, received the best paper award at AVSS 2008. He is currently actively involved in the running of the Smart Space laboratory at the California Institute for Telecommunication and Information Technologies [Cal-IT2], UCSD.

Bhaskar D. Rao received the B.Tech. degree in electronics and electrical communication engineering from the Indian Institute of Technology, Kharagpur, India, in 1979 and the M.S. and Ph.D. degrees from the University of Southern California, Los Angeles, in 1981 and 1983, respectively. Since 1983, he has been with the University of California at San Diego, La Jolla, where he is currently a Professor with the Electrical and Computer Engineering Department and holder of the Ericsson endowed chair in wireless access networks. His interests are in the areas of digital signal processing, estimation theory, and optimization theory, with applications to digital communications, speech signal processing, and human-computer interactions. He is the Director of the Center for Wireless Communications. His research group has received several paper awards. His paper received the best paper award at the 2000 Speech Coding Workshop, and his students have received student paper awards at both the 2005 and 2006 International Conference on Acoustics, Speech and Signal Processing, as well as the best student paper award at NIPS 2006. A paper he co-authored with B. Song and R. Cruz received the 2008 Stephen O. Rice Prize Paper Award in the Field of Communications Systems, and a paper he co-authored with S. Shivappa and M. Trivedi received the best paper award at AVSS 2008. He also received the graduate teaching award from the graduate students in the Electrical Engineering department at UCSD in 1998. He was elected to the IEEE Fellow grade in 2000 for his contributions to high-resolution spectral estimation. Dr. Rao has been a member of the Statistical Signal and Array Processing technical committee, the Signal Processing Theory and Methods technical committee, and the Communications technical committee of the IEEE Signal Processing Society. He currently serves on the editorial board of the EURASIP Signal Processing Journal.

Fig. 8. Snapshots during a meeting (frames 150, 255, 350, 380, 450, and 465) illustrate the active speaker tracking, which highlights the current active speaker by drawing a circle around the head of the associated track.

Mohan Manubhai Trivedi (LF2009) received the B.E. degree (with honors) from the Birla Institute of Technology and Science, Pilani, India, and the Ph.D. degree from Utah State University, Logan. He is currently a Professor of electrical and computer engineering and the Founding Director of the Computer Vision and Robotics Research Laboratory, University of California, San Diego (UCSD), La Jolla. He has established the Laboratory for Intelligent and Safe Automobiles (LISA), UCSD, to pursue a multidisciplinary research agenda. He and his team are currently pursuing research in active vision, visual learning, distributed intelligent systems, human body modeling and movement analysis, multimodal affect analysis, intelligent driver assistance, semantic information analysis, and active safety systems for automobiles. He serves on the Executive Committees of the University of California Digital Media Innovation Program and of the California Institute for Telecommunication and Information Technologies [Cal-IT2] as the Leader of the Intelligent Transportation and Telematics Layer, UCSD. He regularly serves as a consultant to industry and government agencies in the U.S. and abroad. He has given over 50 keynote/plenary talks. Trivedi is serving as an Expert Panelist for the Strategic Highway Research Program (Safety) of the National Academy of Sciences. He is currently an Associate Editor of the IEEE Transactions on Intelligent Transportation Systems. He was the recipient of the Distinguished Alumnus Award from Utah State University, Pioneer (Technical Activities) and Meritorious Service Awards from the IEEE Computer Society, and a number of Best Paper Awards. Trivedi is a Fellow of the IEEE and the SPIE.

Fig. 9. A snapshot from the tracking process on data set 2, which involves subjects moving continuously, resulting in many occlusions, with tracks merging and diverging in the camera views.

TABLE II
RESULTS FROM MID-AVT-UCSD-2 - PERCENTAGE OF OCCLUSIONS THAT ARE CORRECTLY RESOLVED BY THE MID-AVT FRAMEWORK IN COMPARISON WITH THE PARTICLE FILTERING BASED TRACKER. NOTE THAT THE PERFORMANCE OF THE TWO SCHEMES IS VERY SIMILAR.

                                   1 camera   Microphones   1 camera and   2 cameras   2 cameras and
                                                            microphones                microphones
MID-AVT framework
  1 subject                           95%         76%           95%           98%          98%
  2 subjects                          53%         42%           68%           85%          87%
  3 subjects                          38%         40%           65%           80%          83%
  4 subjects                          33%         34%           55%           69%          73%
Particle filter based tracker
  4 subjects                          30%         20%           35%           69%          74%

TABLE III
RESULTS FROM THE MID-AVT-UCSD-2 DATASET (4 SUBJECT CASE) WHEN A RANDOM ROTATION TRANSFORMATION IS APPLIED TO THE CAMERA VIEWS - PERCENTAGE OF OCCLUSIONS THAT ARE CORRECTLY RESOLVED BY THE TRACKER. THIS DEMONSTRATES THAT THE MID-AVT FRAMEWORK IS ROBUST TO SMALL CAMERA CALIBRATION ERRORS.

                                   1 camera   Microphones   1 camera and   2 cameras   2 cameras and
                                                            microphones                microphones
MID-AVT framework                     33%         34%           55%           68%          70%
Particle filter based tracker         20%         20%           35%           49%          54%
