
Audio Visual Cues in Driver Affect Characterization: Issues and Challenges in Developing Robust Approaches

Ashish Tawari and Mohan M. Trivedi
University of California San Diego,
LISA: Laboratory for Intelligent and Safe Automobiles
[email protected], [email protected]

Abstract— Computer vision, speech and machine learning technologies play an important role and are increasingly used in today's vehicles to improve safety as well as comfort in the car. Driving in particular presents a context in which a user's emotional state plays a significant role. Emotions have been found to affect cognitive style and performance; even a mildly positive feeling can have a profound effect on the flexibility and efficiency of thinking and problem solving. In this paper, we review some of the existing approaches for analyzing in-vehicle driver affect using audio and visual cues. We discuss challenges in developing robust systems and hope to provide some insight into the practical realization of such systems. In particular, we present our ongoing efforts in collecting driving data using simulator as well as real-world driving testbeds, and propose a multilevel audio-visual fusion scheme that exploits the contextual information often available from co-existing tasks in an intelligent system.

I. INTRODUCTION

The activity of driving requires the involvement of a variety of human resources. For instance, to drive a vehicle on a busy city road effectively and safely, drivers must coordinate their cognitive, physical and emotional capabilities simultaneously. If one of these is affected, it will invariably influence the others, as a result affecting the entire driving experience. Furthermore, throughout the driving activity drivers simultaneously interact with the vehicle's interface, including the radio, air-conditioning and other on-board equipment. As such, it is essential that the interface is designed appropriately so as not to overload the cognitive, physical or emotional resources of the driver.

II. RESEARCH MOTIVATION

Computer vision, speech and machine learning technologies play an important role and are increasingly used in today's vehicles to improve safety as well as comfort. However, to be effective, such technologies need to be human centric and need to work in a holistic manner that takes into account the different components of the system: the driver (e.g. looking at the driver to recognize driver activity and attention state), the vehicle (e.g. looking at vehicle speed, steering angle, braking), and the vehicle surround (e.g. looking at the road and other cars to understand the surrounding situation) [1], [2]. Among those components, we consider in this paper the part of looking at the driver, which is a very important component in driver assistance systems (it has been shown that a large portion of accidents is caused by human errors like driver inattention or cognitive overload [3]). Therefore it is important for any intelligent vehicle to be able to interface with the driver to relay information (through any number of feedback channels, e.g. auditory, visual, haptic) or detect situations within the vehicle that might lead to unsafe driving (e.g. driver drowsiness or inattention). Often the driver of a vehicle can be inattentive to the driving task, resulting in dangerous driving situations. This inattentiveness can be caused by a variety of circumstances such as cell phone usage, drowsiness, or distractions caused by other occupants.

Another important aspect is the emotional experience of driving and interacting with the vehicle interface, and how to approach the design of vehicle interfaces in order to support positive (and avoid negative) emotional experiences [4], [5]. Driving in particular presents a context in which a user's emotional state plays a significant role. Emotions have been found to affect cognitive style and performance; even a mildly positive feeling can have a profound effect on the flexibility and efficiency of thinking and problem solving. The road-rage phenomenon [6] is a well recognized situation where emotions can directly impact safety. Such a phenomenon can be mitigated, and driving performance improved, if the car actively responds to the emotional state of the driver. Research studies show that matching the in-car voice to the driver's emotional state has a great impact on driving performance [7]. Research on emotion recognition is largely influenced by the basic emotion theory [8]. Most of the existing efforts in this direction aim at the recognition of a subset of basic emotions (happy, sad, surprise, anger, fear, disgust and neutral). In recent years, however, a few studies have focused on certain application-dependent affective states, 'frustration', for example, in the driving context [9]. Another model for articulating emotions is Russell's Core Affect model [10] (Figure 1). Russell suggests categorizing emotion in a two-dimensional space: valence (pleasure-displeasure) and arousal (activation-deactivation). [11] used such a model to analyze the driver's experience during interaction with the vehicle interface while driving in high traffic. The literature also mentions other dimensions; a third dimension often used is dominance, which represents the scale between aggressive and submissive.
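To make the dimensional representation concrete, the sketch below (a minimal illustration; the coordinate values and label set are assumed for demonstration, not taken from the paper) places a few emotion labels in Russell's valence-arousal plane and maps an estimated (valence, arousal) pair to the nearest label.

```python
import math

# Illustrative (assumed) coordinates of a few emotion labels in Russell's
# valence-arousal plane; both axes range roughly from -1 to +1.
CORE_AFFECT = {
    "happy":      ( 0.8,  0.5),   # pleasant, moderately activated
    "angry":      (-0.6,  0.8),   # unpleasant, highly activated
    "frustrated": (-0.5,  0.4),
    "sad":        (-0.7, -0.4),   # unpleasant, deactivated
    "relaxed":    ( 0.6, -0.5),
    "neutral":    ( 0.0,  0.0),
}

def nearest_label(valence: float, arousal: float) -> str:
    """Return the emotion label whose (valence, arousal) point is closest."""
    return min(CORE_AFFECT,
               key=lambda k: math.dist(CORE_AFFECT[k], (valence, arousal)))

if __name__ == "__main__":
    # e.g. a recognizer estimating slightly negative valence and high arousal
    print(nearest_label(-0.4, 0.7))   # closest label, here "angry"
```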


Fig. 1. 2-D approach for emotion analysis: valence-axis (pleasure-displeasure) and arousal-axis (activation-deactivation)

In this paper, we review some of the existing approaches for analyzing in-vehicle driver affect using audio and visual cues. We discuss challenges in developing robust systems and hope to provide some insight into the practical realization of such systems.

III. AUDITORY CUES DURING DRIVING

In recent years, there has been a growing number of speech-driven applications in the car. Current research and attention theory suggest that speech-based interactions are less distracting than interaction with a visual display [12]. A carefully designed, effective speech interface can also be a potential aid for a drowsy driver [13], [14], although [15] suggests that the only safe countermeasure against driving while sleepy is to stop driving. Using voice communication has the following advantages: microphones are nonobtrusive sensors, require no or minimal calibration effort, are robust against extreme environmental conditions (humidity, temperature, and vibrations) and are "hands- and eyes-free"; most importantly, speech data are omnipresent in many daily life situations. There have been a number of studies on recognizing the driver's emotional state via the speech signal to improve both safety and comfort in the car.

A. Vocal expression recognition

In [14], a driving simulator along with other peripherals (microphones and speakers) is employed for the analysis of emotion recognition. The system uses 10 acoustic features including pitch, volume, rate of speech and other spectral coefficients, along with statistical and neural network classifiers, to recognize five emotional groups: boredom, sadness/grief, frustration/extreme anger, happiness and surprise. The recognition performance is assessed by comparing a human emotion transcript against the output of the automatic emotion recognition system using a 2-second analysis window. The human transcripts were generated by experts trained in recognizing affective cues in speech; they listened to the speech soundtrack for each drive and classified it into the same five emotional groups used by the automatic system. The experts themselves did not take part in the driver study. The qualitative analysis suggests that the system is capable of detecting the driver's emotion with sufficient accuracy to support and improve driving. This study, however, ignores the presence of background noise. [16] studies the effects of additive white noise at various Signal-to-Noise Ratios (SNR) on a public emotional database. The authors extract a large set of about 4,000 high-level features from more than 60 base contours of pitch, energy, formants, Harmonic-to-Noise Ratio (HNR), MFCCs etc. from noisy speech. Further, Fast-Information-Gain-Ratio filter selection is used to choose features from the large feature set for each SNR condition. As classifiers, Support Vector Machines (SVMs) are trained for each noisy condition. The preliminary studies showed that the recognition performance depends very much on the SNR values. It is also to be noted that the selected feature sets differ largely across noise levels. [17] extends the noise analysis and shows the feasibility of detecting the driver's emotional state via speech in the presence of various noise scenarios. The speech signal is superimposed with car noise from several car types and various road surfaces. The emotion conveyed in the noisy speech signal is estimated in two steps. First, a set of 20 acoustic features is extracted from the speech signal; these 20 features are determined by a feature selection technique (Sequential Forward Selection) from a pool of 137 features, which includes features related to pitch, energy, duration and timing, and spectral features (such as Mel Frequency Cepstral Coefficients). Second, kernel-based Support Vector Regression (SVR) is used to map the features to the values of three emotion primitives (valence, activation and dominance). Classifiers are trained using clean speech and noisy speech. The latter provides matched training and test conditions and shows better performance; however, it has an obvious disadvantage and is impractical for real-world applications, since it is difficult to ensure identical training and testing environments.
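As a rough illustration of this kind of pipeline, the sketch below extracts a small utterance-level acoustic feature vector (pitch, energy and MFCC statistics) and fits an SVR to a valence primitive. It is a minimal sketch, assuming librosa and scikit-learn are available; the feature set, file names and labels are placeholders, not the actual configuration of [17].

```python
import numpy as np
import librosa
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def acoustic_features(path: str) -> np.ndarray:
    """Extract a small utterance-level feature vector (pitch, energy, MFCC stats)."""
    y, sr = librosa.load(path, sr=16000)
    f0 = librosa.yin(y, fmin=60, fmax=400, sr=sr)        # frame-wise pitch contour
    rms = librosa.feature.rms(y=y)[0]                     # frame-wise energy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)    # spectral shape
    return np.hstack([
        [f0.mean(), f0.std()],
        [rms.mean(), rms.std()],
        mfcc.mean(axis=1), mfcc.std(axis=1),
    ])

# Hypothetical training data: file paths and annotated valence in [-1, 1].
train_files = ["utt01.wav", "utt02.wav"]          # placeholder paths
train_valence = np.array([0.3, -0.6])             # placeholder labels

X = np.vstack([acoustic_features(f) for f in train_files])
model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0))
model.fit(X, train_valence)

# Predict the valence primitive for a new utterance.
print(model.predict(acoustic_features("new_utt.wav").reshape(1, -1)))
```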

[18] presents a framework with adaptive noise cancellation as a front end to a speech emotion recognizer for driver assistance applications. Such a real-world environment poses robustness issues, as the system may be highly susceptible to noise. To study the feasibility of emotion detection while driving, it emulates in-vehicle conditions. To reflect reality adequately, several noise scenarios of approximately 60 seconds each were recorded in the car while driving. Interior noise depends on several factors such as vehicle type, road surface, outside environment etc. In this study, an instrumented Infiniti Q45 on-road testbed (Figure 2) is used to record the interior noise in the following scenarios: highway (FWY), parking lot (PRK) and city street (CST). A Support Vector Machine (SVM) is trained on prosodic and spectral features extracted from clean speech. During the testing phase, clean speech is superimposed with various types of noise at different Signal-to-Noise Ratios (SNR). The framework has shown encouraging improvement for the classification of three emotion categories: positive, negative and neutral.


Fig. 2. The LISA-P experimental testbed used for data collection and evaluation: (a) top view, (b) inside view.

However, to deploy such a system in the real world, there still exists much room for improvement. [19] expands on these findings in the driving scenario and also studies the role of context, such as gender, in improving the classification performance.
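The noise-superimposition step used in such studies can be sketched as follows: mix clean speech with recorded car noise at a chosen SNR before feature extraction. This is a minimal sketch; the file names are placeholders and the soundfile package is assumed for WAV I/O.

```python
import numpy as np
import soundfile as sf   # assumed available for WAV I/O

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Superimpose noise on clean speech at a target SNR (in dB)."""
    noise = np.resize(noise, clean.shape)            # loop/trim noise to match length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + scale * noise

if __name__ == "__main__":
    speech, sr = sf.read("clean_utterance.wav")      # placeholder file names
    car_noise, _ = sf.read("freeway_noise.wav")
    for snr in (20, 10, 5, 0):                        # several SNR test conditions
        sf.write(f"noisy_snr{snr}dB.wav", mix_at_snr(speech, car_noise, snr), sr)
```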

IV. VISUAL CUES DURING DRIVING

A. Facial expression analysis

Because of the importance of the face in emotion expression and perception, most vision-based affect recognition studies focus on facial expression analysis. Existing facial expression recognizers employ various pattern recognition approaches and are based on 2D spatiotemporal facial features. Usually, the extracted facial features are either geometric features, such as the shapes of the facial components (eyes, mouth, etc.) and the locations of facial salient points (corners of the eyes, mouth, etc.), or appearance features representing the facial texture, including wrinkles, bulges, and furrows.

Typical examples of geometric-feature-based methods are those of Chang et al. [20], who used a shape model defined by 58 facial landmarks; of Pantic et al. [21], who used a set of facial characteristic points around the mouth, eyes, eyebrows, nose, and chin; and of Kotsia and Pitas [22], who used the Candide grid. Typical examples of appearance-feature-based methods are those of Bartlett et al. [23], who used Gabor wavelets, and of Whitehill and Omlin [24], who used Haar features. A few studies proposed a hybrid approach, combining both geometric and appearance features for designing automatic facial expression recognizers. One such method is that of Zhang and Ji [25], who used 26 facial points around the eyes, eyebrows, and mouth, together with transient features like crow's-feet wrinkles and nasolabial furrows.
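As a concrete example of simple geometric features, the sketch below computes a few scale-normalized distances from a 68-point landmark set (iBUG/dlib indexing is assumed; the particular distances are illustrative, not those of any cited method).

```python
import numpy as np

def geometric_features(landmarks: np.ndarray) -> np.ndarray:
    """Compute a few scale-normalized geometric features from 68 facial
    landmark coordinates (iBUG/dlib indexing assumed)."""
    d = lambda i, j: np.linalg.norm(landmarks[i] - landmarks[j])
    inter_ocular = d(36, 45)                 # outer eye corners, for scale normalization
    return np.array([
        d(48, 54) / inter_ocular,            # mouth width
        d(51, 57) / inter_ocular,            # mouth opening (upper/lower lip)
        d(37, 41) / inter_ocular,            # left-eye opening
        d(43, 47) / inter_ocular,            # right-eye opening
        d(19, 37) / inter_ocular,            # left eyebrow-to-eye distance
        d(24, 44) / inter_ocular,            # right eyebrow-to-eye distance
    ])

# landmarks: (68, 2) array, e.g. from a dlib shape predictor
example = np.random.rand(68, 2)              # stand-in for detected landmarks
print(geometric_features(example))
```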

Most of these methods are suitable for the analysis of facial expressions only under a small range of head motions and lighting conditions. Thus, most of them focus on the recognition of facial expressions in near-frontal-view recordings. In a real-world environment like the driving scenario, however, such constraints are often invalid. A recent study [26] applies a rigid face shape model to build person-dependent descriptors that are later used to decompose facial pose and expression simultaneously. Pantic and Patras [21] explored automatic analysis of facial expressions from profile-view faces. Other concerns for realizing a practical system are the presence of occlusion (e.g. the driver's hands or other objects might obscure the view of the driver), changing lighting conditions (e.g. entering tunnels or driving under overpasses), as well as real-time performance. [27] develops a pose-invariant real-time system using a thin-plate spline feature vector for analyzing facial expressions (Figure 3). [28] provides multilevel face and facial landmark tracking using a particle filter and a dynamic Bayesian network framework (Figure 4). The system first looks at the driver's head location, then face regions, and so on down to the facial landmark (eye corners, eyebrows, etc.) regions. By using multiple cues, it provides robustness to the varying environmental and lighting conditions and occlusions found inside vehicles. Although the system is applied to driver intent analysis, a similar approach can help in facial expression recognition.
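The particle-filter idea behind such landmark trackers can be sketched as follows. This is a minimal bootstrap filter with a placeholder appearance likelihood, not the implementation of [28]; in practice the likelihood would score an image patch against a learned appearance model, and the full hierarchy would condition landmark states on the tracked head and face-region states.

```python
import numpy as np

rng = np.random.default_rng(0)

def appearance_likelihood(frame, xy):
    # Placeholder: e.g. similarity of the patch around xy to an eye-corner template.
    return 1.0

def track_landmark(frames, init_xy, n_particles=200, motion_std=3.0):
    """Track a single landmark with a bootstrap particle filter."""
    particles = np.tile(init_xy, (n_particles, 1)).astype(float)
    track = []
    for frame in frames:
        # Predict: random-walk motion model
        particles += rng.normal(0.0, motion_std, particles.shape)
        # Update: weight particles by the appearance likelihood (placeholder here)
        weights = np.array([appearance_likelihood(frame, p) for p in particles])
        weights = np.maximum(weights, 1e-12)
        weights /= weights.sum()
        # Estimate (weighted mean) and resample
        track.append(weights @ particles)
        particles = particles[rng.choice(n_particles, n_particles, p=weights)]
    return np.array(track)

frames = [None] * 10                          # stand-ins for video frames
print(track_landmark(frames, np.array([120.0, 80.0]))[-1])
```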

Most of these works on automatic facial expression recognition are based on deliberate and often exaggerated facial displays. Although several efforts have been reported in recent years on the automatic analysis of spontaneous facial expressions, their validation in driving scenarios is still an open area of research.

It is important to mention an overlapping area of research as far as visual expression recognition is concerned: detecting fatigue, a potential cause of inattention and distraction. Many efforts on the detection of driver fatigue have specifically focused on changes and movements in the eyes, which involves assessing changes in the driver's gaze, blink rate and eye closure.


Fig. 3. Face expression recognition at different head rotation [27]

PERCLOS (Percent Eye Closure) has been validated as an accurate psychophysical measure of performance degradation in sleep-deprived subjects. PERCLOS is defined as the proportion of total time that the subject's eyelids are closed 80% or more over a specified period of time. [29] utilizes a structured illumination approach using IR cameras. This simple technique can significantly improve eye tracking robustness and accuracy. However, it also suffers limitations: the success of the approach strongly depends on the brightness and size of the pupils, which are often functions of face orientation, external illumination interference, and the distance of the subject from the camera. For real-world in-vehicle applications, sunlight can interfere with IR illumination, reflections from eyeglasses can create confounding bright spots near the eyes, and sunglasses tend to disturb the IR light and make the pupils appear very weak. An integrated Bayesian approach that specifically addresses problems arising in dynamic situations like driving is presented in [30]. For estimating attention, it incorporates vision-based gaze estimation (looking at the driver) and visual saliency maps (looking at the environment), as well as cognitive models of the relationship between gaze and attention. The study shows the superiority of this approach over methods using only gaze or only saliency.
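The PERCLOS definition above translates directly into a sliding-window computation over a per-frame eye-openness signal, as in the minimal sketch below (the 0.2 openness threshold follows the 80%-closure definition; the frame rate and window length are assumptions).

```python
import numpy as np

def perclos(eye_openness: np.ndarray, fps: float, window_s: float = 60.0) -> np.ndarray:
    """PERCLOS: fraction of time within a sliding window that the eyes are
    at least 80% closed. `eye_openness` is a per-frame value in [0, 1]
    (1 = fully open), e.g. a normalized eyelid distance from an eye tracker."""
    closed = (eye_openness <= 0.2).astype(float)      # 80% or more closed
    win = max(1, int(window_s * fps))
    kernel = np.ones(win) / win
    # Running mean of the 'closed' indicator over a centered window = PERCLOS
    return np.convolve(closed, kernel, mode="same")

# Example: 3 minutes of simulated openness values at an assumed 30 fps
signal = np.clip(np.random.default_rng(1).normal(0.7, 0.3, 30 * 180), 0, 1)
print(perclos(signal, fps=30).max())                  # peak drowsiness indicator
```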

V. TOWARDS EFFECTIVE MULTIMODAL FUSION

Multimodal fusion strategies utilized in the existing research on audiovisual emotion recognition are feature-level, decision-level, or model-level fusion [31]. A good fusion scheme should have a lower error rate than those obtained from the unimodal models, and the performance of such a system should degrade gracefully when noise affects its individual modalities. In feature-level fusion schemes, features from different modalities are concatenated to form joint feature vectors, which are then used to train the emotion recognizer. However, the different time scales and metric levels of features coming from different modalities, as well as the increased feature-vector dimension, influence the performance. Moreover, missing or noisy information from one modality results in a significant reduction in the performance of the recognition system. Decision-level information fusion, on the other hand, models each modality independently, and the unimodal recognition results are combined at the end.

Fig. 4. Dynamic Bayesian network for facial landmark tracking [28]

This offers the advantage of robustness in the case of failure of a single modality. Humans, however, display audio and visual expressions in a complementary and redundant manner. Hence the assumption of conditional independence between the audio and visual data streams in decision-level fusion is incorrect and results in a loss of information about the mutual correlation between the two modalities. Moreover, these two modalities do not couple strongly in time. For example, during an active speech segment, the facial expression (e.g. a smile) may be influenced by speech production and become fully expressive only immediately after the speech content.
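The contrast between feature-level and decision-level fusion can be made concrete with a small sketch (synthetic placeholder data and scikit-learn classifiers; this illustrates the two generic schemes, not the fusion model proposed in this paper).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy synchronized data: per-utterance audio and video feature vectors with a
# shared emotion label (0 = negative, 1 = neutral, 2 = positive). Placeholder data.
rng = np.random.default_rng(0)
n = 300
X_audio = rng.normal(size=(n, 20))      # e.g. prosodic/spectral statistics
X_video = rng.normal(size=(n, 12))      # e.g. geometric facial features
y = rng.integers(0, 3, size=n)

# Feature-level fusion: concatenate modality features into one joint vector.
clf_feat = LogisticRegression(max_iter=1000).fit(np.hstack([X_audio, X_video]), y)

# Decision-level fusion: train each modality independently, then combine
# the unimodal class posteriors (here by simple averaging).
clf_a = LogisticRegression(max_iter=1000).fit(X_audio, y)
clf_v = LogisticRegression(max_iter=1000).fit(X_video, y)

def decision_level_predict(xa, xv):
    posterior = 0.5 * clf_a.predict_proba(xa) + 0.5 * clf_v.predict_proba(xv)
    return posterior.argmax(axis=1)

print(clf_feat.predict(np.hstack([X_audio[:5], X_video[:5]])))
print(decision_level_predict(X_audio[:5], X_video[:5]))
```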

A. Contextual Modeling and Hierarchical Fusion

In the design of an intelligent system, there exist many subtasks which need to be integrated to provide the situational awareness required by the system. For example, a task such as emotion recognition, accomplished by fusing audio and visual cues, can benefit from gender recognition and speaker identification, whereas these tasks can be assisted by face recognition, which in turn can use information from foreground object (face region) detection. Thus, audio-visual fusion occurs at different levels, hence the name hierarchical fusion. When audio-visual fusion is explored in the presence of such co-performed tasks, not only is a hierarchical integration of audio and video cues necessary, it is also beneficial to the performance of the individual tasks, because the output of one kind of task contains valuable contextual information for another task, and by interconnecting them a robust system results. We believe that the design of such a hierarchical framework and context modeling are fundamental for the practical realization of intelligent systems.
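The kind of context-conditioned chaining described above can be sketched as follows, where a gender-recognition output selects a gender-specific emotion model. The component names and dictionary-based selection are hypothetical; gender-dependent modeling itself follows the spirit of [19].

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical upstream outputs: each earlier task supplies context that
# conditions the downstream emotion recognizer.

@dataclass
class Context:
    face_region: object      # from foreground/face detection
    speaker_id: str          # from speaker identification
    gender: str              # from gender recognition: "female" / "male"

def recognize_emotion(audio_feats, video_feats, ctx: Context,
                      gender_models: Dict[str, Callable]) -> str:
    """Hierarchical use of context: pick a gender-specific audio-visual
    emotion model instead of a single generic one."""
    model = gender_models.get(ctx.gender, gender_models["generic"])
    return model(audio_feats, video_feats)

# Placeholder models: in practice these would be trained classifiers.
gender_models = {
    "female":  lambda a, v: "positive",
    "male":    lambda a, v: "neutral",
    "generic": lambda a, v: "neutral",
}

ctx = Context(face_region=None, speaker_id="driver_01", gender="female")
print(recognize_emotion(audio_feats=None, video_feats=None, ctx=ctx,
                        gender_models=gender_models))
```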

VI. NATURALISTIC DRIVING DATASETS

The collection of large-scale real-world driving data is of fundamental importance in the development of intelligent vehicular systems capable of interpreting and dealing with different situations in traffic. Compared to driving simulator recordings, real-world data, although costly, are much more directly related to everyday conditions, providing valuable information on drivers' behaviors in both predicted and unpredicted situations.


Fig. 5. Driving simulator setup of the LISA-S testbed. The testbed includes a PC-based driving simulator, with graphics shown on a 52-inch monitor, audio, and steering-wheel and pedal controllers. The testbed includes a head and eye-gaze tracking system, a vision-based upper-body tracking system as well as a foot-movement monitoring camera. All the data from the gaze and body (including foot) trackers are recorded synchronously with the steering data and other parameters from the driving simulator.

Acquiring naturalistic driving data and the related ground truth, however, is far from a solved problem. [9] proposed a transcription protocol for video, speech, driving behavior and physiological signals collected from 150 drivers. [32] is developing an audiovisual affect database in the challenging ambient conditions of the car, as well as a software tool for synchronized labeling of the ground truth (Figure 6). On the other hand, a simulation environment can provide more flexibility in configuring sensors and designing experimental tasks for a more in-depth analysis of scenarios which might be difficult and/or unsafe to realize in real-world driving. [33] studies the roles of facial features and vehicle features in predicting driving accidents (major/minor) at varying pre-accident intervals (one to four seconds). The authors claim that, in the case of minor accidents, the vehicle features prove more useful close to the accident (one to two seconds before), while the face features are more predictive longer before the accident (three to four seconds). Combining face features and vehicle features, however, provides the best performance in all accident scenarios.

While working with a real-world driving testbed is our ultimate goal, coordination between real-world driving and the simulation environment is useful, and in our ongoing efforts we take this into account when developing our system. Towards this end, we are actively developing a simulator testbed (LISA-S). Figure 5 shows the LISA-S testbed configuration. The main monitor is configured to show a PC-based interactive open-source "racing" simulator, TORCS [34]. The testbed includes an audio system, a head and eye-gaze tracking system as well as a vision-based upper-body and foot tracking system.

Fig. 6. Development platform for audiovisual scene synchronization, cropping and labeling. It has the capability of voice activity detection (VAD) to decide the precise onset of the speech signal [32].

It is capable of recording all the data from the gaze and body trackers synchronously with the steering data and other parameters from the driving simulator.
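A minimal sketch of one way to align asynchronously sampled streams (e.g. gaze and steering) onto a common clock by nearest-timestamp matching follows; the stream names and sampling rates here are assumptions, not the actual LISA-S recording pipeline.

```python
import numpy as np

def align_to_reference(ref_t: np.ndarray, stream_t: np.ndarray,
                       stream_vals: np.ndarray) -> np.ndarray:
    """For each reference timestamp, pick the sample of another stream whose
    timestamp is nearest (simple nearest-neighbor synchronization)."""
    idx = np.searchsorted(stream_t, ref_t)
    idx = np.clip(idx, 1, len(stream_t) - 1)
    left_closer = (ref_t - stream_t[idx - 1]) < (stream_t[idx] - ref_t)
    return stream_vals[np.where(left_closer, idx - 1, idx)]

# Assumed rates: simulator/steering at 60 Hz (reference), gaze tracker at 50 Hz.
t_steer = np.arange(0, 10, 1 / 60)
t_gaze = np.arange(0, 10, 1 / 50)
gaze_yaw = np.sin(t_gaze)                               # stand-in gaze signal
gaze_on_steer_clock = align_to_reference(t_steer, t_gaze, gaze_yaw)
print(gaze_on_steer_clock.shape, t_steer.shape)         # one gaze sample per steering sample
```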

VII. DISCUSSION AND CONCLUDING REMARKS

Emotions affect many cognitive processes highly relevant to driving. The literature answers the question of the optimum emotional state with the statement "happy drivers are better drivers" [35], [36]. Hence, controlling the emotional state can ensure a safer and more pleasant driving experience. The first step in the design of a future intelligent driver assistance system is to provide the ability to detect subtleties of, and changes in, the driver's affective behavior. Moreover, such a system, to be accepted and trusted by the driver, must be human-centered and based on naturally occurring modalities of human communication. Since human perception is multimodal in nature, with speech and vision being the primary senses, significant research effort has been focused on developing intelligent systems with audio and video interfaces.

We presented a series of studies analyzing affective states using audio and video cues, with the aim of bringing out the challenges these systems have to face in real-world environments. The great advantages of the speech modality are low hardware cost and high reliability; moreover, microphones already exist in today's cars. However, not all speech captured by the microphone may be relevant in such an open-microphone scenario, where the auditory channel is susceptible to noise, making recognition tasks more difficult. Further, audio information simply is not continuously present if the driver does not constantly speak, which is a disadvantage for constant and reliable driver monitoring. In contrast, the vision modality is omnipresent. However, it suffers from changing lighting conditions, occlusion and pose variation for emotion recognition approaches. An intelligent approach would be to incorporate both modalities in a multimodal framework. It is evident that the presence of redundant information greatly improves human performance, e.g. reaction time in multitasking driving simulator experiments [37].


In naturalistic audiovisual affective behavior, the temporal structures of the modalities (facial and vocal) and their temporal correlations play an extremely important role. A further significant advantage of using multimodal sensors is the robustness to environment and sensor noise that can be achieved through careful integration of information from different types of sensors. Yet another important observation from human emotion perception is the use of contextual information. Towards this end, we presented a multilevel fusion scheme to exploit the temporal structure in the audio-visual modalities as well as to incorporate contextual information. In our future efforts, we will validate the proposed framework using simulator as well as real-world driving data.

REFERENCES

[1] M. M. Trivedi and S. Y. Cheng, "Holistic sensing and active displays for intelligent driver support systems," IEEE Computer Magazine, 2007.

[2] M. M. Trivedi, T. Gandhi, and J. McCall, "Looking-in and looking-out of a vehicle: Computer-vision-based enhanced vehicle safety," IEEE Transactions on Intelligent Transportation Systems, March 2007.

[3] "World report on road traffic injury prevention: Summary," Technical report, World Health Organization, 2004.

[4] J. Angulo, "The emotional driver: a study of the driving experience and the road context," Master's thesis, Blekinge Institute of Technology, Ronneby, 2007.

[5] M. L. Cummings, R. M. Kilgore, E. Wang, K. Tijerina, and D. S. Kochhar, "Effects of single versus multiple warnings on driver performance," Human Factors, vol. 49, pp. 1097–1106, 2007.

[6] T. E. Galovski and E. B. Blanchard, "Road rage: a domain for psychological intervention?" Aggression and Violent Behavior, vol. 9, no. 2, pp. 105–127, 2004.

[7] I.-M. Jonsson, C. Nass, H. Harris, and L. Takayama, "Matching in-car voice with driver state: Impact on attitude and driving performance," in Proc. of International Driving Symposium on Human Factors in Driver Assessment, Training and Vehicle Design, 2005, pp. 173–181.

[8] P. Ekman, "An argument for basic emotions," Cognition & Emotion, vol. 6, no. 3, pp. 169–200, 1992. [Online]. Available: http://dx.doi.org/10.1080/02699939208411068

[9] L. Malta, P. Angkititrakul, C. Miyajima, and K. Takeda, "Multi-modal real-world driving data collection, transcription, and integration using Bayesian network," in Intelligent Vehicles Symposium, 2008 IEEE, 2008, pp. 150–155.

[10] J. A. Russell, "Core affect and the psychological construction of emotion," Psychological Review, vol. 110, no. 1, pp. 145–172, 2003.

[11] G. R. E., V. Popovic, and S. Bucolo, "Driving experience and the effect of challenging interactions in high traffic context," in Proceedings of the Futureground International Conference, 2004.

[12] H. Lunenfeld, "Human factor considerations of motorist navigation and information systems," in Vehicle Navigation and Information Systems Conference, 1989.

[13] F. Eyben, M. Wollmer, T. Poitschke, B. Schuller, C. Blaschke, B. Farber, and N. Nguyen-Thien, "Emotion on the road: necessity, acceptance, and feasibility of affective computing in the car," Adv. in Hum.-Comp. Int., vol. 2010, pp. 5:1–5:17, January 2010. [Online]. Available: http://dx.doi.org/10.1155/2010/263593

[14] C. M. Jones and I.-M. Jonsson, "Performance analysis of acoustic emotion recognition for in-car conversational interfaces," in HCI (6), 2007, pp. 411–420.

[15] J. Lyznicki, T. Doege, R. Davis, and M. Williams, "Sleepiness, driving, and motor vehicle crashes. Council on Scientific Affairs, American Medical Association," JAMA, vol. 279, no. 23, pp. 1908–13, 1998.

[16] B. Schuller, D. Arsic, F. Wallhoff, and G. Rigoll, "Emotion recognition in the noise applying large acoustic feature sets," in Proceedings of Speech Prosody, 2006.

[17] M. Grimm, K. Kroschel, H. Harris, C. Nass, B. Schuller, G. Rigoll, and T. Moosmayr, "On the necessity and feasibility of detecting a driver's emotional state while driving," in ACII, 2007, pp. 126–138.

[18] A. Tawari and M. M. Trivedi, "Speech emotion analysis in noisy real world environment," in Proceedings of the 20th International Conference on Pattern Recognition (ICPR), 2010.

[19] A. Tawari and M. Trivedi, "Speech based emotion classification framework for driver assistance system," in Intelligent Vehicles Symposium (IV), 2010 IEEE, 2010, pp. 174–178.

[20] Y. Chang, C. Hu, R. Feris, and M. Turk, "Manifold based analysis of facial expression," J. Image and Vision Computing, vol. 24, pp. 150–155, 2006.

[21] M. Pantic and I. Patras, "Dynamics of facial expression: Recognition of facial actions and their temporal segments from face profile image sequences," IEEE Trans. Systems, Man, and Cybernetics Part B, vol. 36, pp. 433–449, 2006.

[22] I. Kotsia and I. Pitas, "Facial expression recognition in image sequences using geometric deformation features and support vector machines," IEEE Trans. Image Processing, vol. 16, pp. 172–187, 2007.

[23] M. Bartlett, G. Littlewort, P. Braathen, T. Sejnowski, and J. Movellan, "A prototype for automatic recognition of spontaneous facial actions," Advances in Neural Information Processing Systems, vol. 15, pp. 1271–1278, 2003.

[24] J. Whitehill and C. Omlin, "Haar features for FACS AU recognition," Proc. IEEE Int'l Conf. Automatic Face and Gesture Recognition (AFGR 06), pp. 217–222, 2006.

[25] Y. Zhang and Q. Ji, "Active and dynamic information fusion for facial expression understanding from image sequences," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 27, pp. 699–714, 2005.

[26] S. Kumano, K. Otsuka, J. Yamato, E. Maeda, and Y. Sato, "Pose-invariant facial expression recognition using variable-intensity templates," International Journal of Computer Vision, vol. 83, pp. 178–194, 2009.

[27] J. C. McCall and M. M. Trivedi, "Pose invariant affect analysis using thin-plate splines," Pattern Recognition, International Conference on, vol. 3, pp. 958–964, 2004.

[28] J. McCall and M. M. Trivedi, "Driver monitoring for a human-centered driver assistance system," in Proceedings of the 1st ACM International Workshop on Human-Centered Multimedia, ser. HCM '06. New York, NY, USA: ACM, 2006, pp. 115–122. [Online]. Available: http://doi.acm.org/10.1145/1178745.1178764

[29] R. Grace, "Drowsy driver monitor and warning system," in International Driving Symposium on Human Factors in Driver Assessment, Training and Vehicle Design, August 2001.

[30] A. Doshi and M. Trivedi, "Attention estimation by simultaneous observation of viewer and view," in Computer Vision and Pattern Recognition Workshops (CVPRW), 2010 IEEE Computer Society Conference on, June 2010, pp. 21–27.

[31] S. Shivappa, M. Trivedi, and B. Rao, "Audiovisual information fusion in human computer interfaces and intelligent environments: A survey," Proceedings of the IEEE, vol. 98, no. 10, pp. 1692–1715, Oct. 2010.

[32] A. Tawari and M. M. Trivedi, "Context analysis in speech emotion recognition," IEEE Transactions on Multimedia, 2010.

[33] M. Jabon, J. Bailenson, E. Pontikakis, and C. Nass, "Facial expression analysis for predicting unsafe driving behavior," Pervasive Computing, IEEE, vol. PP, no. 99, p. 1, 2010.

[34] "The open racing car simulator website." [Online]. Available: http://torcs.sourceforge.net

[35] C. M. Jones and I.-M. Jonsson, "Automatic recognition of affective cues in the speech of car drivers to allow appropriate responses," in Proceedings of the 17th Australia Conference on Computer-Human Interaction: Citizens Online: Considerations for Today and the Future, ser. OZCHI '05. Narrabundah, Australia: Computer-Human Interaction Special Interest Group (CHISIG) of Australia, 2005, pp. 1–10. [Online]. Available: http://portal.acm.org/citation.cfm?id=1108368.1108397

[36] G. Underwood, "Understanding driving: applying cognitive psychology to a complex everyday task. John A. Groeger. Psychology Press, Hove, 2000. No. of pages 254. ISBN 0-415-18752-4. Price 45.00 (hardback)," Applied Cognitive Psychology, vol. 16, no. 3, pp. 363–365, 2002. [Online]. Available: http://dx.doi.org/10.1002/acp.810

[37] J. Levy and H. Pashler, "Task prioritisation in multitasking during driving: opportunity to abort a concurrent task does not insulate braking responses from dual-task slowing," Applied Cognitive Psychology, vol. 22, no. 4, pp. 507–525, 2008. [Online]. Available: http://dx.doi.org/10.1002/acp.1378
