
Pattern Recognition 36 (2003) 259–275, www.elsevier.com/locate/patcog

Automatic facial expression analysis: a survey

B. Fasel a,∗, Juergen Luettin b

a IDIAP—Dalle Molle Institute for Perceptual Artificial Intelligence, Rue du Simplon 4, CH-1920 Martigny, Switzerland
b Ascom Systec AG, Applicable Research and Technology, Gewerbepark, CH-5506 Maegenwil, Switzerland

Received 15 May 2001; accepted 15 February 2002

Abstract

Over the last decade, automatic facial expression analysis has become an active research area that finds potential applications in areas such as more engaging human–computer interfaces, talking heads, image retrieval and human emotion analysis. Facial expressions reflect not only emotions, but also other mental activities, social interaction and physiological signals. In this survey, we introduce the most prominent automatic facial expression analysis methods and systems presented in the literature. Facial motion and deformation extraction approaches as well as classification methods are discussed with respect to issues such as face normalization, facial expression dynamics and facial expression intensity, but also with regard to their robustness towards environmental changes. © 2002 Pattern Recognition Society. Published by Elsevier Science Ltd. All rights reserved.

Keywords: Facial expression recognition; Facial expression interpretation; Emotion recognition; Affect recognition; FACS

1. Introduction

Facial expression analysis goes well back into the nineteenth century. Darwin [1] demonstrated already in 1872 the universality of facial expressions and their continuity in man and animals, and claimed, among other things, that there are specific inborn emotions which originated in serviceable associated habits. In 1971, Ekman and Friesen [2] postulated six primary emotions, each of which possesses a distinctive content together with a unique facial expression. These prototypic emotional displays are also referred to as so-called basic emotions. They seem to be universal across human ethnicities and cultures and comprise happiness, sadness, fear, disgust, surprise and anger. In the past, facial expression analysis was primarily a research subject for psychologists, but already in 1978, Suwa et al. [3] presented a preliminary investigation on automatic facial expression analysis from an image sequence. In the nineties, automatic facial expression analysis research gained momentum,

∗ Corresponding author.
E-mail addresses: [email protected] (B. Fasel), [email protected] (J. Luettin).

starting with the pioneering work of Mase and Pentland [4]. The reasons for this renewed interest in facial expressions are multiple, but mainly due to advancements accomplished in related research areas such as face detection, face tracking and face recognition, as well as the recent availability of relatively cheap computational power. Various applications using automatic facial expression analysis can be envisaged in the near future, fostering further interest in doing research in different areas, including image understanding, psychological studies, facial nerve grading in medicine [5], face image compression and synthetic face animation [6], video-indexing, robotics as well as virtual reality. Facial expression recognition should not be confused with human emotion recognition, as is often done in the computer vision community. While facial expression recognition deals with the classification of facial motion and facial feature deformation into abstract classes that are purely based on visual information, human emotions are a result of many different factors and their state might or might not be revealed through a number of channels such as emotional voice, pose, gestures, gaze direction and facial expressions. Furthermore, emotions are not the only source of facial expressions, see Fig. 1. In contrast to facial expression recognition, emotion recognition is an interpretation attempt and


often demands understanding of a given situation, together with the availability of full contextual information.

Fig. 1. Sources of facial expressions. The diagram groups sources such as felt and unfelt emotions, other mental states (e.g. conviction, cogitation), physiological activities (e.g. pain, tiredness), verbal and non-verbal communication (e.g. listener responses, illustrators, regulators, emblems, social winks) and manipulators.

2. Facial expression measurement

Facial expressions are generated by contractions of facial muscles, which result in temporarily deformed facial features such as eyelids, eyebrows, nose, lips and skin texture, often revealed by wrinkles and bulges. Typical changes of muscular activity are brief, lasting for a few seconds, but rarely more than 5 s or less than 250 ms. We would like to accurately measure facial expressions and therefore need a useful terminology for their description. Of importance are the location of facial actions, their intensity as well as their dynamics. Facial expression intensities may be measured by determining either the geometric deformation of facial features or the density of wrinkles appearing in certain face regions. For example, the degree of a smile is communicated by the magnitude of cheek and lip corner raising as well as wrinkle displays. Since there are inter-personal variations with regard to the amplitudes of facial actions, it is difficult to determine absolute facial expression intensities without referring to the neutral face of a given subject. Note that the intensity measurement of spontaneous facial expressions is more difficult in comparison to posed facial expressions, which are usually displayed with an exaggerated intensity and can thus be identified more easily. Not only the nature of the deformation of facial features conveys meaning, but also the relative timing of facial actions as well as their temporal evolution. Static images do not clearly reveal subtle changes in faces and it is therefore essential to also measure the dynamics of facial expressions. Although the importance of correct timing is widely accepted, only a few studies have investigated this aspect systematically, mostly for smiles [7]. Facial expressions can be described with the aid of three temporal parameters: onset (attack), apex (sustain) and offset (relaxation). These can be obtained from human coders, but often lack precision. Few studies relate to the problem of automatically computing the onset and offset of facial expressions, especially when not relying on intrusive approaches such as facial EMG [8]. There are two main methodological approaches to measuring the aforementioned three characteristics of facial expressions, namely message judgment-based and sign vehicle-based approaches [9]. The former directly associate specific facial patterns with mental activities, while the latter represent facial actions in a coded way, prior to eventual interpretation attempts.
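To make these temporal parameters concrete, the following minimal sketch (plain NumPy; the intensity curve, threshold and fractions are illustrative assumptions, not taken from the survey) segments a per-frame facial action intensity signal into onset, apex and offset phases:

import numpy as np

def segment_phases(intensity, active_thresh=0.2, apex_frac=0.9):
    """Split a per-frame intensity curve into onset/apex/offset index ranges.

    intensity    : 1D array, one facial-action intensity value per frame
    active_thresh: intensity above which the action is considered present
    apex_frac    : fraction of the peak intensity treated as the apex plateau
    """
    intensity = np.asarray(intensity, dtype=float)
    active = np.flatnonzero(intensity >= active_thresh)
    if active.size == 0:
        return None  # the action never becomes visible
    start, end = active[0], active[-1]
    plateau = np.flatnonzero(intensity >= apex_frac * intensity.max())
    apex_start, apex_end = plateau[0], plateau[-1]
    return {
        "onset":  (start, apex_start),     # attack: rise towards the peak
        "apex":   (apex_start, apex_end),  # sustain: near-maximum plateau
        "offset": (apex_end, end),         # relaxation: decay back to neutral
    }

# Illustrative smile-like intensity curve sampled at 25 fps (synthetic data).
t = np.linspace(0, 2.0, 50)
curve = np.clip(np.sin(np.pi * t / 2.0), 0, None)
print(segment_phases(curve))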

2.1. Judgment-based approaches

Judgment-based approaches are centered around the messages conveyed by facial expressions. When classifying facial expressions into a predefined number of emotion or mental activity categories, the agreement of a group of coders is taken as ground truth, usually by computing the average of the responses of either experts or non-experts. Most automatic facial expression analysis approaches found in the literature attempt to directly map facial expressions into one of the basic emotion classes introduced by Ekman and Friesen [2].

2.2. Sign-based approaches

With sign vehicle-based approaches, facial motion and deformation are coded into visual classes. Facial actions are hereby abstracted and described by their location and intensity. Hence, a complete description framework would ideally contain all possible perceptible changes that may occur on a face. This is the goal of the facial action coding system (FACS), which was developed by Ekman and Friesen [10] and has been considered a foundation for describing facial expressions. It is appearance-based and thus does not convey any information about, e.g., mental activities associated with expressions. FACS uses 44 action units (AUs) for the description of facial actions with regard to their location as well as their intensity, the latter with either three or five levels of magnitude. Individual expressions may be modeled by single action units or action unit combinations. Similar coding schemes are EMFACS [11], MAX [12] and AFFEX [13]; however, they are directed towards emotions only. Finally, the MPEG-4-SNHC [6] is a standard that encompasses analysis, coding [14] and animation of faces (talking heads) [15]. Instead of describing facial actions only with the aid of purely descriptive AUs, scores of sign-based approaches may be interpreted by employing facial expression dictionaries. Friesen and Ekman introduced such a dictionary for the FACS framework [16]. Ekman et al. [17] also presented a database called the facial action coding system affect interpretation database (FACSAID), which makes it possible to translate emotion-related FACS scores into affective meanings. Emotion interpretations were provided by several experts, but only agreed affects were included in the database.
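As a concrete illustration of sign-based coding, the sketch below (hypothetical Python data structures, not an implementation of FACS or FACSAID) represents a FACS score as a set of AU numbers with intensity grades and looks the combination up in a tiny, purely illustrative expression dictionary; the AU-to-label entries are common textbook examples only:

from dataclasses import dataclass

# FACS grades intensity on an A (trace) to E (maximum) scale.
INTENSITY_GRADES = "ABCDE"

@dataclass(frozen=True)
class ActionUnit:
    number: int      # e.g. 12 = lip corner puller
    intensity: str   # one of INTENSITY_GRADES

# Tiny illustrative dictionary mapping AU combinations to interpretations;
# a real dictionary such as FACSAID contains far more entries.
EXPRESSION_DICTIONARY = {
    frozenset({6, 12}): "happiness",
    frozenset({1, 4}): "sadness / worry brows",
    frozenset({1, 2, 5, 26}): "surprise",
}

def interpret(score):
    """Map a list of ActionUnit scores to a label, if the dictionary knows it."""
    key = frozenset(au.number for au in score)
    return EXPRESSION_DICTIONARY.get(key, "no dictionary entry (recognition only)")

score = [ActionUnit(6, "C"), ActionUnit(12, "D")]
print(interpret(score))   # -> happiness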

2.3. Reliability of ground truth coding

The labeling of employed databases determines not only whether a given system attempts to recognize or interpret facial expressions, but may also influence the achievable recognition accuracy, especially when it comes to facial expression timing and intensity estimation. Furthermore, the chosen classification scheme affects the design of facial expression classifiers, e.g. it influences the number and nature of facial action categories that have to be treated. According to Ekman [18], there are several points that need to be addressed when measuring facial expressions: (a) a separate agreement index about the scoring of specific facial actions, as typically some actions are easier to recognize than others, (b) spontaneous rather than posed facial actions, (c) various subjects including infants, children, adults and aged populations, (d) limiting the disagreement in the judgment of facial actions by providing a minimal intensity threshold of facial actions, (e) inclusion of both experts and beginners for the measurement of facial actions, and (f) the reliability should be reported not only for the type, but also for the intensity and dynamics of facial actions. These points can probably be fulfilled more easily with sign-based than with judgment-based approaches, as the latter can only provide a limited labeling accuracy. For example, within a single basic emotion category, there is too much room for interpretation. Furthermore, cross-cultural studies have shown that the judgment of facial expressions is also culturally dependent and partially influenced by learned display rules [19]. Even though the aforementioned basic emotions are universal across cultures, the assessment is hampered if the encoder and decoder are of different cultures [20]. Sign-based coding schemes, on the other hand, increase objectivity, as coders are only required to record specific concerted facial components instead of performing facial expression interpretation. An advantage of sign-based methods is also the possibility of decomposing facial expression recognition and facial expression interpretation. Hence, the performance of the employed analysis methods may be evaluated directly with regard to their visual performance.

3. Automatic facial expression analysis

Automatic facial expression analysis is a complex task, as the physiognomies of faces vary quite considerably from one individual to another due to differences in age, ethnicity, gender, facial hair, cosmetic products and occluding objects such as glasses and hair. Furthermore, faces appear disparate because of pose and lighting changes. Variations such as these have to be addressed at different stages of an automatic facial expression analysis system, see Fig. 2. We take a closer look at the individual processing stages in the remainder of this section.

3.1. Face acquisition

Ideally, a face acquisition stage features an automatic face detector that can locate faces in complex scenes with cluttered backgrounds. Certain face analysis methods need the exact position of the face in order to extract facial features of interest, while others work even if only the coarse location of the face is available. This is the case with, e.g., active appearance models [21]. Hong et al. [22] used the PersonSpotter system by Steffens et al. [23] in order to perform real-time tracking of faces. The exact face dimensions were then obtained by fitting a labeled graph onto the bounding box containing the face previously detected by the PersonSpotter system. Essa and Pentland [24] located faces by using the view-based and modular eigenspace method of Pentland et al. [25]. Face analysis is complicated by face appearance changes caused by pose variations and illumination changes. It might therefore be a good idea to normalize acquired faces prior to their analysis:

• Pose: The appearance of facial expressions depends on the angle and distance at which a given face is observed. Pose variations occur due to scale changes as well as in-plane and out-of-plane rotations of faces. Especially out-of-plane rotated faces are difficult to handle, as the perceived facial expressions are distorted in comparison to frontal face displays or may even become partly invisible. Limited out-of-plane rotations can be addressed by warping techniques, where the center positions of distinctive facial features such as the eyes, nose and mouth serve as reference points in order to normalize test faces according to some generic face model, e.g. see Ref. [24]; a minimal sketch of such reference-point warping is given below. Scale changes of faces may be tackled by scanning images at several resolutions in order to determine the size of present faces, which can then be normalized accordingly [24,26].

• Illumination: A common approach for reducing lighting variations is to filter the input image with Gabor wavelets, e.g. see Ref. [27]. The problem of partially lit faces is difficult to solve. It has been addressed for the task of face recognition by Belhumeur et al. [28], but not yet sufficiently for facial expression analysis. Finally, specular reflections on eyes, teeth and wet skin may be addressed by using brightness models [29].

Note that even though face normalization may be a reasonable approach in conjunction with some face analysis approaches, it is not mandatory, as long as extracted feature parameters are normalized prior to their classification. Indeed, appearance-based model [21] and local motion model [30] approaches have dealt with significant out-of-plane rotations without relying on face normalization.
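The reference-point warping mentioned under Pose can be sketched as follows; this is a minimal illustration assuming OpenCV is available and that eye and mouth centers have already been located by an earlier detection stage (the canonical target positions are arbitrary choices, not values from the cited work):

import cv2
import numpy as np

def normalize_face(image, eye_left, eye_right, mouth, size=128):
    """Warp a face so that eye and mouth centers land on canonical positions.

    image                      : BGR or grayscale face image (NumPy array)
    eye_left, eye_right, mouth : (x, y) centers located by some earlier stage
    size                       : side length of the square, normalized output
    """
    src = np.float32([eye_left, eye_right, mouth])
    # Canonical positions of the three reference points in the output image
    # (fractions chosen for illustration only).
    dst = np.float32([
        [0.30 * size, 0.35 * size],
        [0.70 * size, 0.35 * size],
        [0.50 * size, 0.75 * size],
    ])
    M = cv2.getAffineTransform(src, dst)           # 2x3 affine matrix
    return cv2.warpAffine(image, M, (size, size))  # scale/rotation/translation removed

# Example (coordinates are placeholders for detector output):
# face = cv2.imread("face.png")
# aligned = normalize_face(face, (210, 240), (300, 238), (255, 330))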


Fig. 2. Generic facial expression analysis framework: face acquisition (localization, normalization, segmentation), facial feature extraction (deformation- or motion-based), facial feature representation, classification, recognition by coding into facial actions, and interpretation by mapping coded facial actions into basic emotion or mental activity categories. The encircled numbers are used in the system diagrams presented further below and indicate relevant processing stages. Note that the face normalization, face segmentation and facial feature representation stages are only necessary in conjunction with some specific facial feature extraction and classification methods.

Table 1
Facial feature extraction methods: overview of prominent deformation and motion extraction methods used for the task of facial expression analysis

Deformation extraction:
- Image-based. Holistic methods: neural networks [27,31,32], Gabor wavelets [27,31], PCA + neural networks [35,36]. Local methods: intensity profiles [33], high gradient components [34].
- Model-based. Holistic methods: active appearance models [21,37,38], point distribution models [40], labeled graphs [22,42,43]. Local methods: geometric face model [39], two-view point-based models [41].

Motion extraction:
- Dense optical flow. Holistic methods: dense flow fields [33,34]. Local methods: region-based flow [4,44,45].
- Motion models. Holistic methods: 3D motion models [24,46], 3D deformable models [48]. Local methods: parametric motion models [30,47], 3D motion models [49].
- Feature point tracking. Local methods: feature tracking [34,50–53].
- Difference-images. Holistic methods: holistic difference-images [27,33,54,55]. Local methods: region-based difference-images [56].
- Marker-based. Local methods: highlighted facial features [57], dot markers [3,58].

3.2. Feature extraction and representation

Feature extraction methods can be categorized according to whether they focus on motion or deformation of faces and facial features, and according to whether they act locally or holistically. Table 1 gives an overview of methods that have been employed by the computer vision community for the task of facial expression analysis.

3.2.1. Local versus holistic approaches
Facial feature processing may take place either holistically, where the face is processed as a whole, or locally, by focusing on facial features or areas that are prone to change with facial expressions. We can distinguish two types of facial features:

• Intransient facial features are always present in the face, but may be deformed due to facial expressions. Among these, the eyes, eyebrows and the mouth are mainly involved in facial expression displays. Tissue texture, facial hair as well as permanent furrows constitute other types of intransient facial features that influence the appearance of facial expressions.

• Transient facial features encompass different kinds of wrinkles and bulges that occur with facial expressions. Especially the forehead and the regions surrounding the mouth and the eyes are prone to contain transient facial


features. The opening and closing of the eyes and the mouth may furthermore lead to iconic changes [29], i.e. local changes of texture that cannot be predicted from preceding frames.

Face segmentation makes it possible to isolate transient and intransient features within faces, or can be used to separate faces of interest from the background. Segmentation boundaries were often determined heuristically, with guidance provided by the a priori knowledge of human observers. Note that holistic feature extraction methods are good at determining prevalent facial expressions, whereas local methods are able to detect subtle changes in small areas. The latter are especially suitable for rule-based interpretation attempts, e.g. see Ref. [41].

3.2.2. Deformation versus motion-based approaches
Motion extraction approaches directly focus on facial changes occurring due to facial expressions, whereas deformation-based methods have to rely on neutral face images or face models in order to extract facial features that are relevant to facial actions and not caused by, e.g., intransient wrinkles due to old age. In contrast to motion-based approaches, deformation-based methods can be applied to both single images and image sequences, in the latter case by processing frames independently of each other. However, deformation-based feature extractors miss low-level directional flow information, i.e. they cannot reconstruct pixel motion. Nonetheless, high-level motion of intransient facial features may be inferred by using, e.g., face and facial feature models that allow possible flow directions to be estimated.

3.2.3. Image versus model-based approaches
Image-based methods extract features from images without relying on extensive knowledge about the object of interest. They have the advantage of being typically fast and simple. However, image-based approaches can become unreliable and unwieldy when many different views of the same object must be considered. The facial structure can also be described with the aid of 2D or 3D face models. The former model facial features and faces based on their appearance, without attempting to recover the volumetric geometry of the scene, e.g. see Ref. [21]. There are two types of 3D models, namely muscle and motion models [24,59]. The latter can significantly improve the precision of motion estimations, since only physically possible motion is considered. However, 3D models often require complex mapping procedures that generate heavy computational requirements. In addition, accurate head and face models have to be constructed manually, which is a tedious undertaking.

3.2.4. Appearance versus muscle-based approaches
In contrast to appearance-based image and 2D model approaches, where processing focuses on the effects of facial muscle activities, muscle-based frameworks attempt to infer muscle activities from visual information. This may be achieved, e.g., by using 3D muscle models that allow extracted optical flow to be mapped into muscle actions [59,60]. Modeled facial motion can hereby be restricted to muscle activations that are allowed by the muscle framework, giving control over possible muscle contraction, relaxation and orientation properties. However, the musculature of the face is complex, 3D information is not readily present and muscle motion is not directly observable. For example, there are at least 13 groups of muscles involved in lip movements alone [61]. Mase and Pentland [4] did not use complex 3D models to determine muscle activities. Instead, they translated 2D motion in predefined windows directly into a coarse estimate of muscle activity. Muscle-based approaches are not only well suited for the recognition of facial expressions, but are also used to animate synthetic faces.

3.3. Deformation extraction

Deformations of facial features are characterized by shape and texture changes and lead to high spatial gradients that are good indicators of facial actions; they may be analyzed either in the image domain or in the spatial frequency domain. The latter can be computed by high-pass gradient or Gabor wavelet-based filters, which closely model the receptive field properties of cells in the primary visual cortex [62,63]. They make it possible to detect line endings and edge borders over multiple scales and with different orientations. These features reveal much about facial expressions, as both transient and intransient facial features often give rise to a contrast change with regard to the ambient facial tissue. As mentioned before, Gabor filters also remove most of the variability in images that occurs due to lighting changes. They have been shown to perform well for the task of facial expression analysis and were used in image-based approaches [27,31,33] as well as in combination with labeled graphs [22,42,43]. For an illustration of Gabor filters applied to face images see Fig. 3; a small filter-bank sketch is also given after the following list. We can distinguish local and holistic image-based deformation extraction approaches:

• Holistic image-based approaches: Several authors have taken either whole faces [27,31,32] or Gabor wavelet-filtered whole faces [27,31] as features. The main emphasis is hereby put on the classifier, which has to deal not only with face physiognomies, but, in the case of image-domain-based face processing, also with lighting variations. Common to most holistic face analysis approaches is the need for a thorough face–background separation in order to prevent disturbances caused by clutter.

• Local image-based approaches: Padgett and Cottrell [35] as well as Cottrell and Metcalfe [36] extracted facial expressions from windows placed around intransient facial feature regions (both eyes and the mouth) and employed local principal component analysis (PCA) for representation purposes. Local transient facial features such as wrinkles can be measured by using image intensity profiles along segments [33] or by determining the density of high gradient components over windows of interest [34].

Fig. 3. Facial feature extraction using Gabor wavelets: shown are two distinct facial expression displays (AU 1 + AU 4, and AU 12) on the left-hand side with the corresponding Gabor representations on the right-hand side. The latter were obtained by convolving the face images with four differently oriented wavelet kernels at one given resolution.
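A small sketch of such Gabor wavelet filtering, assuming OpenCV and a grayscale face image; the four orientations at a single scale mirror Fig. 3, and all filter parameters are illustrative rather than values used in the cited systems:

import cv2
import numpy as np

def gabor_responses(gray, ksize=21, sigma=4.0, lambd=10.0, gamma=0.5):
    """Convolve a grayscale face image with four differently oriented
    Gabor kernels (one scale), as in Fig. 3; parameters are illustrative."""
    responses = []
    for theta in np.arange(0, np.pi, np.pi / 4):   # 0, 45, 90, 135 degrees
        kernel = cv2.getGaborKernel((ksize, ksize), sigma, theta,
                                    lambd, gamma, psi=0, ktype=cv2.CV_32F)
        responses.append(cv2.filter2D(gray.astype(np.float32), cv2.CV_32F, kernel))
    return np.stack(responses)                     # shape: (4, H, W)

# A Gabor "jet" at an image point is simply the vector of filter responses there:
# jets = np.abs(gabor_responses(face_gray))[:, y, x]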

Model-based approaches constitute an alternative to image-based deformation extraction. Appearance-based model approaches allow different information sources, such as facial illumination and deformation changes, to be separated fairly well. Lanitis et al. [21] interpreted face images by employing active appearance models (AAM) [37,38]. Faces were analyzed by a dual approach, using both shape and texture models. Active shape models (ASM) make it possible to simultaneously determine shape, scale and pose by fitting an appropriate point distribution model (PDM) to the object of interest, see Fig. 4. A drawback of appearance-based models is the manual labor necessary for the construction of the shape models. The latter are based on landmark points that need to be precisely placed around intransient facial features during the training of the models. Huang and Huang [40] used a point distribution model to represent the shape of a face, where shape parameters were estimated by employing a gradient-based method. Another type of holistic face model is the so-called labeled graph, which is comprised of sparsely distributed fiducial feature points [22,42,43]. The nodes of these feature graphs consist of Gabor jets, where each component of a jet is the filter response of a specific Gabor wavelet extracted at a given image point. A labeled graph is matched to a test face by varying its scale and position. The obtained graph can then be compared to reference graphs in order to determine the facial expression display at hand. Kobayashi and Hara [39] used a geometric face model consisting of 30 facial characteristic points (FCP). They measured the intensity distribution along 13 vertical lines crossing the FCPs with the aid of a neural network. Finally, Pantic and Rothkrantz [41] used a 2D point-based model composed of both frontal and side views. Multiple feature detectors were applied redundantly in order to localize the contours of prominent facial features prior to their modeling.

Fig. 4. Facial feature representation using active shape models (ASM): the first row shows manually placed point models (PM) that were employed to create a point distribution model (PDM), represented by a few discrete instances of two point distribution modes shown in row two (mode 1) and row three (mode 2), with intensities ranging from −3 to +3. The point distribution modes were computed using the active shape model toolbox implemented by Matthews [64].
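The point distribution model idea can be sketched with plain NumPy as follows; this assumes landmark sets that have already been aligned for scale, rotation and translation, and is only an illustration of the PCA-on-shapes step, not of the cited ASM toolbox:

import numpy as np

def build_pdm(shapes, n_modes=2):
    """Build a point distribution model from pre-aligned landmark sets.

    shapes : array of shape (n_examples, n_points, 2); landmarks already
             normalized for scale, rotation and translation (e.g. by
             Procrustes alignment, omitted here for brevity)
    Returns the mean shape, the first n_modes deformation modes and their variances.
    """
    X = shapes.reshape(len(shapes), -1)       # flatten to (n_examples, 2*n_points)
    mean = X.mean(axis=0)
    _U, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    modes = Vt[:n_modes]                      # principal shape variation modes
    variances = (S[:n_modes] ** 2) / (len(shapes) - 1)
    return mean, modes, variances

def synthesize(mean, modes, variances, b):
    """Generate a plausible shape from mode weights b (in standard deviations)."""
    x = mean + (np.asarray(b) * np.sqrt(variances)) @ modes
    return x.reshape(-1, 2)

# Illustrative use with random stand-in data (real input: annotated face shapes):
rng = np.random.default_rng(0)
shapes = rng.normal(size=(50, 30, 2))          # 50 training shapes, 30 landmarks each
mean, modes, var = build_pdm(shapes)
shape_minus3 = synthesize(mean, modes, var, [-3, 0])  # mode 1 at -3 std devs, cf. Fig. 4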

3.4. Motion extraction

Among the motion extraction methods that have been used for the task of facial expression analysis we find dense optical flow, feature point tracking and difference-images. Dense optical flow has been applied both locally and holistically:

• Holistic dense optical flow approaches allow for whole-face analysis and were employed, e.g., in Refs. [33,34]. Lien [34] analyzed holistic face motion with the aid of wavelet-based, multi-resolution dense optical flow. For a more compact representation of the resulting flow fields, they computed PCA-based eigenflows in both horizontal and vertical directions. Fig. 5 shows sample dense optical flow fields computed from two facial expression sequences.

• Local dense optical flow: Region-based dense optical flow was used by Mase and Pentland [4] in order to estimate the activity of 12 of the total of 44 facial muscles. For each muscle, a window in the face image was defined as well as an axis along which each muscle expands and contracts. Dense optical flow motion was quantized into eight directions and allowed for a coarse estimation of muscle activity. Otsuka and Ohya [44] estimated facial motion in local regions surrounding the eyes and the mouth. Feature vectors were obtained by taking 2D Fourier transforms of the vertical and horizontal optical flow fields. Yoneyama et al. [45] divided normalized test faces into 8 × 10 regions, where local dense optical flow was computed and quantized region-wise into ternary feature vectors (+1/0/−1), indicating upward, no and downward movement, while neglecting horizontal facial movements; a sketch of this kind of region-based flow quantization is given after this list.

Fig. 5. Facial motion extraction using dense optical flow: shown are two sample facial expression sequences on the left-hand side and the corresponding optical flow images on the right-hand side, computed with Nagel's algorithm [65]. Note the asymmetric facial action display in the lower facial expression sequence.
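The region-based flow quantization referred to above can be sketched as follows; Farneback's dense optical flow (available in OpenCV) is used here purely as a readily available stand-in for the wavelet-based and other flow estimators cited in the survey, and the grid size and threshold are illustrative:

import cv2
import numpy as np

def ternary_flow_features(prev_gray, next_gray, grid=(8, 10), thresh=0.5):
    """Dense optical flow between two normalized face frames, quantized
    region-wise into +1 / 0 / -1 vertical-motion labels (cf. Yoneyama et al.)."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    vertical = flow[..., 1]                  # dy component per pixel
    h, w = vertical.shape
    rows, cols = grid
    features = np.zeros(grid, dtype=int)
    for r in range(rows):
        for c in range(cols):
            block = vertical[r * h // rows:(r + 1) * h // rows,
                             c * w // cols:(c + 1) * w // cols]
            mean_dy = block.mean()
            if mean_dy < -thresh:
                features[r, c] = +1          # upward motion (image y grows downward)
            elif mean_dy > thresh:
                features[r, c] = -1          # downward motion
    return features.ravel()                  # 80-dimensional ternary feature vector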

Different optical flow algorithms have been applied to facial motion analysis. For instance, Lien et al. [34] employed Wu's [66] optical flow approach to estimate facial motion, using scaling functions and wavelets from Cai and Wang [67] to capture both local and global facial characteristics. Essa and Pentland [24] used Simoncelli's [68] coarse-to-fine optical flow, while Yacoob and Davis [47] as well as Rosenblum et al. [53] employed the optical flow method of Abdel-Mottaleb et al. [69]. Apart from a certain vulnerability to image noise and non-uniform lighting, holistic dense optical flow methods often result in prodigious computational requirements and tend to be sensitive to motion discontinuities (iconic changes) as well as non-rigid motion. Optical flow analysis can also be done in conjunction with motion models that allow for increased stability and a better interpretation of extracted facial motion, e.g. in terms of muscle activations:

• Holistic motion models: Terzopoulos and Waters [70] used 11 principal deformable contours (also known as "snakes") to track lip and facial features throughout image sequences with the aid of a force field computed from gradients found in the images. Only frontal faces were allowed and some facial make-up was used to enhance contrast. Essa and Pentland [24] employed sophisticated 3D motion and muscle models for facial expression recognition and increased tracking stability by Kalman filtering. DeCarlo and Metaxas [48] presented a formal methodology for the integration of optical flow and 3D deformable models and applied it to human face shapes and facial motion estimation. A relatively small number of parameters were used to describe a rich variety of face shapes and facial expressions. Eisert and Girod [46] employed 3D face models to specify shape, texture and motion. These models were also used to describe facial expressions caused by speech and were parameterized by FAPs of the MPEG-4 coding scheme.

• Local motion models: Black and Yacoob [30] as well as Yacoob and Davis [47] introduced local parametric motion models that make it possible, within local regions in space and time, not only to accurately model non-rigid facial motion, but also to provide a concise description of the motion associated with the edges of the mouth, nose, eyelids and eyebrows in terms of a small number of parameters. However, the employed motion models focus on the main intransient facial features involved in facial expressions (eyes, eyebrows and mouth), and the analysis of transient facial features occurring in the remaining facial areas was not considered. Last but not least, Basu et al. [49] presented a convincing approach for tracking human lip motion by using 3D models.

In contrast to low-level dense optical flow, there are also higher-level variants that focus on the movement of generic feature points, patterns or markers:

• Feature point tracking: Here, motion estimates are obtained only for a selected set of prominent features such as intransient facial features [34,50,51]. In order to reduce the risk of tracking loss, feature points are placed in areas of high contrast, preferably around intransient facial features, as illustrated on the right-hand side of Fig. 6; a minimal tracking sketch is given after this list. The movement and deformation of these features can hence be measured by tracking the displacement of the corresponding feature points. Motion analysis is directed towards objects of interest and therefore does not have to be computed for extraneous background patterns. However, as facial motion is extracted only at selected feature point locations, other facial activities are ignored altogether. The automatic initialization of feature points is difficult and was often done manually. Otsuka and Ohya [52] presented a feature point tracking approach where feature points are not selected by human expertise, but chosen automatically in the first frame of a given facial expression sequence. This is achieved by acquiring potential facial feature points from local extrema or saddle points of luminance distributions. Tian et al. [50] used different component models for the lips, eyes, brows as well as cheeks and employed feature point tracking to adapt the contours of these models according to the deformation of the underlying facial features. Finally, Rosenblum et al. [53] tracked rectangular regions of interest enclosing facial features with the aid of feature points.

• Marker tracking: It is possible to determine facial actions with more reliability than with the previously discussed methods, namely by measuring deformation in areas where underlying muscles interact. Unfortunately, these are mostly skin regions with relatively poor texture. Highlighting is necessary and can be done either by applying color to salient facial features and skin [57] or by affixing colored plastic dots to predefined locations on the subject's face, see the illustration on the left-hand side of Fig. 6. Markers render tissue motion visible and were employed in Refs. [3,58].

Fig. 6. Marker versus feature point tracking: on the left-hand side, two faces with affixed markers are shown; the corresponding extracted marker patterns are depicted in the next column. Scale-normalized distances between marker points make it possible to determine underlying muscle activities [58]. A marker-less feature point tracking approach is shown on the right-hand side, where six feature points with 10 × 10 pixel windows were used to determine the displacement of the eyebrows.
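A minimal feature point tracking sketch, using OpenCV's pyramidal Lucas-Kanade tracker as a stand-in for the tracking schemes discussed above; the automatic corner-based initialization replaces the manual placement mentioned in the text, and window size and point counts are illustrative:

import cv2
import numpy as np

def track_feature_points(first_gray, frames):
    """Track high-contrast feature points through a face image sequence
    with pyramidal Lucas-Kanade (illustrative parameters only)."""
    # Automatic initialization in the first frame; in several cited systems
    # this step was done manually around eyes, brows and mouth.
    points = cv2.goodFeaturesToTrack(first_gray, maxCorners=30,
                                     qualityLevel=0.01, minDistance=7)
    trajectory = [points.reshape(-1, 2)]
    prev = first_gray
    for frame in frames:
        points, status, _err = cv2.calcOpticalFlowPyrLK(
            prev, frame, points, None,
            winSize=(10, 10), maxLevel=3,
            criteria=(cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 20, 0.03))
        trajectory.append(points.reshape(-1, 2))
        prev = frame
    # Displacements of the points between the first and last frame approximate
    # the deformation of the underlying facial features.
    return np.stack(trajectory)              # shape: (n_frames, n_points, 2)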

Note that even though the tracking of feature points or markers makes it possible to extract motion, often only relative feature point locations, i.e. deformation information, were used for the analysis of facial expressions, e.g. in Refs. [50,58]. Yet another way to extract image motion is via difference-images: specifically for facial expression analysis, difference-images are mostly created by subtracting a given facial image from a previously registered reference image containing a neutral face of the same subject. However, in comparison to optical flow approaches, no flow direction can be extracted, only differences of image intensities. In addition, accurate face normalization procedures are necessary in order to align reference faces with the test faces. Holistic difference-image-based motion extraction was employed in Refs. [27,33,54,55]. Choudhury and Pentland [56] used motion field histograms for the modeling of eye and eyebrow actions. Motion was also extracted by difference-images, but taken from consecutive image frames and further processed using local receptive field histograms [71] in order to increase robustness with regard to rotation, translation and scale changes.
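A difference-image can be computed in a few lines; the sketch below assumes the neutral reference face has already been aligned to the test face, and the smoothing kernel size is an arbitrary choice:

import cv2
import numpy as np

def difference_image(neutral_gray, test_gray, blur=5):
    """Holistic difference-image: subtract an aligned neutral reference face
    from the current face. Only intensity differences are obtained; unlike
    optical flow, no motion direction is recovered."""
    neutral = cv2.GaussianBlur(neutral_gray, (blur, blur), 0).astype(np.int16)
    test = cv2.GaussianBlur(test_gray, (blur, blur), 0).astype(np.int16)
    diff = np.abs(test - neutral).astype(np.uint8)
    return diff   # can be fed to PCA/ICA or a classifier, as in the cited systems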

3.5. Classification

Feature classification is performed in the last stage of an automatic facial expression analysis system. This can be achieved either by attempting facial expression recognition using sign-based facial action coding schemes, or by interpretation in combination with judgment- or sign/dictionary-based frameworks. We can distinguish spatial and spatio-temporal classifier approaches:

• Spatio-temporal approaches: Hidden Markov models (HMM) are commonly used in the field of speech recognition, but are also useful for facial expression analysis, as they make it possible to model the dynamics of facial actions. Several HMM-based classification approaches can be found in the literature [44,72,73]; they were mostly employed in conjunction with image motion extraction methods. Recurrent neural networks constitute an alternative to HMMs and were also used for the task of facial expression classification [53,74]. Another way of taking the temporal evolution of facial expressions into account is provided by so-called spatio-temporal motion-energy templates. Here, facial motion is represented in terms of 2D motion fields, and the Euclidean distance between two templates can be used to estimate the prevalent facial expression [24].

• Spatial approaches: Neural networks have often been used for facial expression classification [32,33,35,39,42,45,75]. They were either applied directly to face images [27,32] or combined with facial feature extraction and representation methods such as PCA, independent component analysis (ICA) or Gabor wavelet filters [27,31]. The former are unsupervised statistical analysis methods that allow for a considerable dimensionality reduction, which both simplifies and enhances subsequent classification, as sketched below. These methods have been employed both in a holistic manner [33,54,55] and locally, using mosaic-like patches extracted from small facial regions [31,33,35,54]. Dailey and Cottrell [31] applied both local PCA and Gabor jets to the task of facial expression recognition and obtained quantitatively indistinguishable results for both representations. Fig. 7 shows an illustration of PCA and ICA components obtained from facial expression images. Unfortunately, neural networks are difficult to train when used for the classification not only of basic emotions, but of unconstrained facial expressions. One problem is the great number of possible facial action combinations; about 7000 AU combinations have been identified within the FACS framework [18]. An alternative to classically trained neural networks is provided by compiled, rule-based neural networks, which were employed, e.g., in Ref. [58].
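The PCA-projection-plus-simple-classifier pipeline mentioned above can be sketched as follows; this plain NumPy example uses a nearest-neighbour decision instead of a neural network and random stand-in data, so it illustrates the representation step rather than any cited system:

import numpy as np

class PCAClassifier:
    """PCA projection followed by a nearest-neighbour decision: a minimal
    stand-in for the PCA/ICA + classifier pipelines referenced above."""

    def fit(self, images, labels, n_components=20):
        X = images.reshape(len(images), -1).astype(float)   # one row per face image
        self.mean = X.mean(axis=0)
        _U, _S, Vt = np.linalg.svd(X - self.mean, full_matrices=False)
        self.components = Vt[:n_components]                  # holistic "eigen-images"
        self.train_proj = (X - self.mean) @ self.components.T
        self.labels = np.asarray(labels)
        return self

    def predict(self, image):
        proj = (image.reshape(-1).astype(float) - self.mean) @ self.components.T
        nearest = np.argmin(np.linalg.norm(self.train_proj - proj, axis=1))
        return self.labels[nearest]

# Illustrative use with random stand-in data (real input: normalized face images):
rng = np.random.default_rng(1)
faces = rng.random((60, 64, 64))
emotions = rng.choice(["happiness", "sadness", "surprise"], size=60)
clf = PCAClassifier().fit(faces, emotions)
print(clf.predict(faces[0]))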

3.5.1. Facial expression recognition
Traditional approaches for modeling characteristics of facial motion and deformation have relied on hand-crafted rules and symbolic mid-level representations for emotional states, which have been introduced by computer scientists

in the course of their investigations on facial expressions [4,30,47]. Human expertise is necessary to map these symbolic representations into, e.g., emotions. However, facial signals consist of numerous distinct expressions, each with specific facial action intensity evolutions. In addition, individual realizations of facial expressions often differ only in subtle ways. This makes the task of manually creating facial expression classes rather difficult. Therefore, another group of researchers has relied on facial expression coding schemes such as MPEG-4 [46,60] or FACS [33,41,50,52,54,55,72,76]. Essa and Pentland [24] proposed an extension to FACS called FACS+, which consists of a set of control parameters obtained from vision-based observations. In contrast to FACS, FACS+ also describes the dynamics of facial expressions.

Fig. 7. Facial feature representation using data-driven methods: sample difference-images are shown in the first row, the corresponding holistic ICA components in the second row and PCA components in the third row. The difference-images were computed by subtracting a neutral face reference image from face images displaying facial actions.

3.5.2. Facial expression interpretation
Many automatic facial expression analysis systems found in the literature attempt to directly interpret observed facial expressions, mostly in terms of basic emotions [21,22,24,39,40,42,43,51,75,77,78]. Only a few systems use rules or facial expression dictionaries in order to translate coded facial actions into emotion categories [30,41]. The latter approaches not only have the advantage of accurately describing facial expressions without resorting to interpretation, but also make it possible to animate synthetic faces, e.g. within the FACS coding framework [15]. This is of interest, as animated synthetic faces make a direct inspection of automatically recognized facial expressions possible. See also Ref. [79] for an introduction to automatic facial expression interpretation.

4. Representative facial expression recognition systems

In this section, we take a closer look at a few representative facial expression analysis systems. First, we discuss deformation and motion extraction-based systems. Then we introduce hybrid facial expression analysis systems, which employ several image analysis methods that complement each other and thus allow for better overall performance. Multi-modal frameworks, on the other hand, integrate other non-verbal communication channels for improved facial expression interpretation results. Finally, unified frameworks focus on multiple facial characteristics, allowing for synergy effects between different modalities.

4.1. Deformation extraction-based systems

Padgett et al. [77] presented an automatic facial expression interpretation system that was capable of identifying six basic emotions. Facial data were extracted from 32 × 32 pixel blocks placed on the eyes and the mouth and projected onto the top 15 PCA eigenvectors of 900 random patches extracted from training images. For classification, the normalized projections were fed into an ensemble of 11 neural networks. Their output was summed and normalized again by dividing the average outputs for each possible emotion across all networks by their respective deviation over the entire training set. The largest score for a particular input was considered to be the emotion found by the ensemble of networks. Altogether, 97 images of six emotions from 6 males and 6 females were analyzed and an 86% generalization performance was measured on novel face images. Lyons et al. [43] presented a Gabor wavelet-based facial expression analysis framework featuring a node grid of Gabor jets, similar to what was used by the von der Malsburg group for the task of face recognition [80]. Hereby, each test image was convolved with a set of Gabor filters, whose responses are highly correlated and redundant at neighboring pixels. Therefore, it was only necessary to acquire samples at specific points on a sparse grid covering the face. The projections of the filter responses along discriminant vectors, calculated from the training set, were compared at corresponding spatial frequencies, orientations and locations of two face images, where the normalized dot product was used to measure the similarity of two Gabor response vectors. Lyons et al. placed the graphs manually onto the faces in order to obtain better precision for the task of facial expression recognition. Experiments were carried out on subsets of a total of six different posed expressions and neutral faces of 9 Japanese female undergraduates. A generalization rate of 92% was obtained for the recognition of new expressions of known subjects and 75% for the recognition of facial expressions of novel expressers.
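The normalized dot product used to compare Gabor response vectors amounts to a cosine similarity between jet magnitudes; a one-function sketch (plain NumPy, illustrative only):

import numpy as np

def jet_similarity(jet_a, jet_b):
    """Normalized dot product between two Gabor response (jet) magnitude
    vectors, as used to compare corresponding grid nodes of two faces;
    1.0 means identical orientation/frequency response patterns."""
    a = np.abs(np.asarray(jet_a, dtype=float)).ravel()
    b = np.abs(np.asarray(jet_b, dtype=float)).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A whole-face similarity can then be taken as the mean node-wise similarity
# over the sparse grid covering the face (grid placement as described above).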

4.2. Motion extraction-based systems

Black and Yacoob [30] analyzed facial expressions with parameterized models for the mouth, the eyes and the eyebrows and represented image flow with low-order polynomials [81]. A concise description of facial motion was achieved with the aid of a small number of parameters from which they derived mid- and high-level descriptions of facial

actions. The latter also considered the temporal consistency of the mid-level predicates in order to minimize the effects of noise and inaccuracies with regard to the motion and deformation of the models. Hence, each facial expression was modeled by registering the intensities of the mid-level parameters within temporal segments (beginning, apex, ending). Extensive experiments were carried out on 40 subjects in the laboratory with a 95–100% correct recognition rate, and also with television and movie sequences, resulting in a 60–100% correct recognition rate. Black and Yacoob showed that a recognition of basic emotions is possible even in the presence of significant pose variations and head motion. Essa and Pentland [24] presented a rather complete computer vision system featuring both automatic face detection and face analysis. Facial motion was extracted with the aid of holistic dense optical flow and coupled with 3D motion and muscle-based face models. The latter made it possible to describe the facial structure, including facial tissue as well as muscle actuators and their force-based deformation. Essa and Pentland located test faces automatically by using a view-based and modular eigenspace method and also determined the position of facial features. The latter were then employed in order to warp face images to match canonical face meshes, which in turn allowed additional feature points corresponding to fixed nodes on the meshes to be extracted. After the initial model-to-image registration, Simoncelli's [68] coarse-to-fine optical flow was used to compute image motion. In addition, a Kalman filter-based control framework was applied in order to prevent chaotic responses of the physical system. The employed dynamic face model not only allowed muscle actuations of observed facial expressions to be extracted, but also made it possible to produce noise-corrected 2D motion fields via the control-theoretic approach. The latter were then classified with motion energy templates in order to extract facial actions. Experiments were carried out on 52 frontal-view image sequences with a correct recognition rate of 98% for both the muscle and the 2D motion energy models.

Fig. 8. Hybrid facial expression analysis system proposed by Bartlett et al. [82]. The three analysis methods employed were shown to produce different error patterns and thus allow for an improved recognition performance when combined.

4.3. Hybrid systems

Hybrid facial expression analysis systems combine several facial expression analysis methods. This is most beneficial if the individual estimators produce very different error patterns. Bartlett et al. [82] proposed a system that integrates holistic difference-image motion extraction coupled with PCA, feature measurements along predefined intensity profiles for the estimation of wrinkles, and holistic dense optical flow for whole-face motion extraction, see Fig. 8. These three methods were compared with regard to their contribution to the facial expression recognition task. Bartlett et al. estimated that without the feature measurements, there would have been a 40% decrease of the improvement gained by all methods combined. Faces were normalized by alignment through scaling, rotation and warping of aspect ratios. However, eye and mouth centers were located manually in the neutral face frame with which each test sequence had to start. Facial expression recognition was achieved with the aid of a feed-forward neural network made up of 10 hidden and six output units. The input of the neural network consisted of 50 PCA component projections, five feature density measurements and six optical flow-based template matches. A winner-takes-all (WTA) judgment approach was chosen to select the final AU candidates. Initially, Bartlett et al.'s hybrid facial expression analysis system was able to classify six upper-face FACS action units on a database containing 20 subjects, correctly recognizing 92% of the AU activations, but no AU intensities. Later it was extended to also allow for the classification of lower-face FACS action units and achieved a 96% accuracy for 12 lower and upper face actions [33,54]. Cohn et al. [72] and Lien et al. [76] introduced systems that are based on holistic image flow analysis, feature point tracking and high-gradient component analysis methods, which were integrated into a spatio-temporal framework and combined with an HMM recognition stage, see Fig. 9. Local face motion was estimated by feature point tracking.

Hereby, a pyramidal multi-scale search approach was employed that is sensitive to subtle feature motion and also makes it possible to track large displacements of feature motion with sub-pixel accuracy. Holistic facial motion, on the other hand, was estimated by employing Wu's multi-resolution wavelet-based optical flow [66]. Forehead, cheek and chin regions were analyzed for transient facial features by using high-gradient component analysis based on horizontal, vertical and diagonal line and edge detectors in the spatial domain and frame comparisons in the temporal domain. The latter made it possible to separate transient from intransient facial features and hair occlusion. Face tracking and face alignment were manually initialized by selecting three facial feature points in the first frame of each test image sequence. Lien et al.'s system was trained to analyze both the activity and the intensity of 15 FACS AUs situated in the brow, eye and mouth regions. The holistic dense optical flow approach gave the best average AU recognition rates, followed by feature point tracking and the high gradient component analysis approach, see also Table 2. Lien et al.'s facial expression analysis system performed well even with difficult sequences such as those containing baby faces, which differ from adult faces both in morphology and tissue texture. Unfortunately, heavy computational requirements arose with the use of optical flow.

Fig. 9. Hybrid facial expression analysis system proposed by Lien et al. [76]. The whole facial expression analysis framework is situated in the spatio-temporal domain, including the classification stage, which is driven by HMMs.

Table 2
Selected facial expression recognition systems: classification of facial actions using FACS

Authors        #Subjects (test/train)  #Faces (test/train)  #FACS AUs  Extraction methods    Classification methods  Rec. rate (%)
Bartlett [33]  4–20 / 4–20             111∗5 / 111∗5        6a + 6b    Diff.-img. + G. jets  Nearest neighbor        96
Bartlett [33]  4–20 / 4–20             111∗5 / 111∗5        6a + 6b    Diff.-img.            ICA + nearest neighbor  96
Bartlett [33]  4–20 / 4–20             111∗5 / 111∗5        6a + 6b    Optical flow          Motion templates        86
Fasel [55]     1 / 1                   45 / 182             9c         Diff.-img.            ICA + Eucl. dist.       83/41
Cohn [72]      30 / N/A                N/A / N/A            15         Feat. point tracking  HMM                     86
Lien [76]      N/A / N/A               75∗20 / 60∗20        3a         Feat. point tracking  HMM                     85
Lien [76]      N/A / N/A               150∗20 / 120∗20      6b         Feat. point tracking  HMM                     88
Lien [76]      N/A / N/A               75∗20 / 44∗20        3a         Optical flow          PCA + HMM               93
Lien [76]      N/A / N/A               160∗20 / 100∗20      4a         High grad. comp.      HMM                     85
Lien [76]      N/A / N/A               80∗20 / 50∗20        2b         High grad. comp.      HMM                     81
Pantic [41]    8 / N/A                 496 / N/A            31         Multi feat. detect.   Expert rules            89

a Upper-face FACS coding. b Lower-face FACS coding. c Nine asymmetric FACS classes, each with five intensity levels on two face sides.


Fig. 10. Unified facial expression analysis framework proposed by Lanitis et al. [21]. It is based on active shape models (ASM). Shape information is also used to extract shape-free whole-face patches that represent texture information.

4.4. Multimodal frameworks

Today, most facial expression analysis systems are of the unimodal type, as they focus only on facial expressions when determining mental activities. However, the evaluation of multiple communication channels may foster robustness as well as improve the correct interpretation of facial expressions in ambiguous situations. At present, most attempts at channel fusion are of the bimodal type and integrate voice in addition to facial expressions. Vocal expressions are conveyed by prosodic features, which include the fundamental frequency, intensity and rhythm of the voice. Cohn and Katz [83] as well as Chen et al. [84] focused on the fundamental frequency, as it is an important voice feature for emotion recognition and can be extracted easily.

4.5. Unified frameworks

Facial expression recognition may be improved by considering not only facial actions but also face characteristics such as identity, gender, age and ethnicity. Lanitis et al. [21] proposed a unified framework that performs multiple face processing tasks in parallel, albeit without these tasks influencing each other. They employed 2D active shape models that yield, once aligned to a test face, face appearance parameters from which it is possible to estimate 3D pose, identity, gender and facial expressions. During training, shape models are derived from a set of images by statistical analysis of the landmark point positions. These represent the main facial features and are placed manually on training images prior to the shape model creation process, see Fig. 10. During testing, facial features are located in the test image using ASM search [85], guided by the flexible shape models obtained during training. Gray-level profile information is collected at each model point and used to determine the best fitting shape model. Lanitis et al. then deformed test faces to the mean face shape by using the previously registered shape information in order to extract holistic shape-free patches, which account for facial texture. Finally, Hong et al. [22] proposed an online facial expression recognition system which is based on personalized galleries and uses identity information in conjunction with facial expression analysis. Faces of interest are detected and tracked in live video sequences with the aid of the PersonSpotter system [23], and recognition of facial expressions is achieved by performing elastic graph matching. The nodes of the employed graphs are comprised of Gabor jets and are fitted to given test faces. The obtained graph is first used to determine the identity of a given subject by choosing the closest match. In a second stage, the closest matching graph found in the personalized gallery of the identified person is used to determine the displayed facial expression, thus allowing for a better focus on intra-personal variations.

5. Discussion

In this survey on automatic facial expression analysis, we have discussed automatic face analysis with regard to different motion and deformation-based extraction methods, model and image-based representation techniques as well as recognition and interpretation-based classification approaches. It is not possible to directly compare facial expression recognition results of face analysis systems found in the literature due to varying facial action labeling and different test beds that were used for the assessment. Recently, a rather complete facial expression database has been presented by Kanade et al. [86]. However, there is a lack of publicly available, all-encompassing facial expression databases that would allow for testing facial expression analysis methods in a more transparent way. Nonetheless, we tried to characterize a few selected systems with regard to the employed feature extraction and classification methods.



Table 3
Selected facial expression interpretation systems: classification of emotional displays. Note that the systems presented in the last two rows perform facial action recognition prior to using dictionaries in order to interpret facial actions.

Authors | Subjects (test) | Subjects (train) | Faces (test) | Faces (train) | # Em. classes | Extraction methods | Classification methods | Rec. rate (%)
Lyons [43] | 9 | N/A | 193 | N/A | 7 | G. Wav. + El. Gr. | LDA + PCA + Cl. | 75–92
Kobayashi [39] | 15 | N/A | 90 | N/A | 6 | — | Feed Forw. NN | 85
Rosenblum [53] | 32 | 20/14 | 34 sq. | 20/14 sq. | 2 | Optical flow | RBF NN | 88
Padgett [35] | 12 | N/A | N/A | N/A | 6 | — | PCA + NN | 86
Essa [24] | 8 | 8 | 8 sq. | 8 sq. | 6 | Opt. F. + 3D M. | Motion templ. | 98
Lanitis [21] | 30 | 30 | 300 | 390 | 7 | Appear. mod. | Mahal. distance | 74
Black [30] | 40 | N/A | 70 sq. | N/A | 6 | Motion mod. | Expert rules | 83–100
Pantic [41] | 8 | N/A | 496 | N/A | 6 | Mul. feat. det. | Expert rules | 91

Table 2 lists systems that perform facial expression recognition by classifying facial actions, while Table 3 presents systems that attempt either direct interpretation of emotional facial displays or indirect interpretation via facial expression dictionaries. The application of currently available automatic facial expression recognition systems is often very restricted due to the limited robustness and hard constraints imposed on the recording conditions. Many systems assume faces to be centered in the input image and seen from a near frontal view throughout the whole test sequence. Also, it is often taken for granted that there are only small rigid head motions between any two consecutive frames. In addition, most facial expression analysis systems require substantial manual intervention for the detection and accurate normalization of test faces, during the initialization of facial feature tracking approaches or for warping video sequences. Most facial expression analysis systems are limited to the analysis of either static images or image sequences. However, an ideal system should be capable of analyzing both static images and image sequences, since image sequences are not always available and, when they are, motion should be extracted in order to obtain directional information about skin and facial feature deformation. Furthermore, the measurement of facial expression intensities has only been addressed by a few systems [32,34,55]. It is important for the interpretation of facial expressions, especially when attempting to analyze the temporal evolution and timing of facial actions. Out-of-plane rotated faces are difficult to tackle and only a few approaches found in the literature were able to deal with this problem: active appearance models [21,38], local parametric models [30,47], 3D motion models [48,49] and to some degree also feature point tracking approaches [34,50]. Hybrid facial expression analysis systems are also of interest, as they combine different face analysis methods and may thus give better recognition results than the individual methods applied on their own [33]. This holds if the employed extraction algorithms focus on different facial features or if the combined extraction, representation and recognition stages produce different error patterns, as sketched below.
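For illustration, the following sketch shows one simple way such complementary classifier outputs could be combined at the score level. The class labels, weight and the pairing of a motion-based with a deformation-based recognizer are illustrative assumptions, not a method taken from the cited work.

```python
# Minimal sketch of score-level fusion for a hybrid system; labels and weight
# are illustrative assumptions.
import numpy as np

EMOTIONS = ["happiness", "sadness", "fear", "disgust", "surprise", "anger"]

def fuse_scores(scores_motion: np.ndarray, scores_deform: np.ndarray,
                w_motion: float = 0.5) -> str:
    """Combine two per-class score vectors by a weighted sum and return the label.
    Fusion helps most when the two classifiers make different kinds of errors."""
    fused = w_motion * scores_motion + (1.0 - w_motion) * scores_deform
    return EMOTIONS[int(np.argmax(fused))]
```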

6. Conclusion

Today, most facial expression analysis systems attempt to map facial expressions directly onto basic emotional categories and are thus unable to handle facial actions caused by non-emotional mental and physiological activities. FACS may provide a solution to this dilemma, as it allows facial actions to be classified prior to any interpretation attempts. So far, only marker-based systems are able to reliably code all FACS action unit activities and intensities [58]. More work has to be done in the field of automatic facial expression interpretation with regard to the integration of other communication channels such as voice and gestures. Although facial expressions often occur during conversations [87], none of the cited approaches considered this possibility. If automatic facial expression analysis systems are to be operated autonomously, current feature extraction methods have to be improved and extended with regard to robustness in natural environments as well as independence from manual intervention during initialization and deployment.

7. Summary

In recent years, facial expression analysis has become an active research area. Various approaches have been made towards robust facial expression recognition, applying different image acquisition, analysis and classification methods. Facial expression analysis is an inherently multi-disciplinary field and it is important to look at it from all domains involved in order to gain insight on how to build reliable automated facial expression analysis systems. This fact has often been neglected in various implementations presented in the literature. Facial expressions reflect not only emotions, but also cognitive processes, social interaction and physiological signals. However, most facial expression analysis systems have attempted to map facial expressions directly onto basic emotions, which represents an ill-posed problem. Decoupling facial expression recognition and facial expression interpretation may provide a solution to this dilemma. This can be achieved by first coding facial expressions with an appearance-based representation scheme such as the facial action coding system (FACS) and then using facial expression dictionaries in order to translate recognized facial actions into mental activity categories.

In this survey, we have reviewed the most prominent automatic facial expression analysis methods and systems presented in the literature. Facial motion and deformation extraction approaches as well as facial feature representation and classification methods were discussed with respect to issues such as face normalization, facial expression dynamics and intensity, as well as robustness towards environmental changes.

The application of currently available automatic facial expression recognition systems to the analysis of natural scenes is often very restricted due to the limited robustness of these systems and the hard constraints posed on the subject and on the recording conditions. Especially out-of-plane rotated faces and faces partly occluded by facial hair or sunglasses are difficult to handle. Furthermore, many analysis methods make the hypothesis that faces are centered in the image and seen from a near frontal view throughout test sequences. Often, small rigid head motion between any two consecutive frames is assumed. Most facial expression analysis systems need substantial manual intervention for the accurate normalization of test faces and initialization of extraction methods, such as localization of facial feature points, facial feature template selection and manual warping of video sequences.

In this article, we also had a closer look at a few representative facial expression analysis systems and discussed deformation and motion-based feature extraction systems, hybrid systems based on multiple complementary face processing tasks and multimodal systems that integrate e.g. visual and acoustic signals. As we have seen, unified frameworks are of interest as well, as they allow focusing on multiple facial characteristics such as face identity and facial expression displays and thus allow for synergy effects between different modalities. We concluded this survey by summarizing recognition results and shortcomings of currently employed analysis methods and proposed possible future research directions.

Various applications using automatic facial expression analysis can be envisaged in the near future, fostering further interest in research in the fields of facial expression recognition, facial expression interpretation and facial expression animation. Non-verbal information transmitted by facial expressions is of great importance in different areas, including image understanding, psychological studies, facial nerve grading in medicine, face image compression and synthetic face animation, more engaging human–machine interfaces, video-indexing, robotics as well as virtual reality.

References

[1] C. Darwin, The Expression of the Emotions in Man and Animals, J. Murray, London, 1872.

[2] P. Ekman, W.V. Friesen, Constants across cultures in the face and emotion, J. Personality Social Psychol. 17 (2) (1971) 124–129.

[3] M. Suwa, N. Sugie, K. Fujimora, A preliminary note on pattern recognition of human emotional expression, Proceedings of the Fourth International Joint Conference on Pattern Recognition, Kyoto, Japan, 1978, pp. 408–410.

[4] K. Mase, A. Pentland, Recognition of facial expression from optical flow, IEICE Trans. E 74 (10) (1991) 3474–3483.

[5] P. Dulguerov, F. Marchal, D. Wang, C. Gysin, P. Gidley, B. Gantz, J. Rubinstein, S. Seiff, L. Poon, K. Lun, Y. Ng, Review of objective topographic facial nerve evaluation methods, Am. J. Otol. 20 (5) (1999) 672–678.

[6] R. Koenen, MPEG-4 Project Overview, International Organisation for Standardisation, ISO/IEC JTC1/SC29/WG11, La Baule, 2000.

[7] D. Messinger, A. Fogel, K.L. Dickson, What's in a smile? Develop. Psychol. 35 (3) (1999) 701–708.

[8] G. Schwartz, P. Fair, P. Salt, M. Mandel, G. Klerman, Facial expression and imagery in depression: an electromyographic study, Psychosomatic Med. 38 (1976) 337–347.

[9] P. Ekman, Emotions in the Human Face, Cambridge University Press, Cambridge, 1982.

[10] P. Ekman, W.V. Friesen, Facial Action Coding System: A Technique for the Measurement of Facial Movement, Consulting Psychologists Press, Palo Alto, 1978.

[11] W. Friesen, P. Ekman, Emotional facial action coding system, unpublished manual, 1984.

[12] C. Izard, The maximally discriminative facial movement coding system (MAX), Available from Instructional Resource Center, University of Delaware, Newark, Delaware, 1979.

[13] C. Izard, L. Dougherty, E. Hembree, A system for identifying affect expressions by holistic judgments, unpublished manuscript, 1983.

[14] N. Tsapatsoulis, K. Karpouzis, G. Stamou, A fuzzy system for emotion classification based on the MPEG-4 facial definition parameter, European Association on Signal Processing EUSIPCO, 2000.

[15] M. Hoch, G. Fleischmann, B. Girod, Modeling and animation of facial expressions based on B-splines, Visual Comput. (1994) 87–95.

[16] W. Friesen, P. Ekman, Dictionary—interpretation of FACS scoring, unpublished manuscript, 1987.

[17] P. Ekman, E. Rosenberg, J. Hager, Facial action coding system affect interpretation database (FACSAID), http://nirc.com/Expression/FACSAID/facsaid.html, July 1998.

[18] P. Ekman, Methods for measuring facial actions, in: K. Scherer, P. Ekman (Eds.), Handbook of Methods in Nonverbal Behaviour Research, Cambridge University Press, Cambridge, 1982, pp. 45–90.

[19] D. Matsumoto, Cultural similarities and differences in display rules, Motivation Emotion 14 (3) (1990) 195–214.

[20] D. Matsumoto, Ethnic differences in affect intensity, emotion judgments, display rules, and self-reported emotional expression, Motivation Emotion 17 (1993) 107–123.

[21] A. Lanitis, C. Taylor, T. Cootes, Automatic interpretation and coding of face images using flexible models, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 743–756.

[22] H. Hong, H. Neven, C. Von der Malsburg, Online facial expression recognition based on personalized galleries, Proceedings of the Second International Conference on Automatic Face and Gesture Recognition (FG'98), IEEE, Nara, Japan, 1998, pp. 354–359.

[23] J. Steffens, E. Elagin, H. Neven, PersonSpotter—fast and robust system for human detection, tracking and recognition, Proceedings of the Second International Conference on Face and Gesture Recognition (FG'98), Nara, Japan, 1998, pp. 516–521.

[24] I. Essa, A. Pentland, Coding, analysis, interpretation and recognition of facial expressions, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 757–763.

[25] A. Pentland, B. Moghaddam, T. Starner, View-based and modular eigenspaces for face recognition, IEEE Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 1994, pp. 84–91.

[26] H. Rowley, S. Baluja, T. Kanade, Neural network-based face detection, IEEE Trans. Pattern Anal. Mach. Intell. 20 (1) (1998) 23–38.

[27] W. Fellenz, J. Taylor, N. Tsapatsoulis, S. Kollias, Comparing template-based, feature-based and supervised classification of facial expressions from static images, Proceedings of Circuits, Systems, Communications and Computers (CSCC'99), Nugata, Japan, 1999, pp. 5331–5336.

[28] P. Belhumeur, J. Hespanha, D. Kriegman, Eigenfaces vs. fisherfaces: recognition using class specific linear projection, IEEE Trans. Pattern Anal. Mach. Intell. 19 (7) (1997) 711–720.

[29] M. Black, D. Fleet, Y. Yacoob, A framework for modeling appearance change in image sequences, Sixth International Conference on Computer Vision (ICCV'98), IEEE Computer Society Press, Silverspring, MD, 1998.

[30] M. Black, Y. Yacoob, Recognizing facial expressions in image sequences using local parameterized models of image motion, Internat. J. Comput. Vision 25 (1) (1997) 23–48.

[31] M. Dailey, G. Cottrell, PCA Gabor for expression recognition, UCSD, Report CS-629, 1999.

[32] C. Lisetti, D. Rumelhart, Facial expression recognition using a neural network, Proceedings of the 11th International Flairs Conference, AAAI Press, New York, 1998.

[33] M. Bartlett, Face image analysis by unsupervised learning and redundancy reduction, Ph.D. Thesis, University of California, San Diego, 1998.

[34] J. Lien, Automatic recognition of facial expression using hidden Markov models and estimation of expression intensity, Ph.D. Thesis, The Robotics Institute, CMU, April 1998.

[35] C. Padgett, G. Cottrell, Representing face image for emotion classification, in: M. Mozer, M. Jordan, T. Petsche (Eds.), Advances in Neural Information Processing Systems, Vol. 9, MIT Press, Cambridge, MA, pp. 894–900.

[36] G.W. Cottrell, J. Metcalfe, EMPATH: face, gender and emotion recognition using holons, in: R. Lippman, J. Moody, D. Touretzky (Eds.), Advances in Neural Information Processing Systems, Morgan Kaufman, San Mateo, CA, Vol. 3, 1991, pp. 564–571.

[37] T. Cootes, G. Edwards, C. Taylor, Active appearance models, IEEE PAMI 23 (6) (2001) 681–685.

[38] G. Edwards, T. Cootes, C. Taylor, Face recognition using active appearance models, Proceedings of the Fifth European Conference on Computer Vision (ECCV), Vol. 2, University of Freiburg, Germany, 1998, pp. 581–695.

[39] H. Kobayashi, F. Hara, Facial interaction between animated 3D face robot and human beings, Proceedings of the International Conference on Systems, Man and Cybernetics, Orlando, FL, USA, 1997, pp. 3732–3737.

[40] C. Huang, Y. Huang, Facial expression recognition using model-based feature extraction and action parameters classification, J. Visual Commun. Image Representation 8 (3) (1997) 278–290.

[41] M. Pantic, L. Rothkrantz, Expert system for automatic analysis of facial expression, Image Vision Comput. J. 18 (11) (2000) 881–905.

[42] Z. Zhang, M. Lyons, M. Schuster, S. Akamatsu, Comparison between geometry-based and Gabor-wavelets-based facial expression recognition using multi-layer perceptron, IEEE Proceedings of the Second International Conference on Automatic Face and Gesture Recognition (FG'98), Nara, Japan, 1998, pp. 454–459.

[43] M. Lyons, J. Budynek, S. Akamatsu, Automatic classification of single facial images, IEEE Trans. Pattern Anal. Mach. Intell. 21 (12) (1999).

[44] T. Otsuka, J. Ohya, Spotting segments displaying facial expression from image sequences using HMM, IEEE Proceedings of the Second International Conference on Automatic Face and Gesture Recognition (FG'98), Nara, Japan, 1998, pp. 442–447.

[45] M. Yoneyama, Y. Iwano, A. Ohtake, K. Shirai, Facial expression recognition using discrete Hopfield neural networks, Proceedings of the International Conference on Image Processing (ICIP), Santa Barbara, CA, USA, Vol. 3, 1997, pp. 117–120.

[46] P. Eisert, B. Girod, Facial expression analysis for model-based coding of video sequences, Picture Coding Symposium, Berlin, Germany, 1997, pp. 33–38.

[47] Y. Yacoob, L.S. Davis, Recognizing human facial expression from long image sequences using optical flow, IEEE Trans. Pattern Anal. Mach. Intell. 18 (6) (1996) 636–642.

[48] D. DeCarlo, D. Metaxas, The integration of optical flow and deformable models with applications to human face shape and motion estimation, Proceedings of the International Conference on Computer Vision and Pattern Recognition (CVPR'96), 1996, pp. 231–238.

[49] S. Basu, N. Oliver, A. Pentland, 3D modeling and tracking of human lip motions, Proceedings of ICCV 98, Bombay, India, 1998.

[50] Y. Tian, T. Kanade, J. Cohn, Recognizing action units for facial expression analysis, IEEE Trans. Pattern Anal. Mach. Intell. 23 (2) (2001) 97–115.

[51] M. Wang, Y. Iwai, M. Yachida, Expression recognition from time-sequential facial images by use of expression change model, IEEE Proceedings of the Second International Conference on Automatic Face and Gesture Recognition (FG'98), Nara, Japan, 1998, pp. 324–329.

[52] T. Otsuka, J. Ohya, Extracting facial motion parameters by tracking feature points, Proceedings of First International Conference on Advanced Multimedia Content Processing, Osaka, Japan, 1998, pp. 442–453.

[53] M. Rosenblum, Y. Yacoob, L. Davis, Human expression recognition from motion using a radial basis function network architecture, IEEE Trans. Neural Networks 7 (5) (1996) 1121–1138.

[54] G. Donato, S. Bartlett, C. Hager, P. Ekman, J. Sejnowski, Classifying facial actions, IEEE Trans. Pattern Anal. Mach. Intell. 21 (10) (1999) 974–989.

[55] B. Fasel, J. Luettin, Recognition of asymmetric facial action unit activities and intensities, Proceedings of the International Conference on Pattern Recognition (ICPR 2000), Barcelona, Spain, 2000.

[56] T. Choudhury, A. Pentland, Motion field histograms for robust modeling of facial expressions, Proceedings of the International Conference on Pattern Recognition (ICPR 2000), Barcelona, Spain, 2000.

[57] B. Bascle, A. Blake, Separability of pose and expression in facial tracking and animation, Proceedings of the International Conference on Computer Vision, Bombay, India, 1998.

[58] S. Kaiser, T. Wehrle, Automated coding of facial behavior in human–computer interactions with FACS, J. Nonverbal Behavior 16 (2) (1992) 67–83.

[59] I. Essa, A. Pentland, Facial expression recognition using a dynamic model and motion energy, IEEE Proceedings of the Fifth International Conference on Computer Vision (ICCV 1995), Cambridge, MA, 1995, pp. 360–367.

[60] K. Karpouzis, G. Votsis, G. Moschovitis, S. Kollias, Emotion recognition using feature extraction and 3-D models, Proceedings of IMACS International Multiconference on Circuits and Systems Communications and Computers (CSCC'99), Athens, Greece, 1999, pp. 5371–5376.

[61] W. Hardcastle, Physiology of Speech Production, Academic Press, New York, 1976.

[62] D. Pollen, S. Ronner, Phase relationship between adjacent simple cells in the visual cortex, Science 212 (1981) 1409–1411.

[63] J. Daugman, Complete discrete 2D Gabor transform by neural networks for image analysis and compression, IEEE Trans. Acoustics, Speech Signal Process. 36 (1988) 1169–1179.

[64] I. Matthews, Active shape model toolbox, University of East Anglia, Norwich, UK, Matlab Toolbox version 2.0.0, July 1997.

[65] H. Nagel, On the estimation of optical flow: relations between different approaches and some new results, Artif. Intell. 33 (1987) 299–324.

[66] Y. Wu, T. Kanade, J. Cohn, C. Li, Optical flow estimation using wavelet motion model, IEEE International Conference on Computer Vision, Bombay, India, 1998, pp. 992–998.

[67] W. Cai, J. Wang, Adaptive multiresolution collocation methods for initial boundary value problems of nonlinear PDEs, Soc. Ind. Appl. Math. 33 (3) (1996) 937–970.

[68] E. Simoncelli, Distributed representation and analysis of visual motion, Ph.D. Thesis, Massachusetts Institute of Technology, 1993.

[69] M. Abdel-Mottaleb, R. Chellappa, A. Rosenfeld, Binocular motion stereo using MAP estimation, IEEE CVPR (1993) 321–327.

[70] D. Terzopoulos, K. Waters, Analysis of facial images using physical and anatomical models, Proceedings of the Third International Conference on Computer Vision, Osaka, Japan, 1990, pp. 727–732.

[71] B. Schiele, J. Crowley, Probabilistic object recognition using multidimensional receptive field histograms, Proceedings of the International Conference on Pattern Recognition (ICPR 1996), Vienna, Austria, 1996.

[72] J. Cohn, A. Zlochower, J. Lien, Y. Wu, T. Kanade, Automated face coding: a computer-vision based method of facial expression analysis, Seventh European Conference on Facial Expression Measurement and Meaning, Salzburg, Austria, 1997, pp. 329–333.

[73] N. Oliver, A. Pentland, F. Berard, LAFTER: a real-time lips and face tracker with facial expression recognition, Proceedings of the IEEE Conference on Computer Vision (CVPR97), San Juan, Puerto Rico, 1997.

[74] H. Kobayashi, F. Hara, Dynamic recognition of basic facial expressions by discrete-time recurrent neural network, Proceedings of the International Joint Conference on Neural Networks, 1993, pp. 155–158.

[75] J. Zhao, G. Kearney, Classifying facial emotions by backpropagation neural networks with fuzzy inputs, Proceedings of the International Conference on Neural Information Processing, Vol. 1, 1996, pp. 454–457.

[76] J. Lien, T. Kanade, J. Cohn, C. Li, Automated facial expression recognition based on FACS action units, IEEE Proceedings of the Second International Conference on Automatic Face and Gesture Recognition (FG'98), Nara, Japan, 1998.

[77] C. Padgett, G. Cottrell, R. Adolphs, Categorical perception in facial emotion classification, Proceedings of the 18th Annual Conference of the Cognitive Science Society, San Diego, CA, USA, 1996.

[78] S. Kimura, M. Yachida, Facial expression recognition and its degree estimation, IEEE Conference on Computer Vision and Pattern Recognition, San Juan, Puerto Rico, 1997, pp. 295–300.

[79] C. Lisetti, D. Schiano, Automatic facial expression interpretation: where human–computer interaction, artificial intelligence and cognitive science intersect, Pragmat. Cognition (Special issue on facial information processing: a multidisciplinary perspective) 8 (1) (2000) 185–235.

[80] M. Lades, J. Vorbruggen, J. Buhmann, J. Lange, C. Von der Malsburg, Distortion invariant object recognition in the dynamic link architecture, IEEE Trans. Comput. 42 (1993) 300–311.

[81] J. Bergen, P. Anandan, K. Hanna, R. Hingorani, Hierarchical model-based motion estimation, in: G. Sandini (Ed.), Proceedings of the Second European Conference on Computer Vision, ECCV-92, Lecture Notes in Computer Science, Vol. 588, Springer, Berlin, 1992, pp. 237–252.

[82] M. Bartlett, P. Viola, T. Sejnowski, B. Golomb, J. Larsen, J. Hager, P. Ekman, Classifying facial action, Advances in Neural Information Processing Systems, Vol. 8, MIT Press, Cambridge, 1996.

[83] J. Cohn, G. Katz, Bimodal expression of emotion by face and voice, Workshop on Face/Gesture Recognition and Their Applications, Sixth ACM International Multimedia Conference, Bristol, UK, 1998.

[84] L. Chen, T. Huang, T. Miyasato, R. Nakatsu, Multimodal human emotion/expression recognition, Proceedings of the Second International Conference on Automatic Face and Gesture Recognition (FG'98), IEEE, Nara, Japan, 1998.

[85] T. Cootes, C. Taylor, A. Lanitis, Multi-resolution search using active shape models, 12th International Conference on Pattern Recognition, Vol. 1, IEEE CS Press, Los Alamitos, CA, 1994, pp. 610–612.

[86] T. Kanade, J. Cohn, Y. Tian, Comprehensive database for facial expression analysis, IEEE Proceedings of the Fourth International Conference on Automatic Face and Gesture Recognition (FG'00), Grenoble, France, 2000.

[87] P. Ekman, About brows: emotional and conversational signals, in: J. Aschoff, M. Con Carnach, K. Foppa, W. Lepenies, D. Plog (Eds.), Human Ethology, Cambridge University Press, Cambridge, 1979, pp. 169–202.



About the Author—BEAT FASEL graduated from the Swiss Federal Institute of Technology Lausanne (EPFL) with a diploma in Communication Systems. He currently works towards a Ph.D. degree at IDIAP in Martigny, Switzerland. His research interests include computer vision, pattern recognition and artificial intelligence.

About the Author—JUERGEN LUETTIN received a Ph.D. degree in Electronic and Electrical Engineering from the University of Sheffield, UK, in the area of visual speech and speaker recognition. He joined IDIAP in Martigny, Switzerland, in 1996 as a research assistant, where he worked on multimodal biometrics. From 1997 to 2000, he was head of the computer vision group at IDIAP, where he initiated and led several European Community and Swiss SNF projects in the area of biometrics, speech recognition, face analysis and document recognition. In 2000, he joined Ascom AG in Maegenwil, Switzerland, as head of the technology area Pattern Recognition. Dr. Luettin has been a visiting researcher at the Center for Language and Speech Processing at the Johns Hopkins University, Baltimore, in 1997 (large vocabulary conversational speech recognition) and 2000 (audio–visual speech recognition). His research interests include speech recognition, computer vision, biometrics, and multimodal recognition.

