Copyright by Pengyu Hong, 2001
AN INTEGRATED FRAMEWORK FOR FACE MODELING, FACIAL MOTION ANALYSIS AND SYNTHESIS
BY
PENGYU HONG
B.Engr., Tsinghua University, 1995
M.Engr., Tsinghua University, 1997
THESIS
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the University of Illinois at Urbana-Champaign, 2001
Urbana, Illinois
ABSTRACT
This dissertation presents an integrated framework for face modeling, facial motion
analysis, and facial motion synthesis. This framework systematically addresses three
closely related research issues: (1) selecting a quantitative representation of facial defor-
mation for face modeling and animation; (2) automatic facial motion analysis based on
the same visual representation; and (3) speech-to-facial-coarticulation modeling. The
framework provides a guideline for methodically building a face modeling and animation
system. The systematic nature of the framework is reflected in the links among its components, whose details are presented. Based on this framework, a face modeling and animation system, called the iFACE system, is developed. The system provides functionalities for customizing a generic face model for an individual, text-driven face animation, off-line speech-driven face animation, and real-time speech-driven face animation.
ACKNOWLEDGMENTS
I would like to take this opportunity to express my appreciation to all the people who
have guided, supported, and encouraged me.
First, my sincere gratitude goes to my mentor, Professor Thomas S. Huang, for his ad-
vice, support, and encouragement. His kindness and humor made my research environ-
ment much more enjoyable. I thank my committee members, Professor Sylvian R. Ray,
Professor Michael Garland, and Professor David Goldberg for their invaluable comments.
I should particularly thank Dr. Harry Shum at Microsoft Research Lab and Dr. Jialin
Zhong at Bell Labs. In the summer of 1998, I worked for Dr. Zhong as an intern on the
text-to-visual-speech project. I worked for Dr. Shum as a summer intern on the facial mo-
tion modeling and tracking project in 2000. I really enjoyed the inspiring discussions with
Dr. Zhong and Dr. Shum and their invaluable suggestions. My continued efforts in these
directions produced fruitful results that have become important components of this
thesis.
I especially thank Professor Stephen E. Levinson. During my thesis work on speech-
driven face animation, Professor Levinson was always ready to provide generous and
stimulating suggestions based on his expertise in speech processing and recognition.
My lab colleagues have also provided critical help. I especially thank Zhen Wen, You
Zhang, Larry Chen, Steve Chu, Roy Wang, and Ira Cohen. We worked together on the
AVATAR Demo for the Army Research Laboratory Symposium 2000. I specifically
thank Zhen Wen, my closest partner, who has been particularly cooperative and suppor-
tive. I thank other former and current graduate students in our group for the discussions
and support. Thanks also go to Dr. Russell L. Storms and Dr. Larry Shattuck, who
kindly provided their face data for this research.
Many individuals provided additional valuable technical assistance. The constant, prompt
help of our system administrators, Gabriel Lopez-Walle, Rachael Brady, and Hank
Kaczmarski, made the research environment much more comfortable. The secretaries,
Sharon Collins, Wendy Harris, and Kathie Alblinger, were always cheerful and helpful.
Finally, there are others who have influenced this research indirectly, but fundamentally,
through their influence on my life. They are my parents, sister, and brother-in-law, whose
love, patience, and encouragement made this research possible. THANKS!!!
TABLE OF CONTENTS

CHAPTER

1 INTRODUCTION
   1.1 Overview
   1.2 Previous Research
      1.2.1 Face modeling
      1.2.2 Face animation
      1.2.3 Facial motion analysis
   1.3 The Approach – An Integrated Framework for Face Modeling, Facial Motion Analysis, and Synthesis

2 MOTION UNITS AND FACE ANIMATION
   2.1 Collect Training Data for Learning Motion Units
   2.2 Learning Motion Units
   2.3 Use MUs to Animate Face Model
      2.3.1 MU and key frame
      2.3.2 MU and MPEG-4 FAP
   2.4 Discussion

3 MU-BASED FACIAL MOTION TRACKING
   3.1 Model Initialization
   3.2 Tracking as a Weighted Least Square Fitting Problem
      3.2.1 Modelless tracking
      3.2.2 Constrained by MUs
   3.3 Improving the MU-based Facial Motion Tracking Algorithm
   3.4 Experimental Results
   3.5 Discussion
   3.6 3D MU-based Facial Motion Tracking
   3.7 3D MU-based Facial Motion Tracking Using Multiple Cameras
   3.8 3D MU-BSV-based Facial Motion Tracking

4 MU-BASED REAL-TIME SPEECH-DRIVEN FACE ANIMATION
   4.1 Linear Audio-to-Visual Mapping
   4.2 Local Linear Audio-to-Visual Mapping
   4.3 Nonlinear Audio-to-Visual Mapping Using ANN
   4.4 Experimental Results
      4.4.1 Collect training and testing data
      4.4.2 Implementation
      4.4.3 Evaluation
      4.4.4 A speech-driven face animation example

5 THE IFACE SYSTEM
   5.1 Introduction
   5.2 Generic Face Model
   5.3 Customize the Face Model
   5.4 Face Deformation Control Model
   5.5 Text Driven Face Animation
   5.6 Off-line Speech-driven Face Animation
   5.7 Real-Time Speech-driven Face Animation
   5.8 The iFACE System in the Distributed Collaborative Environments

6 CONCLUSIONS AND FUTURE WORK
   6.1 Summary
   6.2 Future Research
      6.2.1 Explore better visual representation
      6.2.2 Improve and evaluate the facial motion tracking algorithm
      6.2.3 Refine audio-to-visual mapping
      6.2.4 Human perception on synthetic talking face
      6.2.5 Improve the tongue models
   6.3 Improving the Key Frames of the iFACE System

REFERENCES

VITA
LIST OF TABLES

Table 2.1 MU, ASM, AAM, eigenlips and eigen-points.
Table 4.1 Real-time speech driven evaluation I.
Table 4.2 Real-time speech driven evaluation II.
Table 4.3 Real-time speech driven evaluation III.
Table 4.4 Real-time speech driven evaluation IV.
Table 5.1 Phoneme and viseme used in the iFACE system.
LIST OF FIGURES
Figure 1.1 An integrated framework for face modeling, facial motion analysis, and synthesis.
Figure 2.1 An example of the labeled data and the mesh model.
Figure 2.2 Facial muscles.
Figure 2.3 MUs.
Figure 2.4 MPEG-4 feature points.
Figure 2.5 The facial animation parameter units.
Figure 3.1 Model initialization for tracking.
Figure 3.2 Comparison of the tracking results on an unmarked face using the MU-based facial motion tracking algorithm, template matching, and the KLT trackers.
Figure 3.3 Comparison of the tracking results on a marked face using the MU-based facial motion tracking algorithm, template matching, and the KLT tracker.
Figure 4.1 Local linear audio-to-visual mapping.
Figure 4.2 MLP for nonlinear audio-to-visual mapping.
Figure 4.3 The estimation results of the global linear mapping.
Figure 4.4 The estimation results of the local linear mapping.
Figure 4.5 The estimation results of the nonlinear mapping using neural networks.
Figure 5.1 The generic geometry face model.
Figure 5.2 An example of the Cyberware™ cyberscanner data.
Figure 5.3 The coarse model in 2D cylindrical coordinate space.
Figure 5.4 The landmarks divide the head surface into many local rectangular regions in the cylindrical coordinate space.
Figure 5.5 Select feature points on the texture map.
Figure 5.6 A semi-finished face model and the model editor.
Figure 5.7 Examples of the customized face model.
Figure 5.8 The control model.
Figure 5.9 Local affine transformation for facial surface deformation.
Figure 5.10 Create facial shape using the model editor.
Figure 5.11 Examples of facial expressions and visemes.
Figure 5.12 The architecture of text driven face animation.
Figure 5.13 The architecture of off-line speech-driven face animation.
Figure 5.14 An example of off-line speech-driven face animation.
Figure 5.15 An example of nonlinear real-time speech-driven face animation.
Figure 5.16 A shoulder model is added to the face model.
Figure 5.17 The iFACE system in a distributed collaborative environment.
CHAPTER 1
1 INTRODUCTION
Synthetic graphic talking faces provide an effective solution for delivering and displaying
communication information. The applications include 3D model-based very low bit rate
video coding for visual telecommunication [1], [55], video conferencing [13], and talking
head representations of computer agents [57], [84]. Research has consistently shown that
the perception of speech is inherently multimodal [48], [49], [74]. In noisy environ-
ments, a synthetic talking face can help users to understand the associated speech [48],
and it helps people react more positively in interactive services [61], for example, for E-
commerce. A synthetic talking face has also been found to help students learn better in
computer-aided education [16].
Graphic avatars have been developed to enhance conversational cues in multiple user
immersive collaboration environments (e.g., DIVE [11], GreenSpace [47], Interspace
[60], and NetICE [42]). An important research issue in developing avatars is how to natu-
rally and realistically animate the faces of the avatars. In many real world situations, such
as field collaboration, participants are mobile. Therefore, stable high bandwidth cannot be
guaranteed. A real-time speech-driven graphic avatar provides an effective solution. Dis-
tant participants can be represented as graphic avatars and displayed in the immersive
environments. The faces of the avatars are driven by speech that only requires very low
bandwidth to transmit.
1.1 Overview
This thesis is organized as follows. In the remainder of Chapter 1, we review previous
research and present an integrated framework for building a face modeling, facial motion
analysis and synthesis system. The framework provides a guideline to systematically de-
velop face modeling and animation systems. The details of the framework are described
in Chapters 2, 3, and 4. First, a quantitative representation of facial de-
formation, called Motion Unit (MU), is introduced in Chapter 2. MU is the core compo-
nent of the framework. It will be shown how to use MU for realistic face animation. In
Chapter 3, MUs are used to develop a robust MU-based facial motion tracking algorithm.
In Chapter 4, the tracking algorithm is used to analyze facial movements and an audio-
visual dataset is collected. Two approaches for training real-time audio-to-visual map-
pings are described. Experimental results of the facial motion tracking and real-time au-
dio-to-visual mappings are shown. Based on this framework, we developed a face model-
ing and animation system, called the iFACE system [32], which will be presented in
Chapter 5. The demos of the iFACE system can be found at the following web page:
http://www.ifp.uiuc.edu/~hong/Research/face.htm. Finally, this thesis closes with some
conclusions and future research directions.
1.2 Previous Research
This section reviews previous research on face modeling, facial motion analysis and syn-
thesis, and speech-driven face animation. There has been a large amount of research on
face modeling and animation [24], [64]. One main goal of face modeling is to investigate
how to deform a facial surface spatially, or develop a facial deformation control model.
The key research issue of face animation is how to deform a facial surface temporally, or
construct a facial coarticulation model. To realistically animate the face model, analysis
of real facial motion is required for modeling the facial coarticulation. It has been shown
that facial coarticulation is highly correlated with the vocal tract [87]. Speech is an im-
portant medium that has been used to drive a face model. Speech-driven face animation
not only needs to deal with face modeling and animation, but also needs to develop a
mapping from audio to facial coarticulation.
1.2.1 Face modeling
Human faces are commonly modeled as free-form geometric mesh models [32], [35],
[58], [75], [86], parameterized geometric mesh models [62], [63], [65], or physics-based
models [41], [76], [82]. Each face model has its deformation control model.
• Free-form face model
Free-form face model approaches explicitly define a control model to deform the face
model. Once the coordinates of the control points are decided, the remaining vertices on
the face model are deformed by interpolation. Popular interpolation functions include
affine functions [32], B-spline functions [58], cardinal splines and springs [80], radial
basis functions [59], [86], the combination of affine functions and radial basis functions
[67], rational functions [35], and the Bezier volume model [75].
One of the main research issues of free-form face modeling is how to design an interpola-
tion mechanism that is faithful to real face deformation. So far, no objective evaluation
experiments have been done for the above interpolation methods. If the density of the
control points is high enough, the above deformation methods can be used to approximate
the facial surface. However, a high density of control points raises another problem:
how to move those control points. Manual adjustment can achieve good results, but is
difficult and labor intensive. Automatic adjustment is itself an open question. Adjusting
control points requires considering the relations among them, because they are not
independent. The above free-form face modeling methods do not address the relations
among control points in a theoretically sound way.
• Parameterized face model
Parameterized mesh models use a set of parameters to decide the shapes of the face mod-
els [62], [63], [65]. The coordinates of some anchor vertices are first calculated using a
set of predefined functions whose variables are those parameters. The coordinates of the
remaining vertices are then calculated by a set of predefined interpolation functions
whose variables are those parameters and the coordinates of those anchor vertices. How-
ever, there is no systematic way or theoretical basis for designing those functions (both
for the anchor vertices and the remaining vertices), deciding the values of the parameters in the
function, and choosing anchor vertices.
• Physics-based face model
Physics-based models simulate facial skin, tissue, and muscles by multilayer dense
meshes [41], [76], [82]. Facial surface deformation is triggered by the contractions of the
synthetic facial muscles. The muscle forces are propagated through the skin layer, and
thereby deform the facial surface. The simulation procedure solves a set of dynamics
equations. However, the sophistication of the physical models of facial muscles, skin and
tissue makes physics-based model approaches computationally intensive. In addition, de-
termining the parameters of the physics-based face models is an art.
1.2.2 Face animation
Once the facial deformation control model is decided, a face model can be animated by
temporally adjusting its parameters according to its facial coarticulation model.
• Function-based facial coarticulation model
Some approaches model facial coarticulation explicitly with certain functional forms [15],
[66]. Pelachaud et al. [66] used a look-ahead model for visual speech synthesis. They use
the Facial Action Coding System (FACS), which was proposed by P. Ekman and W.
Friesen [23], to describe facial deformations. FACS is based on anatomical studies on
facial muscular activity and it enumerates all Action Units (AUs) of a face that cause fa-
cial movements. Currently, FACS is widely used as the underlying visual representation
for facial motion analysis, coding, and animation. In their face animation system [66],
AUs are manually designed and are assumed to be additive. In other words, facial defor-
mation can be calculated by a linear combination of AUs. Phonemes1 are assigned
high or low deformability ranks. A set of forward and backward coarticulation rules is
intuitively designed to link the speech intonation and emotion with facial deformation.
The rules describe a set of functions that are used to compute the intensity of a facial ac-
tion unit in proportion to the speech rate.
1 A phoneme is a member of the set of the smallest units of speech that serve to distinguish one utterance from another in a language or dialect.
Cohen and Massaro [15] used a parameterized geometrical face model, which is a de-
scendant of Parke’s face model [62]. They adopt the Löfqvist gestural production model
[45] as the facial coarticulation model to drive the face model. Although the Löfqvist ges-
tural production model is based on empirical observations, it is explicit form is designed
subjectively. In addition, the Löfqvist gestural production model requires that the pho-
neme sequence should be known.
Neither of the approaches described in [15] and [66] is appropriate for real-time online face
animation. In addition, the coarticulation functions are designed subjectively and may not
well represent real facial dynamics.
• Performance-driven face animation
The philosophy of performance-driven face animation approaches is that wool comes
from the sheep. This kind of approach automatically analyzes real facial movements us-
ing computer vision techniques. The analysis results are used to animate graphic face
models. Therefore, they can achieve natural face animation by using information about
real facial deformation.
Williams [86] and Guenter et al. [29] used simple computer vision techniques to track the
markers on the face of a human subject. The tracking results are used directly to deform
the face models. In [86], each marker corresponds to a control point on the face model. A
set of warping kernels is designed and used to deform the vertices around the control
points. In [29], vertices are moved by a linear combination of the offsets of the nearest
markers. Nonetheless, both Williams [86] and Guenter et al. [29] required that intrusive
markers be put on the face of the subject. As will be discussed in Section 1.2.3, facial
motion analysis is very difficult if the face is not marked.
Other performance-driven face animation systems adopt analysis-based approaches [22],
[25], [44], [75], [77]. Analysis-based approaches extract information from a live video
sequence and use the extracted information for face animation. Such information corre-
sponds to muscle contractions of the physics-based face model in [77], the weights of the
AUs of the FACS in [44], [75], [77], or Moving Picture Experts Group 4 (MPEG-4) face
animation parameters (FAPs) [92] in [22]. The face of the subject in [77] is marked to
guarantee accurate tracking results. The tracking techniques used in [22], [44], [75], [77]
will be discussed in Section 1.2.3.
Although temporal information is correctly extracted to some degree, subjectivity is in-
troduced while deforming the face models. Physics-based face models require manually
deciding the values of a large number of physical parameters [77]. FACS was originally
proposed for psychology research [23]. It is more or less subjective and does not provide
quantitative information about face deformation. Users have to manually design AUs for
their face model. MPEG-4 FAPs provide the movement of only some facial features that
can be thought of as the control points of the face model. The rest of the face model still has
to be deformed by some predesigned warping/interpolation functions, which should be
addressed by facial deformation control models.
Overall, the problems are related to the facial deformation control models, which either
decide how to change the values of control parameters or are used to design AUs. If the
control model is just used for animation, only the animation results will be affected. The
tracking results will be greatly degraded if either inaccurate control models are used by
the tracking step or the animation results are fed back to the tracking step. Of course, cor-
rupted tracking results will further result in bad animation results. Therefore, research on
facial deformation control model, facial motion analysis, and face animation should be
carried out systematically.
• Speech-driven face animation
A problem with the performance-driven face animation approach is the speed and accu-
racy of its facial motion analysis algorithm. It requires high computation power in order
to obtain robust and accurate facial motion analysis results without putting intrusive
markers on the face of the actor/actress. An alternative way to drive the face model is
speech-driven face animation, which is more efficient than performance-driven face ani-
mation. This kind of approach takes advantage of the tight correlation between speech
and facial coarticulation. It takes speech signals as input and outputs a face animation se-
quence.
The audio-to-visual mapping is the main research issue of speech-driven face animation.
The audio information is usually represented as feature vectors of speech, for example,
linear predictive coding (LPC) Cepstrum, Mel-frequency cepstral coefficients (MFCC),
and so on. The visual information is usually represented as the parameters of the facial
deformation control model, for example, the weights of AUs, MPEG-4 FAPs, the coordi-
nates of control vertices of the face model, and so on. The mappings are learned from an
audio-visual training data set, which is collected in the following way. The facial
movements of talking subjects are tracked either manually or automatically. The tracking
results and the associated audio tracks are collected as the audio-visual training data.
Some speech-driven face animation approaches use phonemes or words as intermediate
representations. Lewis [43] used linear prediction to recognize phonemes. The recognized
phonemes are associated with mouth positions to provide keyframes for face animation.
However, the phoneme recognition rate of linear prediction is very low. Video Rewrite
[10] trains hidden Markov models (HMMs) [69] to automatically label phonemes in both
training audio track and new audio track. It models short-term mouth co-articulation us-
ing triphones. The mouth images for a new audio track are generated by reordering the
mouth images in the training footage, which requires a very large database. Video Re-
write is an offline approach and needs large computation resources. Chen and Rao [14]
train HMMs to segment the audio feature vectors of isolated words into state sequences.
Given the trained HMMs, the state probability for each time stamp is evaluated using the
Viterbi algorithm. The estimated visual features of all states can be weighted by the cor-
responding probabilities to obtain the final visual features, which are used for lip anima-
tion.
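As a concrete illustration of this weighting step (the notation below is mine, not taken from [14]): if $\gamma_t(j)$ denotes the probability of HMM state $j$ at time stamp $t$ and $\hat{\vec{v}}_j$ denotes the visual feature estimated for state $j$, then the final visual feature at time $t$ is

$$\hat{\vec{v}}_t = \sum_{j} \gamma_t(j)\, \hat{\vec{v}}_j .$$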
Another kind of HMM approach tries to map audio patterns to facial motion trajectories.
Voice Puppetry [8] uses an entropy minimization algorithm to train HMMs for the audio
to visual mapping. The mapping estimates a probability distribution over the manifold of
possible facial motions from the audio stream. A globally optimal closed-form solution is
derived to determine the most probable series of facial control parameters, given the be-
ginning and ending values of the parameters. An advantage of this approach is that it does
not require automatically recognizing speech into high-level meaningful symbols (e.g.,
phonemes, words), for which it is very difficult to obtain a high recognition rate. However, this
approach is an offline method.
Other approaches attempt to generate instantaneous lip shapes directly from each audio
frame using vector quantization, Gaussian mixture model, or artificial neural networks
(ANN). Vector quantization [53] is a classification-based audio-to-visual conversion ap-
proach. The audio features are classified into one of a number of classes. Each class is
then mapped onto a corresponding visual output. Though it is computationally efficient,
the vector quantization approach often leads to discontinuous mapping results. The Gaus-
sian mixture approach [70] models the joint probability distribution of the audio-visual
vectors as a Gaussian mixture. Each Gaussian mixture component generates an optimal
linear estimation for a visual feature given an audio feature. The estimations are then
nonlinearly weighted to produce the final visual estimation. The Gaussian mixture ap-
proach produces smoother results than the vector quantization approach. However, nei-
ther of the approaches described in [53] and [70] considers phonetic context information, which is
very important for modeling mouth coarticulation during speech. Moreover, they are
linear mappings, while the mapping from audio information to visual information is
nonlinear in nature.
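One common way to write the Gaussian mixture estimator sketched above (again, the notation is mine rather than that of [70]): with mixture components indexed by $k$ and joint audio-visual Gaussians with means $(\vec{\mu}_{a,k}, \vec{\mu}_{v,k})$ and covariance blocks $\Sigma_{aa,k}$ and $\Sigma_{va,k}$, the visual estimate for an audio feature $\vec{a}$ is

$$\hat{\vec{v}}(\vec{a}) = \sum_{k} p(k \mid \vec{a}) \left[ \vec{\mu}_{v,k} + \Sigma_{va,k} \Sigma_{aa,k}^{-1} (\vec{a} - \vec{\mu}_{a,k}) \right],$$

where each bracketed term is the per-component linear estimate and the posterior weights $p(k \mid \vec{a})$ depend nonlinearly on $\vec{a}$.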
Neural network based approaches try to find nonlinear audio-to-visual mappings. Mor-
ishima and Harashima [54] trained a three-layer neural network to map the LPC cepstrum
coefficients of a single time step of speech to mouth-shape parameters for five
vowels. Kshirsagar and Magnenat-Thalmann [39] also trained a three-layer neural net-
work to classify speech segments into vowels. Average energy was then used to modulate
the lip shapes of the recognized vowel. Nonetheless, again, the approaches described in
[39] and [54] do not consider phonetic context information, which is very important for
modeling mouth coarticulation during speech. In addition, they mainly consider the
mouth shapes of vowels and neglect the contribution of consonants during speech.
Massaro et al. [50] trained multilayer perceptrons (MLP) to map LPC cepstral parameters
to face animation parameters. They try to model the coarticulation by considering the
speech context information of five backward and five forward time windows. Another
way to model speech context information is to use the time delay neural network (TDNN)
model, which uses ordinary time delays to perform temporal processing. Lavagetto [40]
and Curinga et al. [20] train TDNN to map LPC cepstral coefficients of speech signal to
lip animation parameters. TDNN is chosen because it can model the temporal coarticula-
tion of lips and is more computationally efficient than HMM. Nevertheless, the artificial
neural networks used in [20], [40], [50] require a large number of hidden units, which
results in high computational complexity during the training phase. Vignoli et al. [81]
use self-organizing maps (SOM) as a classifier to perform vector quantization functions
and feed the classification results to a TDNN. The SOM reduces the input dimension of
the TDNN and hence the number of TDNN parameters. Therefore, the computational com-
plexity of TDNN is reduced. However, SOM is a hard decision classifier and its recogni-
tion results do not encode mouth coarticulation information. In order to reduce the input
dimension, SOM can only have a few nodes, which results in losing important audio in-
formation.
1.2.3 Facial motion analysis
As is shown in Section 1.2.2, facial motion analysis is very important. The analysis re-
sults can be used to directly drive the face model or train audio-to-visual mappings. There
has been a large amount of work done on facial feature tracking. Simple approaches only
utilize low-level image features. Their computational complexity is low, which makes them
suitable for real-time tracking tasks. For example, Goto et al. [28] extract edge information to find
salient facial feature regions (eyes, lips, etc). The extracted low-level image features are
compared with templates to estimate the shapes of the facial features.
However, it is not robust enough to use low-level image features alone. The errors will
quickly accumulate as the number of tracked frames increases. High-level
knowledge has been used to tackle this problem by imposing constraints on the possible
shapes/deformations of facial features. It has been shown that high-level knowledge is
essential for robust facial motion tracking. The tracking algorithm combines information
derived from low-level image processing and the high-level knowledge model to track
facial features. The high-level knowledge is usually explicitly represented as some kind
of shape model for facial features. Different model-based tracking algorithms differenti-
ate themselves by their shape models and their low-level image processing steps. We
summarize different model-based facial feature tracking algorithms according to their
high-level knowledge models below.
• B-spline curve
Blake et al. [6] proposed parametric B-spline curves for contour tracking. The tracking
problem is to estimate the control points of the B-spline curve so that the B-spline curve
matches the contour being tracked as closely as possible. A Kalman filter is incorporated
to track objects with high contrast edges. However, without global constraints, B-spline
curves tend to match contours locally, resulting in wrong matching among contour points,
which is called the sliding effect and is similar to the aperture problem in optical flow
calculation. The robustness of the algorithm could be improved by employing more sto-
chastic motion models [7]. The motion models can be learned from examples to represent
specific motion patterns. The motion model superimposes a constraint on the possible
solution subspace of the contour points. Therefore, it prevents generating physically im-
possible curves.
However, the absence of a sharp jump or sudden change around the lip boundary makes it
difficult to reliably track lip contours. Instead of using grey-level edge information, Kau-
cic and Blake [37] and Chan [12] utilized the characteristics of human skin color. They
proposed using either Bayesian classification or linear discriminant analysis to distin-
guish the lips from other areas of facial skin. Therefore, the contours of the lips can be ex-
tracted more reliably. It is well known that color segmentation is sensitive to lighting
conditions and the effectiveness of color segmentation depends on the subject. This can
be partially solved by training a color classifier for each individual. Nevertheless, the ap-
proaches described in [37] and [12] do not deal with rotation, translation and scaling of
lips.
• Snake
The Snake was first proposed by Kass et al. [36]. It starts from a given initial position and
deforms itself to match with the nearest salient contour. The matching procedure is for-
mulated as an energy minimization process. In basic Snake-based tracking, the function
to be minimized includes two energy terms: (1) internal spline energy caused by stretch-
ing and bending, and (2) measure of the attraction of image features such as contours. B-
spline [6] is a “least squares” style Snake algorithm (a Kalman filter). Snakes rely on
gray-level gradient information when measuring their energy terms. How-
ever, it is well known that gray-level gradients are inadequate for identifying the outer lip
contour [88]. Therefore, the facial features being tracked are highlighted by makeup in
[77]. Otherwise, Snakes very often align onto undesirable local minima.
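For reference, the basic Snake energy of Kass et al. [36] for a contour $\vec{v}(s)$, $s \in [0, 1]$, is commonly written (with constant weights $\alpha$ and $\beta$) as

$$E = \int_0^1 \left[ \frac{1}{2}\left( \alpha \|\vec{v}\,'(s)\|^2 + \beta \|\vec{v}\,''(s)\|^2 \right) + E_{\mathrm{image}}(\vec{v}(s)) \right] ds,$$

where the first two terms are the internal stretching and bending energies and $E_{\mathrm{image}}$ measures the attraction of image features such as edges; tracking deforms the contour toward a local minimum of $E$.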
To improve Snakes, Bregler and Konig [9] propose eigenlips that incorporate a lip shape
manifold into the Snake tracker for lip tracking. The shape manifold is learned from training
sequences of lip shapes. The function of the shape manifold is similar to the stochastic
motion model for B-spline in [7]. It imposes global constraints on the Snake. The local
search for maximum gray-level gradients is guided by the globally learned lip shape
space.
• Deformable template
In [88], a facial feature is defined as a deformable template, which includes a parametric
geometrical model and an imaging model. The deformable template treats tracking as an in-
terpretation by synthesis problem. The geometrical model describes how the shape of the
template can be deformed and is used to measure shape fitness of the template. The imag-
ing model describes how to generate an instance of the template and is used to measure
the intensity fitness of the template. An energy function is designed to link different types
of low-level image features, e.g., intensity, peaks, valleys, and edges, to the correspond-
ing properties of the template. The parameters of the template are calculated by steepest
descent. Both B-spline and Snake can be thought of as special cases of deformable tem-
plate, which only utilize edge information in the image. Nevertheless, the parametric fa-
cial feature models are usually defined subjectively.
• ASM, AAM, and eigen-points
Active Shape Model (ASM) [17], Active Appearance Model (AAM) [52], and eigen-
points [19] utilize both contour and appearance to model the facial features. The motiva-
tions of ASM, AAM, and eigen-points are similar to that of the deformable template [88].
They all treat tracking as an interpretation by synthesis problem. ASM, AAM, and eigen-
points try to achieve robust performance by using the high-level model to constrain solu-
tions to be valid examples of the object being tracked. The appearance of the object is
explained by the high-level model as a compact set of model parameters. The models
used by ASM, AAM, and eigen-points are the eigen-features of the object modeled.
ASM and eigen-points model the shape variation of a set of landmark points and the tex-
ture variation in the areas around landmark points. AAM models the whole shape and the
appearance of the object. All of them require manually labeling training data, which is
labor intensive. In order to handle various lighting conditions, the texture part of the
training data should cover a broad enough range of lighting conditions. Both ASM and AAM rely
on iterative solutions. The eigen-points approach avoids the iterative procedure. Instead,
it estimates the parameters in a sequence of matrix operations: orthogonal projection, scaling,
and orthogonal projection.
Since all three methods model the texture of the object, the user cannot put markers
on the object. The training data need to be carefully labeled so that the correspondences
between the landmarks across training samples are physically correctly established.
• Parametric 3D model
DeCarlo and Metaxas [21] propose an approach that combines a deformable model space
and multiple image cues (optical flow and edge information) to track facial motions. The
edge information used is chosen around certain facial features, such as the boundary of
the lips and eyes, and the top boundary of the eyebrows. To avoid high computational
complexity, optical flow is calculated only for a set of image pixels. Those image pixels
are chosen in the region covered by the face model using the method proposed by Shi and
Tomasi [72]. The deformable model [21] is a parametric geometric mesh model. The pa-
rameters are manually designed based on a system of anthropometric measurements of
the face. By changing the values of the parameters, the user can obtain a different basic
face shape and deform the basic face shape locally.
The deformable model [21] helps to prevent producing unlikely facial shapes during
tracking. However, it has limitations in its coverage: many facial motions cannot be
represented accurately, for example, many of the lip deformations produced during
speech.
• FACS-based 3D model
Some facial motion tracking algorithms design the high-level models based on AUs de-
fined by FACS [25], [44], [75]. The FACS based 3D models impose constraints on the
subspace of plausible facial shapes. The motion parameters are separated into global
face motion parameters (rotation and translation) and local facial deformation parameters,
which correspond to the weights of AUs in [44], [75] and to the FACS-like control pa-
rameters in [25]. First, the movements of the vertices on the model are calculated using
some kind of optical flow technique. The optical flow results are usually noisy. The
model is then used to constrain the optical flow, and the motion parameters are calculated
by a least squares estimator.
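A schematic form of this estimation step (my notation, not taken from [25], [44], or [75]): if $\vec{u}$ stacks the optical flow displacements observed at the model vertices and $L$ is the linearized model that maps the global motion parameters and AU weights $\vec{p}$ to vertex displacements, then

$$\hat{\vec{p}} = \arg\min_{\vec{p}} \|\vec{u} - L\vec{p}\|^2 = (L^T L)^{-1} L^T \vec{u},$$

so the AU (or FACS-like) subspace acts as the constraint that regularizes the noisy flow.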
However, FACS was originally proposed for psychology research and does not provide
quantitative information about facial deformations. To utilize FACS, researchers need to
manually design the parameters of their model to obtain the AUs. In [44], a parametric
geometrical face model, called Candide, was used. The Candide model contains a set of
parameters for controlling facial shape. In [75], Tao and Huang used a piecewise Bezier
volume deformable face model, which can be deformed by changing the coordinates of
the control vertices of the Bezier volumes. In [25], Essa and Pentland extended a mesh
face model, which was developed by Platt and Badler [68], into a topologically invariant
physics-based model by adding anatomy-based "muscles," which are defined by FACS.
Overall, the basic problem still lies in the facial deformation control models, which ei-
ther decide how to change the values of parameters or are used to design AUs. If the con-
trol model is used just for animation, only the animation results will be affected. The
tracking results will be greatly degraded if either inaccurate control models are used by
the tracking step or the animation results are fed back to the tracking step. Of course, cor-
rupted tracking results will further result in bad animation results.
• MPEG-4 FAP-based B-spline surface
Eisert et al. [22] proposed a model-based analysis-synthesis loop to estimate face and fa-
cial motions. The head model is assumed to be a B-spline surface. The shape of the head
model is determined by 231 control points of the B-spline surface. The head model is tex-
ture mapped. They represent the motion and deformation of the 3-D head model by the
facial animation parameters (FAPs) based on the MPEG-4 standard. An intensity gradi-
ent-based approach that exploits the 3-D head model information is used to estimate the
FAPs directly. The problem is also formulated as a least square model fitting problem
given (1) the possible deformations in the head model that can be controlled by the FAPs
and (2) the constraints on the changes in the FAPs between two successive frames. How-
ever, it is an art to design a B-spline surface model so that it can well represent a facial
surface. The problems mainly arise on the number of the control vertices and the loca-
tions of the control points.
Basically, the underlying idea of the approach in [22] is the same as that of the approaches
described in [25], [44], [75]. They all form a model-based analysis-synthesis loop except
that their models appear in different forms. Therefore, Eisert et al.’s approach [22] should
face the same problem confronted by the approaches proposed in [25], [44], [75].
• 3D model learned from real data
Some approaches train their 3D models using a set of labeled real facial deformation data
[3], [71]. The approaches presented in [3] and [71] only deal with lips. The trained 3D
model is able to encode the information of real lip deformations. Color classes for the lips
and face are trained to estimate the class probability of each pixel. The tracking problem
is formulated as finding the lip shape within the subspace that maximizes the posterior
probability of the model given the observed color features of lips and facial skin. The
methods used for estimating the parameters in [3] and [71] are variants of gradient
ascent.
1.3 The Approach – An Integrated Framework for Face Modeling, Facial Motion Analysis, and Synthesis
Given such a significant amount of previous research on face modeling, facial motion
analysis, and synthesis, it remains unclear which aspects are most important for achieving
realistic face animation. This thesis advocates that the research on face modeling, fa-
cial motion analysis, and synthesis should be carried out systematically. Both the facial de-
formation model and the coarticulation model should be based on extensive analysis of
real facial movement data. A framework is needed to guide this research. The major
contribution of this thesis is to present an integrated framework (Figure 1.1) for face model-
ing, facial motion analysis, and synthesis.
A set of MUs is used as the quantitative visual representation of facial deformations. The
same visual representation is used both by face animation and facial motion analysis.
MUs are learned from a set of labeled real facial shapes and are used for face modeling.
Arbitrary facial deformation can be approximated by a linear combination of MUs, which
are weighted by MU parameters (MUPs). We can animate the face model by adjusting
MUPs. It will be shown that the MU-based face animation technique is compatible with
existing popular face animation techniques/standards, such as key frame techniques and
MPEG-4 FAPs.
Within this framework, a MU-based facial motion tracking algorithm is presented. MUs
are used as the high-level knowledge by the tracking algorithm to attain robust facial mo-
tion analysis results. The tracking results are represented as an MUP sequence. The
tracking results can be used directly for face animation or for other training and
recognition purposes.
A set of facial motion tracking results and the corresponding audio tracks are collected as
the audio-visual database. Machine learning techniques are applied to training two real-
time speech-driven face animation algorithms using the collected audio-visual database.
The algorithms map audio features to MUPs, which are used to animate face models via
MU-compatible face animation techniques.
In the following three chapters, the details of each part of the framework are presented
and discussed. Experimental results are shown.
Figure 1.1 An integrated framework for face modeling, facial motion analysis, and synthesis.
[Figure 1.1 block diagram. Block labels: labeled facial deformations; learn facial deformations; Motion Units; face image sequence; MU-based facial motion analysis; MUP sequence; video database; speech stream; train speech-to-MUP mapping; real-time speech-to-MUP mapping; new speech stream; convert speech to MUPs; MU-based face animation; graphic face animation sequence with texture.]
CHAPTER 2
2 MOTION UNITS AND FACE ANIMATION
The framework requires a basic information unit to establish information flow and link
all its active components together.1 This basic information unit is the visual representa-
tion of facial shape and deformation, which has been an important issue since the emer-
gence of computer face animation. The visual representation should be suitable for com-
putation and have sound representation power.
This chapter presents the MU as the quantitative visual representation. MU is inspired by
the AU of the FACS [23]. The main difference is that MUs are learned from real facial
deformation data and encode the characteristics of real facial deformations. Therefore
MUs are more suitable for computing purposes and synthesizing natural facial move-
ments. Currently, most existing facial surface models are mesh models. Therefore, the
appearance of the MU is set as a mesh in this thesis. MUs directly model the facial surface
without using any intermediate control model.
2.1 Collect Training Data for Learning Motion Units
We mark sixty-two points around the subject’s lower face (Figure 2.1(a)). The number of
the markers affects the representation capacity of the MU. More markers will enable MU
to encode more information. Depending on the need and context of the system, the user
can flexibly decide the number of the markers while still following the guidance provided
by this framework. The only guideline is to put more markers in the areas where muscle
distributions are more complicated, such as lips.
1 The appearance of the basic information unit or the way to calculate it could be different as long as it is qualified for the information flow connecting the components of the framework.
Currently, only 2D motion of the lower face is considered. This decision balances
effectiveness against the time required to develop the demonstration of the framework.
Firstly, the lower face accounts for the most complicated part of the facial movements.
This is evidenced by the anatomy of facial muscles (Figure 2.2) and the underlying struc-
ture of the skull. The configuration of the upper facial muscles (forehead and eyelids) is
much simpler than that of the lower facial muscles. Therefore, the upper face can only de-
form in a simpler way. The natural movements of the eyelids include only opening and
closing. The only movable part of the skull is the jaw, whose movements affect only the
deformations of the lower facial surface. Therefore, the lower face can have more com-
plicated deformations than the upper face.
Secondly, as far as facial motion during speech is concerned, the movement of the
lower face and that of the upper face are independent. If the expressions are considered,
the training data should cover the movements of the upper face. If the face model to be
animated is always facing the user without turning around, 3D deformation of the face
model will not be an issue. Even when 3D deformation must be considered, it will be
shown in Section 2.4 that 2D MUs can be used to infer the deformation along the third
axis.
Figure 2.1 An example of the labeled data and the mesh model.
(a) Markers (b) Mesh
Thirdly, the facial motions in the lower face have little influence on the facial motions in
the upper face and vice versa [23]. Hence, we can treat them separately.
Future work will use more markers and will cover 3D motion of the whole face, following
the same framework. It should be emphasized again that the contribution of this framework
is to provide a systematic guideline for building a face modeling and animation system; it
is the spirit of the framework that is meant to be transmitted and propagated.
A mesh model is created according to those markers (Figure 2.1(b)). The lines among
those vertices are just for visualization purposes now. Exploring the adjacent relations
among those points, which are represented by those lines, is an interesting research topic
that can be investigated in the future. This mesh model is further used in a facial motion
tracking algorithm which will be described in Chapter 3. The subject is asked to wear a
pair of glasses on which two additional points are marked. Since the glasses undergo only
rigid motion, those two points on the glasses can be used for data alignment.
Figure 2.2 Facial muscles.2
2 Source: http://predator.pnb.uconn.edu/beta/virtualtemp/muscle/Muscle-Anatomy-Pages/Anatomy-Pages/anatomy-facial.html
We attempt to include as great a variety of facial deformations as possible in the training
data and capture video of facial movements for pronouncing all English phonemes. The
video is digitized at 30 frames per second, which results in over 1000 samples. The
markers are automatically tracked by a zero-mean normalized cross-correlation template
matching technique [27]. A graphic interactive interface is developed for the user to cor-
rect the positions of trackers when the template matching fails due to large face or facial
movements. In that interface, each tracker corresponds to a vertex on the mesh. The user
can use a mouse to drag the vertices of the mesh and consequently change the positions
of the trackers. The tracking results are aligned by rotation, scaling, and translation so
that the two markers on the glasses are coincident for all the data samples.
2.2 Learning Motion Units
Principal component analysis (PCA) [34] has been extensively used to model the signifi-
cant characteristics of the samples [38], [51], [79]. In this work, PCA is also used to learn
a set of MUs that span the facial deformation space. Although lip shapes may differ from
person to person, we hope that the deformation space is more consistent.
A data sample in the training set $S = \{\vec{s}\}$ is represented as a vector $\vec{s} = [x_1, y_1, \ldots, x_n, y_n]^T$ ($n = 62$), which is formed by concatenating the coordinates of the markers after normalization. Let $\vec{s}_0$ be the neutral facial shape. The deformation vector of each data sample is calculated as $\vec{d}_i = \vec{s}_i - \vec{s}_0$, $i = 1, \ldots, P$, where $P$ is the size of the training data set. In this way, we obtain the deformation vector set of the training data set as $D = \{\vec{d}_1, \ldots, \vec{d}_P\}$.

The mean and the covariance matrix of $D$ are calculated by $\vec{m}_0 = E[\vec{d}_i]$ and $\Sigma = E[(\vec{d}_i - \vec{m}_0)(\vec{d}_i - \vec{m}_0)^T]$. The eigenvectors and eigenvalues of $\Sigma$ are calculated. The first $K$ (in our case, $K = 7$) significant eigenvectors $\vec{m}_1, \vec{m}_2, \ldots, \vec{m}_K$, which correspond to the $K$ largest eigenvalues, are selected. They account for 97.56% of the facial deformation variation in the training data set. More eigenvectors could be chosen; however, the representation power of the chosen eigenvectors, in terms of the preserved facial deformation variation, increases little once the number of chosen eigenvectors exceeds 7.

We call $\{\vec{m}_0, \vec{m}_1, \ldots, \vec{m}_K\}$ the MU set. The chosen MUs are illustrated in Figure 2.3. Each mesh in Figure 2.3 is derived by $\vec{s}_0 + \rho \vec{m}_i$ ($\rho = 25$). They respectively represent the mean deformation and local deformations around the lips, mouth corners, and cheeks.

Any facial shape $\vec{s}$ and the corresponding deformation vector $\vec{d}$ can be represented by

$$\vec{d} = \vec{m}_0 + \sum_{i=1}^{K} c_i \vec{m}_i, \qquad \vec{s} = \vec{s}_0 + \vec{d} \qquad (2.1)$$

where $\{c_i\}$ is the MU parameter set and $c_i = (\vec{s} - \vec{s}_0 - \vec{m}_0)^T \vec{m}_i$, $i = 1, \ldots, K$. By adjusting the $c_i$, we can obtain different facial shapes in the space defined by the MUs.
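The computation described above can be summarized in a short numerical sketch (a minimal illustration using NumPy; the array names `shapes` and `s0` are hypothetical and this is not the thesis's actual implementation):

```python
import numpy as np

def learn_motion_units(shapes, s0, K=7):
    """shapes: P x 2n array of aligned marker coordinates, one facial shape per row.
    s0: length-2n neutral facial shape.
    Returns the mean deformation m0 (MU 0) and a K x 2n matrix M of MUs 1..K."""
    D = shapes - s0                          # deformation vectors d_i = s_i - s_0
    m0 = D.mean(axis=0)                      # mean deformation
    cov = np.cov(D, rowvar=False)            # covariance matrix of the deformations
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigen-decomposition (ascending eigenvalues)
    top = np.argsort(eigvals)[::-1][:K]      # indices of the K largest eigenvalues
    return m0, eigvecs[:, top].T             # rows of the returned matrix are the MUs

def shape_from_mups(s0, m0, M, c):
    """Equation (2.1): s = s_0 + m_0 + sum_i c_i m_i."""
    return s0 + m0 + M.T @ c

def mups_from_shape(s, s0, m0, M):
    """c_i = (s - s_0 - m_0)^T m_i, using the orthonormality of the MUs."""
    return M @ (s - s0 - m0)
```

Because the eigenvectors returned by PCA are orthonormal, computing the MUPs of a new shape reduces to a single matrix-vector product, and reconstructing a shape from its MUPs is equally cheap.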
MU is related to the eigen-model in our previous work [30], [31], the ASM [17], AAM
[18], eigenlips [9], and eigen-points [19]. All the above visual representations are learned
using PCA and are applied to modeling facial features. The shared underlying assumption
is that the distribution of facial (or facial feature) deformations/shapes or appearance is
Gaussian. Any instance can be approximated by a linear combination of some bases
learned by PCA. In [30] and [31], the PCA learning results are further used to synthesize
new mouth movement sequences. Table 2.1 lists the properties of MU, ASM, AAM, ei-
genlips, and eigen-points. MU does not model the facial appearance. Modeling appear-
ance is very difficult. In order to handle various lighting conditions and races, the texture
part of the training data should cover a broad enough set of samples, which requires
intensive manual work. Moreover, such a large texture training database would be beyond the
modeling capacity of PCA because PCA assumes the training data are Gaussian.
Figure 2.3 MUs: (a) MU 0, (b) MU 1, (c) MU 2, (d) MU 3, (e) MU 4, (f) MU 5, (g) MU 6, (h) MU 7.
Table 2.1 MU, ASM, AAM, eigenlips and eigen-points.

Learned from real data:
   MU: Yes; eigenlips: Yes; eigen-points: Yes; ASM: Yes; AAM: Yes.
Modeling the shape of the object:
   MU: Yes; eigenlips: No; eigen-points: Yes; ASM: Yes; AAM: Yes.
Geometric appearance of the model:
   MU: triangular mesh; eigenlips: Snake curves (contours of the lips); eigen-points: point cloud; ASM: curves consisting of landmark points around the contours of facial features; AAM: same as ASM.
Modeling the appearance of the object:
   MU: No; eigenlips: models the whole lips; eigen-points: models the texture in small regions around each feature point; ASM: models the texture in small regions around each landmark point; AAM: models the global appearance.
Modeling the joint variation of the appearance and the shape:
   MU: No; eigenlips: No; eigen-points: Yes; ASM: No; AAM: No.
MU is also related to the lip models reported in [2] and [71]. The lip model in [71] has 30 control points; the remaining vertices are generated by cubic interpolation curves, which introduce artifacts. Moreover, the training data in [71] are collected by manually and subjectively adjusting the control points. The training data set of MU and that in [2] provide ground-truth information because they are collected by putting markers on the face. However, [2] uses a complicated physics-based control model that increases the computational complexity. As we will show in the rest of this chapter, it is efficient and sufficient to directly model facial deformation for animation purposes.
In fact, MUs can be used to deform the control points of facial deformation control models by designing MUs to include elements that correspond to those control points. This can easily be achieved by making the marker set on the subject's face cover those control points while the training data of MUs are collected. An advantage of MU is that the way MUs are calculated exploits the correlations among markers. Since MUs will be used in facial motion tracking later, it is better to let MUs have more points than the number of control points, because the number of control points on a face model is usually too small. It is difficult to achieve robust facial motion analysis by tracking only a small number of points, because not all points can be tracked independently and accurately enough. This will be illustrated by the MU-based facial motion tracking algorithm described in Chapter 3.
2.3 Using MUs to Animate the Face Model
MUs have many good properties. First, MUs are learned from real data and encode the characteristics of facial deformations. Second, compared to the number of vertices on the face model, the number of MUs is much smaller. Since any 2D facial deformation can be represented by a linear weighted combination of MUs, we only need to adjust a few parameters in order to animate the face model. This dramatically reduces the complexity of face animation and makes MUs especially suitable for the facial motion tracking algorithm, which will be described in Chapter 3. In addition, MUs are orthogonal to each other. Therefore, it is computationally efficient to calculate MUPs for any facial deformation.
It will be shown below that the MU-based face animation technique is compatible with many existing face animation techniques. This is very important from both the academic research point of view and the industrial point of view. Key-frame techniques3 and the MPEG-4 face animation standard are widely used in existing face animation systems. We will show that only simple matrix operations are required to achieve (1) the conversions between MUPs and key-frame parameters, and (2) the conversions between MUPs and the MPEG-4 FAPs. The techniques that will be presented in Sections 2.3.1 and 2.3.2 enable users to clone facial motion while using different face models.
3 Only those key-frame techniques that use a linear combination of key frames are considered here.
The major advantage of MUs over key frames and MPEG-4 FAPs is that MUs are grounded in real facial movements. More precisely, the spatial and temporal characteristics of real face animation can be encoded in MUs and MUP sequences, which are derived from real facial movements. Key frames do contain detailed spatial facial deformation information; however, there is no theoretical guidance for temporally adjusting the key-frame parameters to achieve natural face animation. The MPEG-4 face animation standard has the same problem.
2.3.1 MU and key frame
Key-frame approaches animate face models by linearly combining a set of key frames, say $\alpha$ key frames. Without loss of generality, we assume that each key frame represents a facial deformation. We can select a set of facial shapes $\{\vec{k}_1, \ldots, \vec{k}_\alpha\}$ in the training data set of MUs, so that there is a correspondence between $\{\vec{k}_1, \ldots, \vec{k}_\alpha\}$ and the key frames. Since those key frames usually correspond to a set of meaningful facial shapes (e.g., laughing, smiling, visemes, and so on), it is easy to choose $\{\vec{k}_1, \ldots, \vec{k}_\alpha\}$ from the training set of MUs. A facial deformation $\vec{d}$ in the animation sequence can be represented as $\sum_{i=1}^{\alpha} a_i \vec{k}_i$, and $\vec{d}$ can also be represented by MUs as $\sum_{i=1}^{K} c_i \vec{m}_i$. The conversion between $c_i$ and $a_i$ can be easily achieved by

$$[c_1 \cdots c_K]^T = [\vec{m}_1 \cdots \vec{m}_K]^T [\vec{k}_1 \cdots \vec{k}_\alpha] [a_1 \cdots a_\alpha]^T \tag{2.2}$$

and

$$[a_1 \cdots a_\alpha]^T = (\Gamma^T \Gamma)^{-1} \Gamma^T [\vec{m}_1 \cdots \vec{m}_K] [c_1 \cdots c_K]^T, \quad \text{where } \Gamma = [\vec{k}_1 \cdots \vec{k}_\alpha]. \tag{2.3}$$
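A minimal sketch of the conversions in Eqs. (2.2) and (2.3), assuming the MUs and the key-frame deformations are stored as matrix columns; it is not the iFACE implementation, only an illustration of the matrix operations involved.

```python
import numpy as np

def keyframe_to_mup(a, M, Gamma):
    """Eq. (2.2): c = M^T Gamma a.

    M     : (2n, K) columns are the MUs m_1..m_K.
    Gamma : (2n, alpha) columns are key-frame deformations k_1..k_alpha.
    a     : (alpha,) key-frame weights.
    """
    return M.T @ (Gamma @ a)

def mup_to_keyframe(c, M, Gamma):
    """Eq. (2.3): a = (Gamma^T Gamma)^(-1) Gamma^T M c, i.e. a least-squares fit
    of the MU-generated deformation by the key frames."""
    d = M @ c                                   # deformation expressed by the MUs
    a, *_ = np.linalg.lstsq(Gamma, d, rcond=None)
    return a
```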
Our iFACE system [32], which will be presented in Chapter 5, was first developed to use the key-frame technique to animate the face model because of the simplicity of the key-frame technique. Equation (2.3) enables us to easily modify the iFACE system to adopt MUs for face animation.
2.3.2 MU and MPEG-4 FAP
The MPEG-4 standard defines 68 FAPs. Among them, two FAPs are high-level parame-
ters (viseme and expression), and the others are low-level parameters that describe the
movements of facial features (see Figure 2.4) defined over jaw, lips, eyes, mouth, nose,
cheek, ears, and so on [92].
Figure 2.4 MPEG-4 feature points.
The movement represented by each FAP is defined with respect to a neutral face and is
expressed in terms of the FAP units (FAPUs) (Figure 2.5). The FAPUs correspond to
fractions of the distances between a set of salient face features, such as eye separation, mouth-nose separation, etc. These units are defined in order to allow a consistent interpretation of FAPs on any face model.
Figure 2.5 The facial animation parameter units.
The high-level parameters of MPEG-4 FAP describe visemes and expressions, but do not
provide temporal information. The low-level parameters of MPEG-4 FAP can represent
the temporal information by the change of values of the parameters. However, they only
describe the movements of 66 facial features and lack detailed spatial information to
animate the whole face model. Most systems use some form of interpolation function to animate the rest of the face model.
MUs are learned from real facial movements. The advantages of MUs over MPEG-4 FAP
include the following: (1) MUs encode detailed spatial information for animating a face
model, and (2) real facial movements can be easily encoded as MUPs using Eq. (2.1) for
animating face models.
If the values of MUPs are known, the facial deformation can be calculated. Consequently,
the movements of facial features used by MPEG-4 FAPs can be calculated. It is then
straightforward to calculate the values of MPEG-4 FAPs. On the other hand, if the values
of MPEG-4 FAPs are known, we can calculate MUPs in the following way. First, the movements of the facial features are calculated. The concatenation of the facial feature movements forms a vector $\vec{p}$. Then, we form a set of vectors $\{\vec{f}_1, \vec{f}_2, \ldots, \vec{f}_K\}$ by extracting the elements corresponding to those facial features from the MUs $\{\vec{m}_1, \ldots, \vec{m}_K\}$; let $\vec{f}_0$ denote the elements extracted in the same way from $\vec{m}_0$. The vector elements of $\{\vec{f}_1, \vec{f}_2, \ldots, \vec{f}_K\}$ and those of $\vec{p}$ are arranged so that the information of the facial features is represented in the same order in the vectors. The MUPs can then be calculated by

$$[c_1 \cdots c_K]^T = (F^T F)^{-1} F^T (\vec{p} - \vec{f}_0), \quad \text{where } F = [\vec{f}_1 \cdots \vec{f}_K]. \tag{2.4}$$
The markers must include those facial feature points used by MPEG-4 FAP to enable the
conversion between MUPs and FAPs. The facial movement defined by the MPEG-4 FAPs must be a valid, natural facial deformation, because MUs are learned from natural facial movements. If the user intends to use MPEG-4 FAPs to describe exaggerated facial deformations, MUPs can still be obtained in the least-squares sense; however, the facial shape reconstructed using MUs and MUPs may exhibit undesired artifacts.
2.4 Discussions
Equations (2.2), (2.3), and (2.4) are independent of the dimension of the MUs. They can be applied to 2D or 3D MUs. Interestingly, 2D MUs can be used to infer 3D facial deformation using Eq. (2.3). The conversion between MUPs and key-frame parameters is based on high-level concept correspondences; more exactly, the correspondence is established at the level of the whole face rather than at the level of individual facial points. Though only 2D information is used in Eq. (2.3), the results enable 3D facial deformation if the key frames are 3D, because the facial deformation is finally expressed as a weighted combination of the key frames. In other words, Eq. (2.3) provides a way to infer 3D facial deformation information from 2D information.
Note that the conversions defined by Eqs. (2.2), (2.3), and (2.4) are not lossless. Instead, they try to preserve as much information as possible with respect to the mean square error between the facial shape under the original representation and the one under the new representation to which the original representation is converted. Since MUs are designed to preserve the variances of real facial deformations as much as possible, the overall information preserved by using MUs as the representation should be the closest to the real facial deformations. This is also the reason MUs, rather than key frames or MPEG-4 FAPs, are used as the quantitative visual representation for facial deformations in this thesis. However, not all the shapes of MUs are visually meaningful to human beings; therefore, it might be difficult for some users to use MUs directly. Such users can start with key frames, which are more visually meaningful, and use the above conversion techniques while building the face modeling and animation system.
CHAPTER 3
3 MU-BASED FACIAL MOTION TRACKING
In this chapter, we present and discuss the MU-based facial motion tracking algorithm.
As stated earlier in this thesis, the tracking results can be used directly for face animation or for training speech-driven face animation. The MU-based facial motion tracking algorithm covers the cheeks, which is very important for synthesizing visual speech.
3.1 Model Initialization
The MU-based tracking algorithm requires that the face be in its neutral position in the
first image frame. The tracking algorithm uses a generic mesh, which is the one shown in Figure 2.1(b) but without the two points on the glasses. Choosing this mesh makes it possible for the tracking algorithm to utilize the MUs learned in Chapter 2. The generic mesh has two vertices representing the two mouth corners. The user manually selects the two mouth corners in the image, and the generic mesh model is fitted to the face by scaling and rotation so that the mouth corner vertices of the mesh coincide with the selected points. An example of mouth corner selection and mesh model fitting is shown in Figure 3.1.
Figure 3.1 Model initialization for tracking.
(a) Select two mouth corners (b) Fitting results
3.2 Tracking as a Weighted Least Square Fitting Problem
The MU-based facial motion tracking algorithm consists of two steps. The first step is a
low-level image processing step, which conducts modelless tracking for the vertices of
the mesh. The potential locations of the feature points in the next image are calculated
separately by zero-mean normalized cross-correlation template matching [27]. Template
matching techniques can handle gradual changes of lighting. However, the template
matching results are usually noisy. In the second step, the high-level knowledge encoded
in MUs is added to constrain the results calculated in the first step. The template match-
ing results are converted into MUPs and global face motion parameters, which are high-
level control information and can be used directly for MU-compatible face animation.
The MU-based tracking algorithm is a 2D facial motion tracking algorithm because it
uses a 2D MU model. The algorithm further assumes that the geometric imaging model of the face in the camera is an affine model. This assumption holds when the following two conditions are fulfilled: (a) the distance between the camera and the face is much larger than the depth of the face (the size of the face along the viewing direction of the camera); and (b) the face does not undergo large rotation out of the image plane.
3.2.1 Modelless tracking
First, the template matching step tracks vertex i of the mesh by tracking its corresponding facial point according to the coordinates of vertex i. The facial point is also denoted as point i for convenience. The term $\Gamma(s_i^{t-1}, s_{i,j}^{t})$ is defined as the zero-mean normalized cross-correlation operator, where $s_i^{t-1}$ is the template of facial point i in the image frame at time t-1, facial point j is one of the candidate points of point i in the image frame at time t, and $s_{i,j}^{t}$ is the template of point j. According to the definition of zero-mean normalized cross-correlation, we have $-1 \le \Gamma(s_i^{t-1}, s_{i,j}^{t}) \le 1$. The similarity between point i and point j is defined as

$$\varpi(i, j) = \exp\bigl(\Gamma(s_i^{t-1}, s_{i,j}^{t}) - 1\bigr) \tag{3.1}$$
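A small sketch of the similarity measure of Eq. (3.1), assuming the templates are grayscale patches stored as NumPy arrays; the helper names are hypothetical.

```python
import numpy as np

def zncc(patch_a, patch_b):
    """Zero-mean normalized cross-correlation of two equally sized patches."""
    a = patch_a.astype(float) - patch_a.mean()
    b = patch_b.astype(float) - patch_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def similarity(patch_a, patch_b):
    """Eq. (3.1): w = exp(Gamma(.,.) - 1), so w = 1 only for a perfect match."""
    return np.exp(zncc(patch_a, patch_b) - 1.0)
```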
The template matching step searches locally around point i in the next image frame. Usually, the search range is a w × h window centered at point i. The point i* that best matches point i under the criterion defined by Eq. (3.1) is selected, i.e., $i^* = \arg\max_j \varpi(i, j)$. The coordinate vector of point i* in the image plane at time t is denoted as $\vec{v}_i^{(t)} = [x_i^{(t)}, y_i^{(t)}]^T$. Correspondingly, the similarity between i* and i is denoted by $w_i^{(t)}$. The information that is fed into the next step consists of $\vec{v}_i^{(t)}$ and $w_i^{(t)}$.
3.2.2 Constrained by MUs
MUs are used to constrain the information obtained in the first step. Mathematically, the tracking problem can then be formulated as the minimization problem

$$(\vec{T}^{*(t)}, \vec{C}^{*(t)}) = \arg\min_{\vec{T},\vec{C}} \sum_{i=1}^{n} w_i^{(t)} \left\| \Psi(M\vec{C} + \vec{\zeta})_i - \vec{v}_i^{(t)} \right\|^2 = \arg\min_{\vec{T},\vec{C}} \sum_{i=1}^{n} w_i^{(t)} \left\| \begin{bmatrix} t_1 & t_2 \\ t_4 & t_5 \end{bmatrix} \begin{bmatrix} x_i^{(0)} + \sum_{p=0}^{K} c_p m_{p,i1} \\ y_i^{(0)} + \sum_{p=0}^{K} c_p m_{p,i2} \end{bmatrix} + \begin{bmatrix} t_3 \\ t_6 \end{bmatrix} - \begin{bmatrix} x_i^{(t)} \\ y_i^{(t)} \end{bmatrix} \right\|^2 \tag{3.2}$$
where:
(a) n is the number of vertices on the mesh model.
(b) $\Psi(\bullet)$ is the affine transformation function, whose parameter set $\vec{T} = [t_1, t_2, t_3, t_4, t_5, t_6]^T$ describes the global 2D rotation, scaling, and translation transformations of the face. $\Psi(\bullet)_i$ denotes the image coordinate vector of the ith vertex after being transformed by $\Psi(\bullet)$.
(c) $M = [\vec{m}_0\ \vec{m}_1 \cdots \vec{m}_K]$ and $\vec{m}_p = [m_{p,11}\ m_{p,12} \cdots m_{p,n1}\ m_{p,n2}]^T$ (p = 0, …, K). $[m_{p,i1}\ m_{p,i2}]^T$ represents the deformation characteristics of vertex i encoded in $\vec{m}_p$.
(d) $\vec{C} = [c_0\ c_1 \cdots c_K]^T$ is the MUP vector and $c_0, c_1, \ldots, c_K$ are the MUPs. Since $\vec{m}_0$ is the mean deformation, $c_0$ is a constant and is always equal to 1.
(e) $\vec{\zeta} = [x_1^{(0)}\ y_1^{(0)} \cdots x_n^{(0)}\ y_n^{(0)}]^T$ represents the concatenation of the coordinates of the vertices at their initial positions (the neutral position) in the image plane.
(f) $\Psi(M\vec{C} + \vec{\zeta})_i$ represents the plausible coordinate of vertex i in the manifold defined by the MUs with respect to $\vec{T}$ and $\vec{C}$.
The unknown parameter set consists of $\vec{T} = [t_1, t_2, t_3, t_4, t_5, t_6]^T$ and $\vec{C} = [c_1, \ldots, c_K]^T$. The intuition of Eq. (3.2) is to find a set of motion parameters that minimizes the weighted mean square error between the template matching results and the instance generated by the high-level knowledge model using the parameters $\vec{T}$ and $\vec{C}$.
After rearranging matrix elements, Eq. (3.2) can be rewritten as

$$\vec{q}^* = \arg\min_{\vec{q}} \left\| A\vec{q} - \vec{b} \right\|^2 = \arg\min_{\vec{q}} \left\| [A_0\ A_1 \cdots A_K\ W]\,\vec{q} - \vec{b} \right\|^2 \tag{3.3}$$

where the unknowns are collected into

$$\vec{q} = [t_1\ t_2\ t_4\ t_5\ \ c_1 t_1\ c_1 t_2\ c_1 t_4\ c_1 t_5\ \cdots\ c_K t_1\ c_K t_2\ c_K t_4\ c_K t_5\ \ t_3\ t_6]^T,$$

and, for each vertex i (i = 1, …, n), the two rows of the blocks are

$$A_0:\ \begin{bmatrix} w_i^{(t)}(x_i^{(0)}+m_{0,i1}) & w_i^{(t)}(y_i^{(0)}+m_{0,i2}) & 0 & 0 \\ 0 & 0 & w_i^{(t)}(x_i^{(0)}+m_{0,i1}) & w_i^{(t)}(y_i^{(0)}+m_{0,i2}) \end{bmatrix},$$

$$A_p:\ \begin{bmatrix} w_i^{(t)} m_{p,i1} & w_i^{(t)} m_{p,i2} & 0 & 0 \\ 0 & 0 & w_i^{(t)} m_{p,i1} & w_i^{(t)} m_{p,i2} \end{bmatrix} \quad (K \ge p \ge 1),$$

$$W:\ \begin{bmatrix} w_i^{(t)} & 0 \\ 0 & w_i^{(t)} \end{bmatrix}, \qquad \vec{b} = [w_1^{(t)} x_1^{(t)}\ \ w_1^{(t)} y_1^{(t)}\ \cdots\ w_n^{(t)} x_n^{(t)}\ \ w_n^{(t)} y_n^{(t)}]^T.$$

The least squares estimator can be used to solve for $\vec{q}$ in Eq. (3.3). It is easy to first recover $t_1, t_2, t_3, t_4, t_5, t_6$ from $\vec{q}$, and then calculate $c_1, \ldots, c_K$.
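The following sketch illustrates one update of the constrained tracking step, assembling the matrix of Eq. (3.3) as reconstructed above and solving it with a least squares estimator. It is a simplified illustration, not the thesis code; in particular, the way the MUPs are recovered from the bilinear terms of $\vec{q}$ (projection onto the recovered affine parameters) is one reasonable choice among several.

```python
import numpy as np

def track_step(v, w, s0, m0, mus):
    """One MU-constrained tracking update (Eqs. (3.2)-(3.3)).

    v   : (n, 2) template-matching results (x_i, y_i) at time t.
    w   : (n,)  similarity weights w_i.
    s0  : (n, 2) neutral vertex positions in the image plane.
    m0  : (n, 2) mean deformation, reshaped per vertex.
    mus : (K, n, 2) motion units, reshaped per vertex.
    Returns the affine parameters (t1..t6) and the MUPs c_1..c_K.
    """
    n, K = v.shape[0], mus.shape[0]
    base = s0 + m0                                # x_i^(0)+m_{0,i1}, y_i^(0)+m_{0,i2}
    A = np.zeros((2 * n, 4 * (K + 1) + 2))
    b = np.zeros(2 * n)
    for i in range(n):
        wx, wy = w[i] * base[i, 0], w[i] * base[i, 1]
        A[2 * i, 0:2] = [wx, wy]                  # block A0, x-row
        A[2 * i + 1, 2:4] = [wx, wy]              # block A0, y-row
        for p in range(K):                        # blocks A1..AK
            col = 4 * (p + 1)
            mx, my = w[i] * mus[p, i, 0], w[i] * mus[p, i, 1]
            A[2 * i, col:col + 2] = [mx, my]
            A[2 * i + 1, col + 2:col + 4] = [mx, my]
        A[2 * i, -2] = w[i]                       # block W multiplies [t3, t6]
        A[2 * i + 1, -1] = w[i]
        b[2 * i], b[2 * i + 1] = w[i] * v[i, 0], w[i] * v[i, 1]
    q, *_ = np.linalg.lstsq(A, b, rcond=None)
    t1, t2, t4, t5 = q[0:4]
    t3, t6 = q[-2:]
    # recover each c_p from its bilinear block, e.g. by projecting onto [t1,t2,t4,t5]
    tvec = np.array([t1, t2, t4, t5])
    c = np.array([q[4 * (p + 1):4 * (p + 2)] @ tvec / (tvec @ tvec)
                  for p in range(K)])
    return (t1, t2, t3, t4, t5, t6), c
```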
3.3 Improving the MU-based Facial Motion Tracking Algorithm
The facial points that correspond to the vertices of the mesh may not have good texture
properties for tracking. This kind of point is called a bad feature point. It is very difficult
to accurately track those bad points. If there are many bad points being tracked, the tem-
plate-matching step will generate highly corrupted information. High-quality tracking
results cannot be guaranteed if the highly corrupted information is fed into the second
step and is combined with the information provided by the high-level knowledge model.
The error will accumulate very fast and lead to losing track quickly.
A heuristic method is used to improve the algorithm. The purpose is to make the information calculated in the low-level image processing step more reliable. For each facial point being tracked, an image pixel with good texture properties is selected from a 3 × 3 window centered at the mesh vertex. The selected good image pixels are tracked across two consecutive frames. Since the selected good image pixel is very close to its vertex in the image (the maximum distance is one pixel), we can assume that the spatial relation between them remains unchanged across two consecutive frames. Thus, we assign the displacement of the good image pixel to its corresponding mesh vertex. The remainder of the calculation is exactly the same as in Eq. (3.2).
The Kanade-Lucas-Tomasi (KLT) feature tracker is used to select and track good image
pixels. KLT was originally proposed by Lucas and Kanade [46] and was further devel-
oped by Tomasi and Kanade [78]. Readers should refer to [72] for details. Good features
are selected by examining the minimum eigenvalue of each 2 × 2 gradient matrix. Image
pixels are tracked using a Newton-Raphson procedure to minimize the difference be-
tween two images. Multiresolution tracking allows for large displacements between im-
ages. The accuracy of the tracking is up to the subpixel level. Intel’s Microprocessor Re-
search Lab implements the KLT tracker and makes it publicly available in OpenCV.1
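A sketch of this heuristic using the modern OpenCV Python bindings rather than the original Intel OpenCV release referenced above: the pixel with the largest minimum eigenvalue of the gradient matrix is chosen near each vertex and tracked with pyramidal Lucas-Kanade, and its displacement is assigned to the vertex. The function names and window handling here are illustrative assumptions, not the thesis implementation.

```python
import cv2
import numpy as np

def pick_good_pixels(gray, vertices, win=3):
    """For each mesh vertex, pick the pixel inside a win x win window whose
    2x2 gradient matrix has the largest minimum eigenvalue (Shi-Tomasi score).
    Assumes gray is an 8-bit image and vertices lie away from the borders."""
    eig = cv2.cornerMinEigenVal(gray, 3)          # min-eigenvalue image
    half = win // 2
    picked = []
    for (x, y) in np.round(vertices).astype(int):
        region = eig[y - half:y + half + 1, x - half:x + half + 1]
        dy, dx = np.unravel_index(np.argmax(region), region.shape)
        picked.append((x - half + dx, y - half + dy))
    return np.float32(picked)

def vertex_displacements(prev_gray, next_gray, vertices):
    """Track the selected good pixels with pyramidal Lucas-Kanade and assign
    each pixel's displacement to its mesh vertex."""
    pts = pick_good_pixels(prev_gray, vertices).reshape(-1, 1, 2)
    nxt, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    return (nxt - pts).reshape(-1, 2)             # fed into Eq. (3.2) as before
```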
3.4 Experimental Results2
In Figure 3.2, given the same initialization and image sequence, the performances of
three methods – the MU-based facial motion tracking algorithm, template matching using
zero-mean normalized correlation, and KLT tracker – are compared. The testing video is
captured using a Panasonic AG-7450 portable video cassette recorder and is digitized at
30 frames per second (fps). All three methods are implemented in C++ on Windows 2000
and run at 30+ fps. The tracking results are shown as the white mesh overlapping on the
face. The images are shown from left to right as time increases.
The images at the top, middle, and bottom rows of Figure 3.2 are the tracking results of
template matching, the KLT tracker, and MU-based tracking algorithms respectively. As
is illustrated, the error of template matching accumulates quickly and eventually makes
tracking fail. The KLT tracker works better while still losing track of some points, such
1 The software is found at http://www.intel.com/research/mrl/research/opencv/.
2 The video clips showing the tracking results can be found at: http://www.ifp.uiuc.edu/~hong/Research/mouth_tracking.htm.
as some points in the cheeks. The MU-based tracking algorithm works best because MUs
provide good constraints and make tracking more robust.
(a) The 60th frame (b) The 160th frame (c) The 226th frame (d) The 280th frame
Figure 3.2 Comparison of the tracking results on an unmarked face using the MU-based facial motion tracking algorithm, template matching, and the KLT tracker. Only the tracking results of some typical mouth shapes in the test sequence are shown. The top row shows the tracking results using template matching alone. The middle row shows the tracking results using the KLT tracker alone. The bottom row shows the tracking results of the MU-based facial motion tracking algorithm.
In the sequence shown in Figure 3.3, the face has global 3D motion, which can be noticed by using the glasses as a reference. The face also has small motion into and out of the image plane. Because the MU-based tracking algorithm assumes affine projection, it can handle this kind of case. However, handling large global 3D motion requires 3D MUs and a perspective projection assumption.
All three tracking methods are applied to a face image sequence, in which the face is
marked. The markers provide ground truth that can be used for comparison. The initializations of all three methods are the same. The vertices are manually and carefully placed at
the centers of the markers. The tracking results are compared in Figure 3.3 (see page 38).
The images in the top, middle and bottom rows of Figure 3.3 are the tracking results of
template matching, the KLT tracker, and the MU-based tracking algorithm respectively.
As the results show, even on the face with markers, which provide salient features, tem-
plate matching does not work well. This is because some templates change greatly and
suddenly while the facial surface deforms. The KLT tracker works much better, but still
loses track of some points and results in irregular local structures. This can be observed
by looking at the tracking results on the upper lip. Again, the MU-based tracking algo-
rithm works best because it uses MUs to adjust bad tracking results.
Besides being robust, the MU-based tracking algorithm decomposes the observed motion into global affine transformation parameters and MUPs, which can be used directly for face animation; this is beyond the capability of the template matching technique and the KLT tracker.
3.5 Discussions
So far, we have focused only on the lower face because the MUs are currently designed to cover the lower face only. The same method can be extended to track the whole face by expanding the MUs to cover the whole face. Moreover, the algorithm described in Section 3.2 and the proposed algorithms that will be described in Sections 3.6, 3.7, and 3.8 are very general and can be applied to other objects, for example, the human body. The MUs are learned from the training data of an individual. To achieve better generalization performance, MUs should be learned from the training data of multiple subjects of different age ranges and races.
If 3D MUs are available, we can modify the 2D MU-based tracking algorithm described
in Section 3.2 and get a 3D MU-based tracking algorithm which has the same calculation
procedure and similar forms of the equations. The new 3D MU-based facial motion track-
ing algorithm is theoretically described in Section 3.6. It can handle both the global 3D
motion of the face and the local 3D facial motions. The theory of a 3D MU-based track-
ing algorithm using multiple cameras is also developed and described in Section 3.7. Us-
ing multiple cameras will capture more information, which can be used to make the
tracking algorithm more robust.
Figure 3.3 Comparison of the tracking results on a marked face using the MU-based facial motion tracking algorithm, template matching, and the KLT tracker. Only some typical tracking results in the test sequence are shown. The top row shows the tracking results using the MU-based facial motion tracking algorithm. The middle row shows the tracking results using template matching alone. The bot-tom row shows the tracking results using the KLT tracker alone.
(a) The 36th frame (b) The 67th frame (c) The 108th frame (d) The 170th frame
The low-level image processing of our MU-based facial motion tracking is different from
B-spline based approaches [6], [12], [37], Snake based approaches [9], [36], [77], or de-
formable template approach [88]. While the approaches in [6], [9], [12], [36], [37], [77], and [88] rely on color segmentation, gradients, or edges, which are sensitive to lighting conditions and depend on the color properties of the subjects, we select good feature points for reliable tracking. Good feature points enable us to track the movements of
cheeks, where edges can hardly be found and color segmentation will fail. The high-level
knowledge of our approach is also different from theirs. While the approaches described
in [9] and [36] learn high-level knowledge from real data, other approaches [6], [12],
[37], [77], [88] define high-level knowledge subjectively.
The MU-based facial motion tracking algorithm only models shape information. ASM
[17], AAM [52], and eigen-points [19] model both shape and appearance. Appearance, as
another image cue, provides extra information. The question is how to model and use it.
To handle various lighting conditions, the texture part of the training data should cover
broad enough conditions. To collect training data for texture, the face of the subject can-
not be marked. Thus extra care has to be taken while collecting the training data for
ASM, AAM, and eigen-points, so that the landmark points selected in different image
frames are physically the same. If the correspondences cannot be guaranteed, the training
data is biased.
The low-level image processing method used in our approach is similar to those used in
[21], [22], [25], [44], and [75]. The approaches described in [21], [22], [25], [44], and [75] are 3D facial motion tracking algorithms, whereas the MU-based facial motion tracking algorithm is a 2D approach. However, the high-level knowledge used in [21], [22], [25], [44], and [75] is either hand-tuned or subjectively defined in the form of some functions, while MUs are learned from real facial deformation data. If 3D real facial deformation training data are available, we can extend our approach to 3D facial motion tracking.
The high-level knowledge used by Basu et al. [3] and that used in our approach are
learned from real data. There are four main differences between these two approaches.
The first difference is the facial area covered by the tracking algorithm. The MU-based tracking algorithm covers the cheeks, which are very important for both face animation and human speech perception, whereas the approach described in [3] only deals with the lips. Second, the low-level image processing steps are different: Basu et al. [3] use color information to distinguish the lips from other areas of facial skin, while we track feature points. Third, the approach proposed by Basu and Pentland tracks 3D lip motion; currently, MU is 2D, so MU-based facial motion tracking can only track 2D facial motion under the affine projection assumption. Finally, Basu et al. [3] use a complicated physics-based lip model, whereas we use a simple geometric mesh to directly model the face surface without any control model and thus have lower computational complexity.
3.6 3D MU-based Facial Motion Tracking
Assuming 3D MUs and an affine camera model [56], the mathematical representation of the tracking problem using one camera can be written as

$$(\vec{T}^{*(t)}, \vec{C}^{*(t)}) = \arg\min_{\vec{T},\vec{C}} \sum_{i=1}^{n} w_i^{(t)} \left\| \Psi(M\vec{C} + \vec{\zeta})_i - \vec{v}_i^{(t)} \right\|^2 = \arg\min_{\vec{T},\vec{C}} \sum_{i=1}^{n} w_i^{(t)} \left\| \begin{bmatrix} t_{11} & t_{12} & t_{13} \\ t_{21} & t_{22} & t_{23} \end{bmatrix} \begin{bmatrix} \varphi_{i1} + \sum_{p=0}^{K} c_p m_{p,i1} \\ \varphi_{i2} + \sum_{p=0}^{K} c_p m_{p,i2} \\ \varphi_{i3} + \sum_{p=0}^{K} c_p m_{p,i3} \end{bmatrix} + \begin{bmatrix} t_{14} \\ t_{24} \end{bmatrix} - \begin{bmatrix} x_i^{(t)} \\ y_i^{(t)} \end{bmatrix} \right\|^2 \tag{3.4}$$
where:
(a) n is the number of vertices on the mesh model.
(b) $\Psi(\bullet)$ is the projection function of the affine camera. Its parameters $t_{11}, t_{12}, t_{13}, t_{14}, t_{21}, t_{22}, t_{23}, t_{24}$ describe the global 3D rotation, scaling, and translation transformations of the face. $\Psi(\bullet)_i$ denotes the coordinate vector of the ith vertex after being transformed by $\Psi(\bullet)$.
(c) $M = [\vec{m}_0\ \vec{m}_1 \cdots \vec{m}_K]$ and $\vec{m}_p = [m_{p,11}\ m_{p,12}\ m_{p,13} \cdots m_{p,n1}\ m_{p,n2}\ m_{p,n3}]^T$ (p = 0, …, K). $[m_{p,i1}\ m_{p,i2}\ m_{p,i3}]^T$ represents the deformation characteristics of vertex i encoded in $\vec{m}_p$.
(d) $\vec{C} = [c_0\ c_1 \cdots c_K]^T$ is the MUP vector and $c_0, c_1, \ldots, c_K$ are the MUPs. Since $\vec{m}_0$ is the mean deformation, $c_0$ is a constant and is always equal to 1.
(e) $\vec{\zeta} = [\varphi_{11}\ \varphi_{12}\ \varphi_{13} \cdots \varphi_{n1}\ \varphi_{n2}\ \varphi_{n3}]^T$ represents the concatenation of the coordinates of the vertices at their initial positions (the neutral position) relative to the camera, and $[\varphi_{i1}\ \varphi_{i2}\ \varphi_{i3}]^T$ is the coordinate vector of vertex i at its neutral position. In contrast to the 2D MU-based facial motion tracking algorithm, 3D MU-based tracking requires that the initialization of the face model be done in 3D instead of in the 2D image plane. This can be done if the camera is calibrated and the actual size of the face is known.
(f) $\Psi(M\vec{C} + \vec{\zeta})_i$ represents the plausible coordinate of vertex i in the manifold defined by the MUs with respect to $\vec{T}$ and $\vec{C}$.
The unknown parameter set consists of $\vec{T} = [t_{11}\ t_{12}\ t_{13}\ t_{14}\ t_{21}\ t_{22}\ t_{23}\ t_{24}]^T$ and $\vec{C} = [c_1, \ldots, c_K]^T$.
Eq. (3.4) can be rewritten as

$$\vec{q}^* = \arg\min_{\vec{q}} \left\| A\vec{q} - \vec{b} \right\|^2 = \arg\min_{\vec{q}} \left\| [A_{01}\ A_{02}\ A_{11}\ A_{12} \cdots A_{K1}\ A_{K2}\ W]\,\vec{q} - \vec{b} \right\|^2 \tag{3.5}$$

where the unknowns are collected into $\vec{q} = [\vec{q}_0^T\ \vec{q}_1^T \cdots \vec{q}_K^T\ \vec{q}_{K+1}^T]^T$ with

$$\vec{q}_0 = [t_{11}\ t_{12}\ t_{13}\ t_{21}\ t_{22}\ t_{23}]^T, \quad \vec{q}_p = [c_p t_{11}\ c_p t_{12}\ c_p t_{13}\ c_p t_{21}\ c_p t_{22}\ c_p t_{23}]^T \ (K \ge p \ge 1), \quad \vec{q}_{K+1} = [t_{14}\ t_{24}]^T,$$

and, for each vertex i (i = 1, …, n), the two rows of the blocks are

$$A_{01}:\ \begin{bmatrix} w_i^{(t)}(\varphi_{i1}+m_{0,i1}) & w_i^{(t)}(\varphi_{i2}+m_{0,i2}) & w_i^{(t)}(\varphi_{i3}+m_{0,i3}) \\ 0 & 0 & 0 \end{bmatrix}, \quad A_{02}:\ \begin{bmatrix} 0 & 0 & 0 \\ w_i^{(t)}(\varphi_{i1}+m_{0,i1}) & w_i^{(t)}(\varphi_{i2}+m_{0,i2}) & w_i^{(t)}(\varphi_{i3}+m_{0,i3}) \end{bmatrix},$$

$$A_{p1}:\ \begin{bmatrix} w_i^{(t)} m_{p,i1} & w_i^{(t)} m_{p,i2} & w_i^{(t)} m_{p,i3} \\ 0 & 0 & 0 \end{bmatrix}, \quad A_{p2}:\ \begin{bmatrix} 0 & 0 & 0 \\ w_i^{(t)} m_{p,i1} & w_i^{(t)} m_{p,i2} & w_i^{(t)} m_{p,i3} \end{bmatrix} \quad (K \ge p \ge 1),$$

$$W:\ \begin{bmatrix} w_i^{(t)} & 0 \\ 0 & w_i^{(t)} \end{bmatrix}, \qquad \vec{b} = [w_1^{(t)} x_1^{(t)}\ \ w_1^{(t)} y_1^{(t)}\ \cdots\ w_n^{(t)} x_n^{(t)}\ \ w_n^{(t)} y_n^{(t)}]^T.$$
A least squares estimator can be used to solve Eq. (3.5) for $\vec{q}$. It is then easy to obtain $\vec{T} = [t_{11}\ t_{12}\ t_{13}\ t_{14}\ t_{21}\ t_{22}\ t_{23}\ t_{24}]^T$ and $\vec{C} = [c_1, \ldots, c_K]^T$ from $\vec{q}$. Given a calibrated camera, the pose of the face can be calculated from $\vec{T}$.
The forms of Eqs. (3.4) and (3.5) are similar to those of Eqs. (3.2) and (3.3). Therefore, the program written for the 2D MU-based tracking algorithm can be easily modified for the 3D MU-based tracking algorithm.
Though perspective projections give accurate models for a wide range of existing cam-
eras, the mapping from an object point to the image point is nonlinear. In order to make
the projection model more mathematically tractable, affine cameras are used. The affine
camera is a first-order approximation obtained from the Taylor expansion of the perspec-
tive camera model. If the affine camera model is used, the mapping from an object point to the image point is linear. The affine camera assumption works well when the size of the face is much smaller than the distance between the head and the camera. If the affine camera is calibrated, we can recover the true 3D facial motions.
3.7 3D MU-based Facial Motion Tracking Using Multiple Cameras
If multiple synchronized cameras are used, Eq. (3.4) can be easily modified to take ad-
vantage of the information captured by those cameras. The details are as follows.

$$(\vec{T}^{*(t)}, \vec{C}^{*(t)}) = \arg\min_{\vec{T},\vec{C}} \sum_{j=1}^{Q} \sum_{i=1}^{n} w_{j,i}^{(t)} \left\| \Psi_j(M\vec{C} + \vec{\zeta}_j)_i - \vec{v}_{j,i}^{(t)} \right\|^2 = \arg\min_{\vec{T},\vec{C}} \sum_{j=1}^{Q} \sum_{i=1}^{n} w_{j,i}^{(t)} \left\| \begin{bmatrix} t_{j,11} & t_{j,12} & t_{j,13} \\ t_{j,21} & t_{j,22} & t_{j,23} \end{bmatrix} \begin{bmatrix} \varphi_{j,i1} + \sum_{p=0}^{K} c_p m_{p,i1} \\ \varphi_{j,i2} + \sum_{p=0}^{K} c_p m_{p,i2} \\ \varphi_{j,i3} + \sum_{p=0}^{K} c_p m_{p,i3} \end{bmatrix} + \begin{bmatrix} t_{j,14} \\ t_{j,24} \end{bmatrix} - \begin{bmatrix} x_{j,i}^{(t)} \\ y_{j,i}^{(t)} \end{bmatrix} \right\|^2 \tag{3.6}$$
where:
(a) Q is the number of cameras, n is the number of vertices on the mesh model, and the index j denotes the camera.
(b) $\Psi_j(\bullet)$ is the projection function of affine camera j. Its parameters $t_{j,11}, t_{j,12}, t_{j,13}, t_{j,14}, t_{j,21}, t_{j,22}, t_{j,23}, t_{j,24}$ describe the global 3D rotation, scaling, and translation transformations of the face with respect to camera j. $\Psi_j(\bullet)_i$ denotes the coordinate vector of the ith vertex after being transformed by $\Psi_j(\bullet)$ with respect to camera j.
(c) $M = [\vec{m}_0\ \vec{m}_1 \cdots \vec{m}_K]$ and $\vec{m}_p = [m_{p,11}\ m_{p,12}\ m_{p,13} \cdots m_{p,n1}\ m_{p,n2}\ m_{p,n3}]^T$ (p = 0, …, K). $[m_{p,i1}\ m_{p,i2}\ m_{p,i3}]^T$ represents the deformation characteristics of vertex i encoded in $\vec{m}_p$.
(d) $\vec{C} = [c_0\ c_1 \cdots c_K]^T$ is the MUP vector and $c_0, c_1, \ldots, c_K$ are the MUPs. Since $\vec{m}_0$ is the mean deformation, $c_0$ is a constant and is always equal to 1.
(e) $\vec{\zeta}_j = [\varphi_{j,11}\ \varphi_{j,12}\ \varphi_{j,13} \cdots \varphi_{j,n1}\ \varphi_{j,n2}\ \varphi_{j,n3}]^T$ represents the concatenation of the coordinates of the vertices at their initial positions (the neutral position) relative to camera j, and $[\varphi_{j,i1}\ \varphi_{j,i2}\ \varphi_{j,i3}]^T$ is the coordinate vector of vertex i at its neutral position relative to camera j.
(f) $\Psi_j(M\vec{C} + \vec{\zeta}_j)_i$ represents the plausible coordinate of vertex i in the manifold defined by the MUs with respect to $\vec{T}$, $\vec{C}$, and camera j.
The unknown parameter set consists of $\vec{T} = [t_{1,11}\ t_{1,12}\ t_{1,13}\ t_{1,14}\ t_{1,21}\ t_{1,22}\ t_{1,23}\ t_{1,24} \cdots t_{Q,11}\ t_{Q,12}\ t_{Q,13}\ t_{Q,14}\ t_{Q,21}\ t_{Q,22}\ t_{Q,23}\ t_{Q,24}]^T$ and $\vec{C} = [c_1, \ldots, c_K]^T$.
The same rearrangement used for Eqs. (3.2) and (3.4) can be applied to Eq. (3.6), which can be rewritten as

$$\vec{q}^* = \arg\min_{\vec{q}} \left\| A\vec{q} - \vec{b} \right\|^2 = \arg\min_{\vec{q}} \left\| [A_0\ A_1 \cdots A_K\ W]\,\vec{q} - \vec{b} \right\|^2 \tag{3.7}$$

where:
(a) $A_0 = [A_{0,11}\ A_{0,12} \cdots A_{0,Q1}\ A_{0,Q2}]$ and $A_p = [A_{p,11}\ A_{p,12} \cdots A_{p,Q1}\ A_{p,Q2}]$ ($K \ge p \ge 1$). For camera j ($Q \ge j \ge 1$) and vertex i, the two rows of the blocks are

$$A_{0,j1}:\ \begin{bmatrix} w_{j,i}^{(t)}(\varphi_{j,i1}+m_{0,i1}) & w_{j,i}^{(t)}(\varphi_{j,i2}+m_{0,i2}) & w_{j,i}^{(t)}(\varphi_{j,i3}+m_{0,i3}) \\ 0 & 0 & 0 \end{bmatrix}, \quad A_{0,j2}:\ \begin{bmatrix} 0 & 0 & 0 \\ w_{j,i}^{(t)}(\varphi_{j,i1}+m_{0,i1}) & w_{j,i}^{(t)}(\varphi_{j,i2}+m_{0,i2}) & w_{j,i}^{(t)}(\varphi_{j,i3}+m_{0,i3}) \end{bmatrix},$$

$$A_{p,j1}:\ \begin{bmatrix} w_{j,i}^{(t)} m_{p,i1} & w_{j,i}^{(t)} m_{p,i2} & w_{j,i}^{(t)} m_{p,i3} \\ 0 & 0 & 0 \end{bmatrix}, \quad A_{p,j2}:\ \begin{bmatrix} 0 & 0 & 0 \\ w_{j,i}^{(t)} m_{p,i1} & w_{j,i}^{(t)} m_{p,i2} & w_{j,i}^{(t)} m_{p,i3} \end{bmatrix},$$

and the rows contributed by camera j multiply only the parameters of camera j (the remaining entries are zero).
(b) $W$ collects the weights that multiply the translation parameters; its two rows for camera j and vertex i are $[w_{j,i}^{(t)}\ \ 0]$ and $[0\ \ w_{j,i}^{(t)}]$ in the columns of $[t_{j,14}\ t_{j,24}]$ and zero elsewhere.
(c) $\vec{b} = [w_{1,1}^{(t)} x_{1,1}^{(t)}\ \ w_{1,1}^{(t)} y_{1,1}^{(t)} \cdots w_{Q,n}^{(t)} x_{Q,n}^{(t)}\ \ w_{Q,n}^{(t)} y_{Q,n}^{(t)}]^T$ is the concatenation of the weighted template matching results over all cameras and vertices.
(d) $\vec{q} = [\vec{q}_0^T\ \vec{q}_1^T \cdots \vec{q}_K^T\ \vec{q}_{K+1}^T]^T$ with
$$\vec{q}_0 = [t_{1,11}\ t_{1,12}\ t_{1,13}\ t_{1,21}\ t_{1,22}\ t_{1,23} \cdots t_{Q,11}\ t_{Q,12}\ t_{Q,13}\ t_{Q,21}\ t_{Q,22}\ t_{Q,23}]^T,$$
$$\vec{q}_p = [c_p t_{1,11}\ c_p t_{1,12}\ c_p t_{1,13}\ c_p t_{1,21}\ c_p t_{1,22}\ c_p t_{1,23} \cdots c_p t_{Q,11}\ c_p t_{Q,12}\ c_p t_{Q,13}\ c_p t_{Q,21}\ c_p t_{Q,22}\ c_p t_{Q,23}]^T \quad (K \ge p \ge 1),$$
$$\vec{q}_{K+1} = [t_{1,14}\ t_{1,24} \cdots t_{Q,14}\ t_{Q,24}]^T.$$

A least squares estimator can be used to solve Eq. (3.7) for $\vec{q}$; it is then easy to obtain $\vec{T}$ and $\vec{C}$ from $\vec{q}$.
Again, the forms of Eqs. (3.6) and (3.7) are similar to those of Eqs. (3.2) and (3.3). Therefore, the program written for the 2D MU-based tracking algorithm can also be modified for the 3D MU-based tracking algorithm using multiple cameras without major changes to its structure.
3.8 3D MU-BSV-based Facial Motion Tracking
The tracking algorithms presented in Sections 3.2, 3.6, and 3.7 require an accurate face model (or the facial shape in its neutral state). Here, a new algorithm with looser constraints is presented. It assumes that any face model $\vec{\varsigma}$ can be obtained by

$$\vec{\varsigma} = \vec{\zeta} + \sum_{e=1}^{E} h_e \vec{\gamma}_e \tag{3.8}$$

where:
(a) $\vec{\zeta} = [\varphi_{11}\ \varphi_{12}\ \varphi_{13} \cdots \varphi_{n1}\ \varphi_{n2}\ \varphi_{n3}]^T$ represents the concatenation of the coordinates of the vertices at their initial positions (the neutral position) relative to the camera, and $[\varphi_{i1}\ \varphi_{i2}\ \varphi_{i3}]^T$ is the coordinate vector of vertex i at its neutral position. In contrast to the previous two 3D MU-based facial motion tracking algorithms, $\vec{\zeta}$ in Eq. (3.8) is an initial guess obtained by warping a generic face model.
(b) The warped generic face model may not suit the face of the subject well. However, its shape can be adjusted using a set of basic shape variances (BSVs) of the face through $\sum_{e=1}^{E} h_e \vec{\gamma}_e$. $\{\vec{\gamma}_e\}_{e=1}^{E}$ is the set of BSVs, which can be learned from real 3D facial shape data, for example, by applying PCA to a set of 3D neutral face shapes (faces without deformations).
(c) The parameter set $\{h_e\}_{e=1}^{E}$ is unknown and is adjusted during tracking. Let $\vec{\gamma}_e = [r_{e,11}\ r_{e,12}\ r_{e,13} \cdots r_{e,n1}\ r_{e,n2}\ r_{e,n3}]^T$.
The new algorithm will use MUs and the BSVs. Therefore, it is called the 3D MU-BSV-
based facial motion tracking algorithm.
Eq. (3.4) can be modified to use the basic face shapes as follows:

$$(\vec{T}^{*(t)}, \vec{C}^{*(t)}, \vec{H}^{*(t)}) = \arg\min_{\vec{T},\vec{C},\vec{H}} \sum_{i=1}^{n} w_i^{(t)} \left\| \Psi\Bigl(M\vec{C} + \vec{\zeta} + \sum_{e=1}^{E} h_e\vec{\gamma}_e\Bigr)_i - \vec{v}_i^{(t)} \right\|^2 = \arg\min_{\vec{T},\vec{C},\vec{H}} \sum_{i=1}^{n} w_i^{(t)} \left\| \begin{bmatrix} t_{11} & t_{12} & t_{13} \\ t_{21} & t_{22} & t_{23} \end{bmatrix} \begin{bmatrix} \varphi_{i1} + \sum_{p=0}^{K} c_p m_{p,i1} + \sum_{e=1}^{E} h_e r_{e,i1} \\ \varphi_{i2} + \sum_{p=0}^{K} c_p m_{p,i2} + \sum_{e=1}^{E} h_e r_{e,i2} \\ \varphi_{i3} + \sum_{p=0}^{K} c_p m_{p,i3} + \sum_{e=1}^{E} h_e r_{e,i3} \end{bmatrix} + \begin{bmatrix} t_{14} \\ t_{24} \end{bmatrix} - \begin{bmatrix} x_i^{(t)} \\ y_i^{(t)} \end{bmatrix} \right\|^2 \tag{3.9}$$
The unknown parameter set consists of $\vec{C} = [c_1, \ldots, c_K]^T$, $\vec{H} = [h_1 \cdots h_E]^T$, and $\vec{T} = [t_{11}\ t_{12}\ t_{13}\ t_{14}\ t_{21}\ t_{22}\ t_{23}\ t_{24}]^T$.
Eq. (3.9) can be rewritten as

$$\vec{q}^* = \arg\min_{\vec{q}} \left\| A\vec{q} - \vec{b} \right\|^2 = \arg\min_{\vec{q}} \left\| [A_{01}\ A_{02}\ A_{11}\ A_{12} \cdots A_{K1}\ A_{K2}\ B_1 \cdots B_E\ W]\,\vec{q} - \vec{b} \right\|^2 \tag{3.10}$$

where the blocks $A_{01}$, $A_{02}$, $A_{p1}$, $A_{p2}$ ($K \ge p \ge 1$), $W$, and the vector $\vec{b}$ are defined exactly as in Eq. (3.5). The block $B_e$ ($E \ge e \ge 1$) collects the coefficients of the BSV terms; its two rows for vertex i are

$$B_e:\ \begin{bmatrix} w_i^{(t)} r_{e,i1} & w_i^{(t)} r_{e,i2} & w_i^{(t)} r_{e,i3} & 0 & 0 & 0 \\ 0 & 0 & 0 & w_i^{(t)} r_{e,i1} & w_i^{(t)} r_{e,i2} & w_i^{(t)} r_{e,i3} \end{bmatrix}.$$

The unknowns are collected into $\vec{q} = [\vec{q}_0^T\ \vec{q}_1^T \cdots \vec{q}_K^T\ \vec{q}_{K+1}^T \cdots \vec{q}_{K+E}^T\ \vec{q}_{K+E+1}^T]^T$ with

$$\vec{q}_0 = [t_{11}\ t_{12}\ t_{13}\ t_{21}\ t_{22}\ t_{23}]^T, \quad \vec{q}_p = [c_p t_{11}\ c_p t_{12}\ c_p t_{13}\ c_p t_{21}\ c_p t_{22}\ c_p t_{23}]^T \ (K \ge p \ge 1),$$
$$\vec{q}_{K+e} = [h_e t_{11}\ h_e t_{12}\ h_e t_{13}\ h_e t_{21}\ h_e t_{22}\ h_e t_{23}]^T \ (E \ge e \ge 1), \quad \vec{q}_{K+E+1} = [t_{14}\ t_{24}]^T.$$
A least squares estimator can be used to solve Eq. (3.10) for $\vec{q}$. It is then easy to obtain $\vec{T} = [t_{11}\ t_{12}\ t_{13}\ t_{14}\ t_{21}\ t_{22}\ t_{23}\ t_{24}]^T$, $\vec{H} = [h_1 \cdots h_E]^T$, and $\vec{C} = [c_1, \ldots, c_K]^T$ from $\vec{q}$. Given a calibrated camera, the pose of the face can be calculated from $\vec{T}$.
The 3D MU-BSV-based facial motion tracking algorithm can be easily generalized to use
multiple cameras by using the same method described in Section 3.7.
CHAPTER 4
4 MU-BASED REAL-TIME SPEECH-DRIVEN
FACE ANIMATION
The facial motion analysis and synthesis techniques described in previous chapters pave
the way to achieve MU-based real-time speech-driven face animation. The MU-based
facial tracking algorithm is used to analyze the facial motions of a speaking subject. The
analysis results and the synchronized soundtrack can be collected to train audio-to-visual
mappings. In this chapter, two audio-to-visual mappings are presented and evaluated.
One of them is a local linear mapping. The other is a nonlinear mapping using neural
networks. Both methods consider a certain length of speech context and have a constant, short time delay.
4.1 Linear Audio-to-Visual Mapping
Linear mapping [87] assumes that the information in one channel can be computed from that in another channel by an affine transformation. In the case of audio-to-visual mapping, it can be written as

$$\vec{v}_n - \vec{\mu}_v = T_{va}(\vec{a}_n - \vec{\mu}_a) + \vec{e} \tag{4.1}$$

where $\vec{v}_n$ is the visual feature vector at time n, $\vec{a}_n$ is the audio feature vector at time n, $\vec{\mu}_v$ and $\vec{\mu}_a$ are the mean vectors of the visual and audio features respectively, $T_{va}$ is the affine transformation, and $\vec{e}$ is the error term, which represents the part of $(\vec{v}_n - \vec{\mu}_v)$ that is not correlated with $(\vec{a}_n - \vec{\mu}_a)$. The transformation can be estimated as

$$T_{va} = E[(\vec{v}_n - \vec{\mu}_v)(\vec{a}_n - \vec{\mu}_a)^T]\,\bigl(E[(\vec{a}_n - \vec{\mu}_a)(\vec{a}_n - \vec{\mu}_a)^T]\bigr)^{-1} \tag{4.2}$$

The resulting estimate of $(\vec{v}_n - \vec{\mu}_v)$ given by Eq. (4.2) is a minimum variance unbiased estimator [89].
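A compact sketch of Eqs. (4.1) and (4.2) in Python/NumPy, assuming the audio and visual feature vectors are stacked as rows of two matrices; names are illustrative.

```python
import numpy as np

def fit_linear_mapping(A, V):
    """Estimate T_va of Eqs. (4.1)-(4.2).

    A : (N, da) audio feature vectors (rows).
    V : (N, dv) visual feature vectors (rows).
    Returns (T_va, mu_a, mu_v) so that v is approximated by mu_v + T_va (a - mu_a).
    """
    mu_a, mu_v = A.mean(axis=0), V.mean(axis=0)
    Ac, Vc = A - mu_a, V - mu_v
    cov_va = Vc.T @ Ac / len(A)                 # E[(v - mu_v)(a - mu_a)^T]
    cov_aa = Ac.T @ Ac / len(A)                 # E[(a - mu_a)(a - mu_a)^T]
    T_va = cov_va @ np.linalg.pinv(cov_aa)      # Eq. (4.2)
    return T_va, mu_a, mu_v

def predict_visual(a, T_va, mu_a, mu_v):
    """Eq. (4.1) without the error term."""
    return mu_v + T_va @ (a - mu_a)
```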
4.2 Local Linear Audio-to-Visual Mapping
Eq. (4.1) does not consider the contextual information of the audio. However, mouth coarticulation depends on the audio context. The length of the relevant context depends on the content of the audio and on the subject who speaks it, so it is difficult to decide on a particular value for the context length. A practical way to take contextual information into account is to replace $\vec{a}_n$ with

$$\vec{a}\,'_n = [\vec{a}_{n-\alpha}^T \cdots \vec{a}_{n-1}^T\ \ \vec{a}_n^T\ \ \vec{a}_{n+1}^T \cdots \vec{a}_{n+\beta}^T]^T \tag{4.3}$$

where $\vec{a}_n$ is the audio feature vector at time n, $\vec{a}_{n-\alpha}, \ldots, \vec{a}_{n-1}$ represent the audio history from time n-$\alpha$ to n-1, and $\vec{a}_{n+1}, \ldots, \vec{a}_{n+\beta}$ represent the audio look-ahead from time n+1 to n+$\beta$. $\alpha$ and $\beta$ are adjustable parameters.
Linear estimation is very computationally efficient. It is ideal for a system with limited
computational resources. However, audio-to-visual mapping is nonlinear in nature. The
performance of the global linear mapping defined by Eq. (4.1) is therefore limited. One way to improve it is to approximate the true audio-to-visual mapping by a set of linear mappings, i.e., a set of local linear mappings, each defined for a particular class of audio context.
As illustrated in Figure 4.1, the audio-visual training data is divided into 44 subsets ac-
cording to the audio feature of each sample. The audio features of each subset are mod-
eled by a Gaussian model. To divide the audio-visual training set, each audio-visual data
pair is classified into one of the 44 training subsets whose Gaussian model gives the
highest score for the audio component of the audio-visual data pair. Then, a linear audio-
to-visual mapping is calculated for each training subset using Eq. (4.1) and Eq. (4.3). The
reason that we choose 44 is based on a practical issue: our iFACE system uses a symbol
set that consists of 44 phonemes.
Figure 4.1 Local linear audio-to-visual mapping.
Since the variance of the data in each group is smaller than that of the whole training data set,
the complexity of the audio-to-visual mapping problem is dramatically reduced. Given a
new audio feature, we classify it into one of the classes using the trained Gaussian models
and select the corresponding linear mapping to estimate the visual feature vector.
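The following sketch outlines the local linear mapping, assuming class labels for the training samples are available (e.g., one of the 44 phoneme-like classes, obtained from a phoneme alignment); the Gaussian scoring is simplified to a Mahalanobis distance, and the names are illustrative rather than taken from the original implementation.

```python
import numpy as np

class LocalLinearMapping:
    """Sketch of the local linear audio-to-visual mapping: the training set is
    split into classes (44 in the thesis), each class gets a Gaussian model of
    its audio features and its own linear mapping of the form of Eq. (4.1)."""

    def fit(self, A, V, labels):
        # A: (N, da) contextual audio features, V: (N, dv) visual features (MUPs),
        # labels: (N,) class index of each training sample (assumed given).
        self.models = {}
        for k in np.unique(labels):
            Ak, Vk = A[labels == k], V[labels == k]
            mu_a, mu_v = Ak.mean(axis=0), Vk.mean(axis=0)
            Ac, Vc = Ak - mu_a, Vk - mu_v
            T = (Vc.T @ Ac) @ np.linalg.pinv(Ac.T @ Ac)      # Eq. (4.2)
            prec = np.linalg.pinv(np.cov(Ak, rowvar=False))  # Gaussian precision
            self.models[k] = (mu_a, mu_v, T, prec)
        return self

    def predict(self, a):
        # Pick the class whose Gaussian scores the audio feature best
        # (simplified here to the smallest Mahalanobis distance).
        def dist(k):
            mu_a, _, _, prec = self.models[k]
            return (a - mu_a) @ prec @ (a - mu_a)
        k = min(self.models, key=dist)
        mu_a, mu_v, T, _ = self.models[k]
        return mu_v + T @ (a - mu_a)                         # Eq. (4.1)
```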
4.3 Nonlinear Audio-to-Visual Mapping Using ANN
The mapping from audio features to visual features is nonlinear by nature. To achieve better estimation results, a nonlinear mapping should be used when enough computational resources are available. The nonlinear relation between audio features and visual features is complicated, and there is no known analytic expression for it. Multilayer perceptrons (MLPs), as universal nonlinear function approximators, are used to learn the nonlinear audio-to-visual mapping. In contrast to the approach in [50], the training data is divided into 44 subsets in the way described in Section 4.2, and a three-layer perceptron is trained on each training subset.
The structure of the MLP is shown in Figure 4.2. The input of the MLP is the audio feature vector taken at α+β+1 consecutive time frames (α backward, the current, and β forward time windows). The output of the MLP is a visual feature vector. The estimation procedure is similar to that of the local linear mapping, except that an MLP is selected instead of a linear mapping.
Figure 4.2 MLP for nonlinear audio-to-visual mapping.
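As an illustration only, the following sketch trains one small MLP per audio class using scikit-learn's MLPRegressor as a stand-in for the MATLAB Neural Network Toolbox used in the thesis; the hyperparameter values echo those reported in Section 4.4.2, and the function names are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_local_mlps(A_ctx, V, labels, hidden_units=25, max_iter=150):
    """Train one small MLP per audio class (44 in the thesis).

    A_ctx  : (N, (alpha+beta+1)*da) contextual audio features, Eq. (4.3).
    V      : (N, K) target visual features (MUPs).
    labels : (N,) class index of each sample.
    """
    mlps = {}
    for k in sorted(set(labels)):
        mlp = MLPRegressor(hidden_layer_sizes=(hidden_units,),
                           max_iter=max_iter, tol=1e-4)
        mlps[k] = mlp.fit(A_ctx[labels == k], V[labels == k])
    return mlps

def predict_mups(a_ctx, label, mlps):
    """Estimate the visual feature vector for one contextual audio frame."""
    return mlps[label].predict(a_ctx.reshape(1, -1))[0]
```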
4.4 Experimental Results
4.4.1 Collect training and testing data
A set of raw video data is collected by recording the front view of a speaking subject. A
set of markers that is the same as that shown in Figure 2.1 (see page 18) is put on the face
of the subject. Therefore, the ground truth can be extracted. The video is captured using a
Panasonic AG-7450 portable video cassette recorder. One hundred sentences are selected
from the text corpus of the DARPA TIMIT speech database. Both the audio and video are
digitized at 30 fps using Final Cut Pro software for Macintosh from Apple Computer In-
corporated. Overall, there are 19433 audio-visual training data samples. Eighty percent of
the data is used for training. The rest is used for testing.
For each speech segment, twelve LPC coefficients are calculated as the audio features.
MUs are used to explain facial deformations. Correspondingly, MUPs are used as the
visual features. The MU-based facial motion tracking algorithm, which is described in
Chapter 3, is used to analyze facial motions.1 This data set is used for training both the local linear audio-to-visual mapping and the nonlinear audio-to-visual mapping using ANN. The first five MUs are used.2 The normalized mean square error of the data reconstructed using the first five MUs is 0.0200.3
1 Although the face is marked, we cannot simply use the template matching technique or the KLT tracker to track the markers; the reason is shown by the experimental results in Section 3.4. The MU-based facial motion tracking algorithm is still required.
4.4.2 Implementation
A triangular averaging window is used to smooth the jerky mapping results of both the local linear audio-to-visual mapping and the nonlinear audio-to-visual mapping using ANN. The implementation of the MLP is provided by the Neural Network Toolbox of MATLAB 5.0 from The MathWorks. In the experiments, the maximum number of hidden units used in the MLPs is only 25; therefore, both training and estimation have very low computational complexity. The training of each MLP stops when either the maximum number of iterations (150) is reached or a preset mean square error threshold (0.005) is met.
The author also tried using a single MLP to handle all the training data. However, the training process took too long: it ran for three weeks on an SGI machine with 12 processors and 2 GB of memory, and the quality of the intermediate results was far from acceptable.
4.4.3 Evaluation
We reconstruct the displacement of the mesh vertices using MUs and the estimated
MUPs. The evaluations are performed on the ground truth of the displacements and the
reconstructed displacements. Two evaluation parameters, Pearson product-moment corre-
lation coefficient and the normalized mean square error, are calculated.
• Pearson product-moment correlation coefficient
The Pearson product-moment correlation coefficient between the ground truth and the estimated data is calculated by

$$R = \frac{tr\bigl(E[(\vec{d}_n - \vec{\mu}_1)(\vec{d}\,'_n - \vec{\mu}_2)^T]\bigr)}{\sqrt{tr\bigl(E[(\vec{d}_n - \vec{\mu}_1)(\vec{d}_n - \vec{\mu}_1)^T]\bigr)\; tr\bigl(E[(\vec{d}\,'_n - \vec{\mu}_2)(\vec{d}\,'_n - \vec{\mu}_2)^T]\bigr)}} \tag{4.4}$$

where $\vec{d}_n$ is the ground truth, $\vec{\mu}_1 = E(\vec{d}_n)$, $\vec{d}\,'_n$ is the estimation result, and $\vec{\mu}_2 = E(\vec{d}\,'_n)$.
2 More MUs can be used to achieve better results. However, more MUs cause higher computational complexity, especially for the nonlinear mapping using neural networks.
3 The displacement of each vertex is scaled to [-1.0, 1.0] by dividing it by the maximum displacement of the vertex.
The Pearson product-moment correlation measures how well the shapes of two signal sequences match globally. The larger the Pearson correlation coefficient, the better the estimated signal sequence matches the original signal sequence.
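A short sketch of the two evaluation measures of this section, assuming the ground-truth and estimated displacement sequences are stored as rows of two matrices; it follows the reconstruction of Eq. (4.4) above.

```python
import numpy as np

def pearson_and_mse(D_true, D_est):
    """Evaluation measures of Section 4.4.3 for displacement sequences.

    D_true, D_est : (N, 2n) ground-truth and estimated displacement vectors,
    already scaled per vertex to [-1, 1] as described in the text.
    """
    Ct = D_true - D_true.mean(axis=0)
    Ce = D_est - D_est.mean(axis=0)
    num = np.trace(Ct.T @ Ce / len(D_true))
    den = np.sqrt(np.trace(Ct.T @ Ct / len(D_true)) *
                  np.trace(Ce.T @ Ce / len(D_true)))
    pearson = num / den                          # Eq. (4.4)
    mse = np.mean((D_true - D_est) ** 2)         # normalized MSE
    return pearson, mse
```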
Table 4.1 shows the performance of the global linear mapping, the local linear mapping, and the local nonlinear mapping in terms of the Pearson coefficient. As shown, the local nonlinear mapping works best and the global linear mapping works worst.
Table 4.1 Real-time speech-driven evaluation I: Pearson correlation coefficient with respect to the ground truth.
Training Data Testing Data
Global linear mapping 0.5171 0.4834
Local linear mapping 0.7022 0.6904
Nonlinear mapping using ANN 0.9140 0.8902
However, the training data is not exactly the ground truth; only the information contributed by the selected MUs is used. The Pearson coefficients are therefore recalculated by replacing $\vec{d}_n$ in Eq. (4.4) with the information preserved by the selected MUs. More exactly, $\vec{d}_n$ is replaced by $\vec{m}_0 + \sum_{i=1}^{4} (\vec{d}_n^T \vec{m}_i)\,\vec{m}_i$, which is called the biased ground truth.
The results are shown in Table 4.2. As is shown, the local linear audio-to-visual mapping
performs better than the global linear audio-to-visual mapping. The nonlinear mapping
using artificial neural networks works best.
Table 4.2 Real-time speech-driven evaluation II: Pearson correlation coefficient with respect to the biased ground truth.
Training Data Testing Data
Global linear mapping 0.6137 0.6056
Local linear mapping 0.8081 0.7896
Nonlinear mapping using ANN 0.9797 0.9685
• Normalized MSE
The displacement vector $\vec{d}_n$ is normalized in the following way: the displacement of each vertex is scaled to [-1.0, 1.0] by dividing it by the maximum displacement of that vertex. The MSE of every audio-visual mapping with respect to the ground truth is shown in Table 4.3.
Table 4.3 Real-time speech-driven evaluation III: normalized MSE with respect to the ground truth.
Training Data Testing Data
Global linear mapping 0.0411 0.0456
Local linear mapping 0.0321 0.0342
Nonlinear mapping using ANN 0.0218 0.0221
The normalized MSE of each audio-to-visual mapping method with respect to the biased
ground truth is also calculated and shown in Table 4.4. The evaluation results based on
MSE index also show that the local linear audio-to-visual mapping performs better than
the global linear audio-to-visual mapping. The nonlinear audio-to-visual mapping using
neural networks works best.
Table 4.4 Real-time speech-driven evaluation IV: normalized MSE with respect to the biased ground truth.
Training Data Testing Data
Global linear mapping 0.0295 0.0310
Local linear mapping 0.0166 0.0183
Nonlinear mapping using ANN 0.0025 0.0029
4.4.4 A speech-driven face animation example
Section 4.4.3 reports the experimental results in aggregate. In this section, to give the reader a direct visual impression of the estimation results, a typical example is selected and shown in detail.
The text of the selected audio track is “Don’t ask me to carry an oil rag like that.” The
global linear mapping, the local linear mapping and the nonlinear mapping using neural
networks are used to estimate the visual feature sequence for the audio track. The results
are shown in Figures 4.3, 4.4, and 4.5, respectively. In those figures, the values of the estimated MUPs are shown as trajectories versus time. Four trajectories are shown in each figure; they correspond to the coefficient trajectories of the MUs $\vec{m}_1$, $\vec{m}_2$, $\vec{m}_3$, and $\vec{m}_4$. The horizontal axes of the figures represent time, and the vertical axes represent the magnitude of the MUPs. The solid red lines represent the goal trajectories. The
The trajectories estimated by the local linear mapping are closer to the goal trajectories
than those estimated by the global linear mapping. The trajectories estimated by the
nonlinear mapping using neural networks are the closest to the goal trajectories.
Figure 4.3 The estimation results of the global linear mapping (trajectories of c1-c4).
Figure 4.4 The estimation results of the local linear mapping (trajectories of c1-c4).
Figure 4.5 The estimation results of the nonlinear mapping using neural networks (trajectories of c1-c4).
CHAPTER 5
5 THE IFACE SYSTEM
This chapter describes the iFACE system [32]. The system provides functionalities for face modeling and face animation, and it serves as a research platform for the integrated framework (Figure 1.1, page 16). The system is also being used by other researchers to carry out research on human perception of synthetic talking faces. Based on the iFACE system, we won fourth place in the V. Dale Cozad Business Plan Competition 2000.
5.1 Introduction
The iFACE system takes the CyberwareTM scanner data of a subject’s head as input and
allows the user to interactively fit a generic face model to the CyberwareTM scanner data.
The iFACE system uses the key frame technique for text driven face animation and off-
line speech-driven face animation. The real-time speech driven function of the iFACE
system is based on the techniques described in Chapters 2 and 4.
5.2 Generic Face Model
The generic face model (Figure 5.1) used in the iFACE system was originally purchased from Viewpoint Corporation1 and was later modified by adding a tongue model and a teeth model [90]. The head model includes nearly all the head components, such as the face, eyes, teeth, ears, and so on, and consists of 2240 vertices and 2946 triangles. The surfaces of the components are approximated by triangular meshes. The tongue component is modeled by a Non-Uniform Rational B-Splines (NURBS) model with 63 control points.
1 Source: http://www.viewpoint.com.
One advantage of the polygonal model is that the surface deformation can be computed much faster than with physics-based models.
Figure 5.1 The generic geometric face model: (a) shown as wire-frame; (b) shown as shaded.
5.3 Customize the Face Model
The iFACE system enables the user to customize the generic face model for an individ-
ual. The iFACE system adopts an approach similar to that of [41]. Both methods take the
CyberwareTM cyberscanner data of a subject as the input, ask the user to manually select
some feature points, and warp the generic model to fit the CyberwareTM cyberscanner
data. The differences between them include the definitions of the feature point set and the
ways to warp the generic model.
The laser head and the laser sensor of the CyberwareTM scanner rotate 360 degrees around the subject, who must keep still for a few seconds. The sensor captures the laser light reflected from the surface of the head and measures both the 3D range information and the texture information of the surface. The range data is a map that records the distance from the laser sensor to points on the head surface. The texture data is a reflectance image of the laser beam from the head surface. Both the range map and the texture data are represented in cylindrical coordinates with a resolution of 512 in longitude (representing 0-360 degrees) and 512 in latitude. Figure 5.2 shows a pair of range data and texture data, which are unfolded as 2D images.2
Figure 5.2 An example of the CyberwareTM cyberscanner data: (a) texture data; (b) range data.
A coarse-to-fine approach is used to fit the generic face model to the range data. A coarse model (Figure 5.3) is built by manually selecting 101 vertices from the generic face
model. The coarse model is a triangular mesh and consists of 164 triangles. The fitting
procedure asks the user to manually select 31 landmarks on the texture map of the Cy-
berwareTM cyberscanner data. The coarse model is first warped to fit the range data. The
whole generic face model is then warped to fit the range data.
Figure 5.3 The coarse model in 2D cylindrical coordinate space.
2 This cyberscanner data is the head of Dr. Russell L. Storms from the Federal Army Research Laboratory.
Thirty-one vertices are defined among those 101 vertices of the coarse model. Those ver-
tices correspond to the facial landmarks, such as nose tip, eye corners, mouth corners,
chin, upper line of the neck, bottom line of the neck, and so on. These landmarks imply
the structure information of the head, such as the height and width of the head, the posi-
tions of the ears and eyebrows, the position of the neck, and so on. In the cylindrical co-
ordinate space, those feature points divide the facial surface into many local rectangular
regions (Figure 5.4).
Figure 5.4 The landmarks divide the head surface into many local rectangular regions in the cylindrical coordinate space. (a) The boundary of the outer rectangle represents the boundary of the range map; the range map is divided into rectangular regions whose corners are the landmarks and some points on the boundary. (b) The coarse model is drawn overlapping the regions in the cylindrical coordinate space.
The user manually selects those thirty-one feature points on the texture data. An example
is shown in Figure 5.5. The selected feature points have their correspondences in the
coarse model and provide the coordinates for the landmarks. Once the feature points are
selected, 2D local scaling in both the vertical and horizontal directions within each rectangular region is used to deform the coarse model in the cylindrical coordinate space.
Figure 5.5 Select feature points on the texture map.
In the cylindrical coordinate space, the coarse model triangulates the facial surface into
many local triangle patches. Each local triangle patch defines a local affine system. After
the coarse model is fitted, 2D local affine transformations are applied to warp the generic
model in the cylindrical coordinate space. The range values of the vertices are picked up
from the range map. The Cartesian coordinates of each vertex are then calculated from its
longitude, latitude, and range value.
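A minimal sketch of this conversion is given below, assuming the cylinder axis is the vertical y axis and that the latitude coordinate is already expressed as a height along that axis; the exact axis conventions and units of the CyberwareTM data are assumptions made for the example.

```python
import numpy as np

def cylindrical_to_cartesian(theta, height, rng):
    """Convert a cylindrical sample (longitude theta in radians, height along
    the cylinder axis, range = distance from the axis) to Cartesian (x, y, z).
    The cylinder axis is assumed to be the y axis."""
    x = rng * np.cos(theta)
    z = rng * np.sin(theta)
    y = height
    return np.array([x, y, z])

# One column of a 512 x 512 range map corresponds to one longitude value;
# column index 128 maps to a quarter turn around the head.
theta = 2.0 * np.pi * (128 / 512.0)
print(cylindrical_to_cartesian(theta, height=0.12, rng=0.095))
```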
The remaining head components (e.g., eyes, hair, ears, tongue, and teeth) are automati-
cally adjusted by shifting, rotating, and scaling. For example, the teeth model is shifted
according to the position of the feature point that represents the middle point of the lower
contour of the upper lip, and is scaled according to the width of the mouth decided by the
distance between the two mouth corners. Manual adjustments of the fitted model are required where range data are missing, since missing data lead to miscalculation of the sizes and positions of the head components. Figure 5.6 shows an example of a semi-finished face model after the automatic calculations. An interactive interface is developed for adjusting the model; the adjustment usually takes a few hours. Figure 5.7 shows some
examples of the face models after manual adjustment.
Figure 5.6 A semi-finished face model and the model editor.
Figure 5.7 Examples of the customized face model.
5.4 Face Deformation Control Model
A triangular control model (Figure 5.8) is defined by selecting a subset of vertices from
the generic face model. The control model consists of 115 vertices and 180 triangles. The
factors that govern the selection of control vertices include physiology (the distribution of
facial muscles) as well as various practical considerations relating to the topology of the
generic face model. Using the same interface (Figure 5.6) for adjusting the semi-finished
model, the user can move the control points.
Figure 5.8 The control model.
The face model is deformed in an affine-transformation-like manner. In the cylindrical coor-
dinate space, the control model triangulates the facial surface into many local triangle
patches, called control triangles. The vertices of the face model are distributed into those
triangle patches. When the shapes of the control triangles are changed, the coordinates of
other vertices are changed as shown in Figure 5.9.
Assume a control triangle <P1, P2, P3> is deformed to <P′1, P′2, P′3> in the cylindrical coordinate space (see Figure 5.9), where P′1, P′2, P′3 are the correspondences of P1, P2, P3, respectively. We calculate the cylindrical coordinates of the point P′ corresponding to a point P inside triangle <P1, P2, P3> by
$$
\begin{aligned}
t' &= \lambda_1 t'_1 + \lambda_2 t'_2 + \lambda_3 t'_3 \\
g' &= \lambda_1 g'_1 + \lambda_2 g'_2 + \lambda_3 g'_3 \\
r' &= r + \lambda_1 (r'_1 - r_1) + \lambda_2 (r'_2 - r_2) + \lambda_3 (r'_3 - r_3)
\end{aligned}
\qquad (5.1)
$$

where

$$
W = \begin{vmatrix} t_1 & t_2 & t_3 \\ g_1 & g_2 & g_3 \\ 1 & 1 & 1 \end{vmatrix}, \quad
\lambda_1 = \frac{1}{W}\begin{vmatrix} t & t_2 & t_3 \\ g & g_2 & g_3 \\ 1 & 1 & 1 \end{vmatrix}, \quad
\lambda_2 = \frac{1}{W}\begin{vmatrix} t_1 & t & t_3 \\ g_1 & g & g_3 \\ 1 & 1 & 1 \end{vmatrix}, \quad
\lambda_3 = \frac{1}{W}\begin{vmatrix} t_1 & t_2 & t \\ g_1 & g_2 & g \\ 1 & 1 & 1 \end{vmatrix}.
$$
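The sketch below transcribes Equation (5.1) directly, computing the barycentric coordinates in the (t, g) plane; the function and variable names are illustrative and assume the enclosing control triangle has already been found.

```python
import numpy as np

def warp_by_control_triangle(P, tri, tri_def):
    """Warp point P = (t, g, r) by the deformation of its enclosing control
    triangle, following Equation (5.1).  tri and tri_def are 3x3 arrays whose
    rows are (t_i, g_i, r_i) and (t'_i, g'_i, r'_i), respectively."""
    def det3(a, b, c):
        # Determinant of [[a_t, b_t, c_t], [a_g, b_g, c_g], [1, 1, 1]].
        return np.linalg.det(np.array([[a[0], b[0], c[0]],
                                       [a[1], b[1], c[1]],
                                       [1.0, 1.0, 1.0]]))
    p1, p2, p3 = tri
    W = det3(p1, p2, p3)
    lam = np.array([det3(P, p2, p3), det3(p1, P, p3), det3(p1, p2, P)]) / W
    t_new = lam @ tri_def[:, 0]                        # t' = sum(lambda_i t'_i)
    g_new = lam @ tri_def[:, 1]                        # g' = sum(lambda_i g'_i)
    r_new = P[2] + lam @ (tri_def[:, 2] - tri[:, 2])   # range updated by offsets
    return np.array([t_new, g_new, r_new])

# Example: the centroid of a control triangle whose range is lifted by 0.2
# moves up by the same amount.
tri     = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
tri_def = np.array([[0.0, 0.0, 1.2], [1.0, 0.0, 1.2], [0.0, 1.0, 1.2]])
print(warp_by_control_triangle(np.array([1.0 / 3, 1.0 / 3, 1.0]), tri, tri_def))
```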
Figure 5.9 Local affine transformation for facial surface deformation.
The model editor, shown in Figure 5.6, can be used to manually adjust the coordinates of
the control points. Figure 5.10 shows an example that uses the model editor to create a
facial expression.
Figure 5.10 Create facial shape using the model editor.
A library of facial shapes, which consists of expressions and visemes3 (Figure 5.11), is
created manually by an artist.
Figure 5.11 Examples of facial expressions and visemes. (a) smile, (b) disgust, (c) surprise, (d) laugh, (e) viseme ‘f’, (f) viseme ‘i’, (g) viseme ‘o’, and (h) viseme ‘e’.
5.5 Text Driven Face Animation
When text is used in communication, e.g., in the context of text-based electronic chatting
over the Internet or visual email, visual speech synthesized from text will greatly help
deliver information. Recent work on text driven face animation includes the work of
Cohen and Massaro [15], Ezzat and Poggio [26], and Waters and Levergood [83].
As far as face modeling and animation are concerned, these works differ from each other mainly in their face models and interpolation functions.
3 A viseme is a generic facial shape that serves to describe a particular sound. A viseme is the visual equivalent of a phoneme.
Cohen and Massaro [15] use a parametric geometric face model and Löfqvist's facial articulatory gesture model to calculate the parameters of the face model.
Ezzat and Poggio [26] use facial images directly. They collect a set of facial images that
correspond to visemes. Those images are used as key frames. The pixel correspondences
between two key frames are calculated using the optical flow technique developed by
Bergen and Hingorani [5]. The face animation is achieved by morphing between key
frames based on the correspondences calculated. They adopted the morphing technique
proposed by Beier and Neely [4].
Waters and Levergood [83] use a geometric face model. A set of facial shapes is manu-
ally edited. Those facial shapes correspond to visemes and are used as key frames during
the animation procedure. The facial shapes between two key frames are calculated by
morphing between two key frames. The morphing parameters, or the weights of the key
frames, are calculated by a linear or nonlinear transformation of time t. A physics-based
technique for calculating vertex displacements is also described.
Similar to the work of Ezzat and Poggio [26] and that of Waters and Levergood [83], the
iFACE system adopts the key frame based face animation technique for text driven face
animation. The procedure of the text driven face animation is illustrated in Figure 5.12.
The iFACE system uses the Microsoft Text-to-Speech (TTS) engine4 for text analysis and
speech synthesis. First, the text stream is fed into the TTS engine. TTS parses the text and
generates the corresponding phoneme sequence, the timing information of phonemes, and
the synthesized speech stream. Each phoneme is mapped to a viseme based on a lookup
table. Each viseme serves as a key frame; therefore, the text is translated into a key frame sequence. Face animation is done by the morphing technique described in [83].
4 Microsoft TTS is publicly available from the download page of Microsoft Corporation. The URL of the download page changes from time to time and is therefore not provided here. The user can search for it at http://www.microsoft.com.
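The following sketch illustrates the phoneme-to-viseme step, assuming the TTS engine returns (phoneme, start time, end time) triples; the lookup dictionary shows only a few of the 44 entries of Table 5.1, and each key frame is placed at one third of the phoneme's duration as described in the next paragraph.

```python
# A few entries of the phoneme-to-viseme lookup (see Table 5.1); the full
# table maps 44 phonemes to 17 viseme groups.
PHONEME_TO_VISEME = {"SIL": 17, "AE": 13, "N": 5, "M": 1, "EY": 8, "SH": 4}

def phonemes_to_keyframes(phonemes):
    """Map (phoneme, start, end) triples to (viseme, key_time) pairs, placing
    each key frame at one third of the phoneme's duration."""
    keyframes = []
    for phoneme, start, end in phonemes:
        viseme = PHONEME_TO_VISEME[phoneme]
        keyframes.append((viseme, start + (end - start) / 3.0))
    return keyframes

print(phonemes_to_keyframes([("SIL", 0.00, 0.10),
                             ("AE", 0.10, 0.22),
                             ("N", 0.22, 0.30)]))
```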
Each key frame is located at one third of the corresponding phoneme's duration. The facial deformations between two consecutive key frames are obtained by interpolation. The weights of the key frames are calculated by
$$\vec{f}_t = (1-\alpha)\,\vec{k}_{t_0} + \alpha\,\vec{k}_{t_1} \qquad (t_0 < t < t_1) \qquad (5.2)$$

where $\vec{f}_t$ is the facial deformation at time $t$, $\vec{k}_{t_0}$ and $\vec{k}_{t_1}$ are the two key frames represented as facial deformations, and $\alpha$ is calculated by

$$\alpha = \frac{1 - \cos\left(\pi \frac{t - t_0}{t_1 - t_0}\right)}{2} \qquad (5.3)$$
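A direct transcription of Equations (5.2) and (5.3), assuming key frames are stored as deformation vectors, might look as follows; the names are illustrative.

```python
import numpy as np

def interpolate_keyframes(k0, k1, t, t0, t1):
    """Blend two key-frame deformation vectors k0 (at time t0) and k1 (at time
    t1) at time t, using the cosine weight of Equation (5.3)."""
    alpha = (1.0 - np.cos(np.pi * (t - t0) / (t1 - t0))) / 2.0   # Eq. (5.3)
    return (1.0 - alpha) * k0 + alpha * k1                       # Eq. (5.2)

# Halfway between the key frames alpha = 0.5, so the two deformations mix evenly.
k0 = np.zeros(3)                # e.g., the neutral deformation
k1 = np.array([0.0, 0.4, 0.1])  # e.g., the deformation of viseme 'o'
print(interpolate_keyframes(k0, k1, t=0.5, t0=0.0, t1=1.0))
```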
Figure 5.12 The architecture of text driven face animation.
The facial deformations are added to the neutral facial shape to obtain the final facial shapes. Combined with a script of an expression sequence, we can synthesize an expressive talking head, for example a head that talks while nodding, blinking its eyes, raising its eyebrows,
and so on. The facial deformation of each facial expression is treated as an additive key
frame. This method is reasonable when the expression only involves upper face deforma-
tion. When the expression involves lower face deformation, such as smiling, disgust, and
so on, this method might cause artifacts.
The iFACE system uses a label set of 44 phonemes and 17 visemes. The phonemes and their viseme groups are shown in Table 5.1.
Table 5.1 Phonemes and visemes used in the iFACE system.
Phoneme  Word     Viseme group | Phoneme  Word      Viseme group
AA       cot      14           | IX       kisses    5
AE       bat      13           | JH       judge     4
AY       buy      8            | K        kit       5
AW       down     8            | LL       led       11
AO       bought   15           | EL       bottle    11
OY       boy      5            | M        mom       1
EH       bet      8            | N        nun       5
EY       bait     8            | NX       sing      8
AX       bird     7            | EN       button    5
IH       bit      8            | P        pop       1
IY       beat     10           | R        red       9
OW       boat     16           | YU       cute      12
UH       book     7            | S        sister    5
AH       but      8            | SH       shoe      4
UW       lute     6            | T        butter    5
B        bob      1            | TH       thief     3
CH       church   4            | V        verve     2
D        dad      5            | W        wet       6
DH       they     3            | Y        yet       5
F        fin      2            | Z        zoo       5
G        gag      5            | ZH       measure   4
HX       hay      8            | SIL      SILENCE   17
5.6 Off-Line Speech-Driven Face Animation
When human speech is used in one-way communication (e.g., news broadcasting over networks), an off-line speech-driven talking face is needed. The process of off-line speech-driven face animation in the iFACE system is illustrated in Figure 5.13. A speech stream
is first recognized into a phoneme sequence. The timing information of the phoneme se-
quence is also recorded. Once the phoneme sequence and the timing information are
given, the iFACE system animates the face model using the key frame technique, which
is the same as text driven face animation.
Figure 5.13 The architecture of off-line speech-driven face animation.
Recognizing phonemes from the speech signal alone requires a complicated continuous speech recognizer. Moreover, the phoneme recognition rate and the timing information of the phonemes may not be accurate enough. The text script associated with the speech provides
the accurate word-level transcription, which can be used to reduce the complexity of the
phoneme recognition problem and improve the recognition rate. The iFACE system uses
a phoneme recognition and alignment tool that comes with HTK 2.0 for UNIX.
An example of off-line speech-driven face animation sequence is shown in Figure 5.14.
The image sequence corresponds to the word “animation.” There are 70 frames in total.
Only the mouth region of every other frame is shown in Figure 5.14.
Figure 5.14 An example of off-line speech-driven face animation. The images are shown in order according to time. The time increases from left to right and from top to bottom.
5.7 Real-Time Speech-Driven Face Animation
So far, the iFACE system has used the key frame technique to animate the face model. Using the method described in Section 2.3.1, only a slight modification is required to enable the iFACE system to adopt MUs for face animation. In Chapter 4, two new real-time audio-to-
MUP mappings are presented. Combining the MU-based face animation and real-time
audio-to-MUP mapping, we can add real-time speech-driven face animation functionality
to the iFACE system. Figure 5.15 shows the synthesized image sequence of the word
“animation” using nonlinear real-time audio-to-MUP mapping. The speech segment of
the image sequence in Figure 5.14 is used.
Figure 5.15 An example of nonlinear real-time speech-driven face animation. The images are shown in order according to time. The time increases from left to right and from top to bottom. Only the mouth region of every other frame is shown.
Comparing Figures 5.14 and 5.15, it can be seen that the animation results of off-line
speech-driven face animation are smoother than those of real-time speech-driven face
animation with a constant short time delay. This can be noticed quickly by looking at
the last row of images in the two figures. This result is expected, because off-line speech-driven face animation has access to the whole speech context, while real-time speech-driven face animation uses only fixed-length contextual speech information.
5.8 The iFACE System in Distributed Collaborative Environments
The iFACE system was demonstrated on site at the Army Research Lab Symposium 2000 and the Army Research Lab Symposium 2001. Recently, a shoulder model
was added to the face model (Figure 5.16).
Figure 5.16 A shoulder model is added to the face model.
The iFACE system is used to support collaboration in a distributed environment, where
users are in different types of environments and use heterogeneous hardware platforms.
The collaborators are connected via wireless networks. Remote participants are repre-
sented as avatars in the system. The faces of the avatars are driven by speech.
Personnel at the central base are in charge of processing information from field personnel, reasoning, and planning. They use desktop PCs and see-through head-mounted displays (Figure 5.17(a)). The field personnel are mobile units responsible for providing the latest field information and executing plans. They use a vehicle-based mobile computing station called MIC3E (Figure 5.17(b)(c)).5 MIC3E has space for two persons. It is equipped with two Pentium III 500 MHz PCs with 128 MB of memory, running Windows NT 4.0. MIC3E has three displays: a 50 in Pioneer main screen and two 17 in desk screens. The users can switch the materials being displayed to
any of the three screens. Other mobile individuals are equipped with lightweight portable
devices such as laptops.
Figure 5.17 The iFACE system in a distributed collaborative environment. (a) Ava-tar in the head mounted display, (b) avatar in the desk screen of MIC3E, (c) avatar in the main screen of MIC3E.
In this distributed collaborative environment, our avatar system supports collaboration among users in heterogeneous conditions by providing an alternative to traditional video-based face-to-face interaction. The bandwidth saved by the avatar system can be used for transmitting other data.
5 MIC3E is built by Sytronics Inc.
CHAPTER 6
6 CONCLUSIONS AND FUTURE WORK
6.1 Summary
This dissertation describes an integrated framework for face modeling, facial motion
analysis and synthesis. The framework provides a systematic guideline for research on
face modeling and animation. The guideline contains the following steps.
The starting point is to select a quantitative visual representation for facial deformations. The visual representation should provide enough information for deforming face models and be suitable for explaining real facial deformations. In this thesis, the MU is adopted for modeling facial deformations. MUs are learned from a set of labeled real facial deformations; therefore, they encode the characteristics of facial deformations and are suitable for realistic face animation. An arbitrary facial deformation can be approximated by a linear combination of MUs weighted by MUPs. It is shown that the MU-based face animation technique is compatible with the key frame based animation technique and the MPEG-4 face animation standard.
Then, the visual representation is used in facial motion analysis. The analysis results can
be used directly for face animation. A real-time robust MU-based facial motion tracking
algorithm is presented. The tracking algorithm integrates low-level information, which is
obtained by optical flow calculation techniques, and high-level knowledge, which is rep-
resented by MUs. The tracking results are represented as an MUP sequence, which can be
immediately used for MU-compatible face animation techniques.
Human activities typically cohere across modalities, and this is certainly true for facial deformations and speech: the audio channel (speech) and the visual channel (the facial deformation sequence) are highly correlated. Given the facial deformation control model and the facial motion analysis tool, it is now possible to explore the quantitative association between the audio track and facial behavior. A set of videos of a speaking subject is collected. The
visual part of the video is processed by the MU-based facial motion tracking algorithm.
The results are represented as MUP sequences. The features of the audio tracks are calcu-
lated. Two real-time audio-to-visual mappings with constant short time delay are exam-
ined. One is a local linear mapping. The other is a local nonlinear mapping using MLP.
The framework is used to guide the development of a face modeling and animation sys-
tem, called iFACE [32]. The system provides functionalities for building a face model for any individual, text-driven face animation, and off-line and real-time speech-driven
face animation.
6.2 Future Research
Future research should be conducted to improve the framework to develop a highly lip-
readable synthetic talking face for human auditory-visual speech perception studies and
human face-to-face communication in noisy environments.
6.2.1 Explore better visual representation
Continuous endeavor is required to investigate the best visual representation of facial
movements. Currently, PCA is used to learn MUs. PCA is a second-order technique that
assumes the data has a Gaussian distribution. One of its advantages is that it requires only
classical matrix manipulations and thus is computationally and conceptually simple.
However, the second-order information of facial movements is not enough for developing a highly lip-readable synthetic talking head.
One possible improvement is to use independent component analysis (ICA) [33] for MU
learning. ICA is a higher-order technique and assumes non-Gaussianity of the data. ICA
tries to find a representation that minimizes the statistical dependence of the components
of the representation. ICA may better capture the structure of facial motion than PCA.
We shall evaluate the goodness-of-fit results based on both the mean-squared errors be-
tween the approximation and the ground truth and subjective tests of human perceivers.
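As an illustration of this proposed direction (not a component of the current iFACE system), the sketch below contrasts PCA and ICA bases learned from a hypothetical matrix of tracked facial deformations, using scikit-learn's FastICA; the data shape, the number of units, and the use of scikit-learn are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# Hypothetical training data: each row is one tracked facial deformation
# (concatenated displacements of the feature points) for one video frame.
deformations = np.random.randn(2000, 120)   # placeholder for real tracking data

n_units = 7
pca_mus = PCA(n_components=n_units).fit(deformations).components_
ica_mus = FastICA(n_components=n_units, random_state=0).fit(deformations).components_

# Both yield n_units basis vectors ("MUs"); a facial deformation is then
# approximated by a linear combination of these vectors.  The ICA basis
# minimizes statistical dependence between components rather than merely
# decorrelating them, which is the property argued for above.
print(pca_mus.shape, ica_mus.shape)
```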
Currently, MUs only cover the lower part of the face. Future work can extend MUs to
encode the 3D information of the whole face. The 3D facial deformation data for training
3D MUs can be synchronously captured by multiple camera systems, for example the Vi-
sion-1 system.1
6.2.2 Improve and evaluate the facial motion tracking algorithm
New MUs will require updating the MU-based facial motion tracking algorithm. If 3D MUs become available, a 3D MU-based facial motion tracking algorithm can be developed and implemented; the theory of such an algorithm is presented in Section 3.6.
Facial motion tracking using multiple cameras will help to increase the robustness of the
tracking algorithm. Multiple cameras are especially helpful for handling occlusions.
Those cameras should work synchronously. The theory of the 3D MU-based facial mo-
tion tracking algorithm is developed and presented in Section 3.7.
The tracking algorithms presented in Sections 3.2, 3.6, and 3.7 require an accurate face
model (or the facial shape at its neutral state). This makes them difficult to generalize be-
cause it is not easy to obtain an accurate enough model for any individual. A new track-
ing algorithm, called 3D MU-BSV-based facial motion tracking algorithm, is presented
in Section 3.8. In contrast to the previous two 3D MU-based facial motion tracking algo-
rithms, the face model of an individual is first guessed by warping a generic face model
based on some manually selected facial feature points. The warped generic face model
may not fit the subject's face well. However, its shape can be adjusted using a set
of BSVs of the face, which can be learned from real 3D face shape data by, for example,
PCA. The parameters of those basic facial shapes are unknown and will be adjusted dur-
ing tracking. Therefore, the tracking algorithm eventually estimates both the face model
and the parameters of face and facial motions.
Sections 3.6, 3.7, and 3.8 develop the theory for the above three tracking algorithms. No
experimental results are provided due to the following two practical issues: (1) 3D MUs
are not currently available because the equipment for collecting 3D facial motion training
1 http://www.vision1.com/.
data is not available in the lab; and (2) a working multiple-camera system is not accessible. Once those conditions are fulfilled, evaluating the tracking algorithms will be straightforward.
Note that the tracking algorithms developed in this thesis can be used to track not only the face and facial motion but also the motion of other non-rigid and highly articulated objects (e.g., human hands and bodies).
6.2.3 Refine audio-to-visual mapping
In the future, personal digital assistants (PDAs) will be very popular in wireless communication and will enable face-to-face style communication. However, in many situations, limited bandwidth permits only audio (but not video) transmission. Using the audio speech to animate a synthetic talking face provides an effective solution. It is therefore important to develop a real-time, speech-driven, highly lip-readable synthetic talking head.
Research can be conducted to investigate how to incorporate dynamic Bayesian networks
(DBNs) [91] into the audio-to-visual mapping. There has been increasing interest in ap-
plying DBNs to speech recognition in recent years [73], [91]. DBNs use factored state
representation, which requires exponentially fewer parameters than HMMs. Factored
state representation also enables DBNs to explicitly represent many phenomena that can-
not be directly modeled by HMMs, for example articulator positions, speaker-gender, and
speaking rate. Therefore, DBNs are more interpretable and computationally efficient
than HMMs. DBNs will enable us to train the audio-to-visual mapping in a more pur-
poseful way.
6.2.4 Human perception on synthetic talking face
It is human beings who will finally enjoy face animation. Human perceptual experiments
should be designed to develop and test hypotheses about stimulus characteristics of the
auditory visual speech signal related to enhanced human speech perception. Experimental
results should be fed back to guide the engineering research and improve the model of the
synthetic talking face. Future versions of the integrated framework for face modeling, facial motion analysis, and synthesis research should include humans in the loop.
6.2.5 Improve the tongue models
Besides facial animation, tongue animation contributes to enhanced speech understand-
ing. It is important to incorporate tongue motion in a version of our synthetic talking face
to obtain comparison perceptual data. Similar to the face model and animation research,
the visual representation of the tongue deformation should be first decided and learned
from real data. There is a publicly available X-ray Microbeam Speech Production Data-
base collected by the University of Wisconsin [85]. It consists of simultaneous acoustic
and kinematic recordings for speech collected from more than 50 normal American Eng-
lish speakers. To track kinematic signals, small gold pellets were used as markers and
glued to the following locations: along the midline length of the tongue, incisors, one mo-
lar tooth of the mandible, and in the midline at the vermillion border of each lip.
6.3 Improving the Key Frames of the iFACE System
Currently, the key frames of the iFACE system are created manually. Some regions of the
key frames are not well done, which degrades the animation results. When the experi-
mental conditions allow, those key frames can be refined by using real data. For example,
a subject can be carefully selected. His/her 3D facial deformations can be captured using
multiple camera systems while a set of markers is put on his/her face.
7 REFERENCES
[1] K. Aizawa and T. S. Huang, “Model-based image coding,” Proc. IEEE, vol. 83, pp.
259-271, Aug. 1995.
[2] S. Basu and A. Pentland, “A three-dimensional model of human lip motions trained
from video,” in Proc. IEEE Non-Rigid and Articulated Motion Workshop at
CVPR’97, San Juan, June 1997, pp. 46-53.
[3] S. Basu, N. Oliver, and A. Pentland, “3D modeling and tracking of human lip mo-
tions,” in Proc. ICCV’98, Bombay, India, January 1998.
[4] T. Beier and S. Neely, “Feature-based image metamorphosis,” in SIGGRAPH’92,
Chicago, IL, 1992, pp. 35-42.
[5] J. R. Bergen and R. Hingorani, “Hierarchical motion-based frame rate conversion,”
Technical Report, David Sarnoff Research Center, Princeton, New Jersey, April
1990.
[6] A. Blake, R. Cuiwen, and A. Zisserman, “Affine-invariant contour tracking with
automatic control of spatiotemporal scale,” in Proc. ICCV’93, Berlin Germany,
May 1993, pp. 66-75.
[7] A. Blake, M. A. Isard and D. Reynard, “Learning to track the visual motion of con-
tours,” Artificial Intelligence, vol. 78, pp. 101-134, 1995.
[8] M. Brand, “Voice puppetry,” in SIGGRAPH’99, 1999.
[9] C. Bregler and Y. Konig, “Eigenlips for robust speech recognition,” In Proc. Int.
Conference on Acoustic, Speech, Signal Processing, Adelaide, 1994, pp. 669-672.
[10] C. Bregler, M. Covell, and M. Slancy, “Video rewrite: Driving visual speech with
audio,” in SIGGRAPH’ 97, 1997.
[11] C. Carlson and O. Hagsand, “DIVE - A platform for multi-user virtual environ-
ments,” Computer and Graphics, vol. 17, no. 6, pp. 663-669, 1993.
[12] M. Chan, “Automatic lip model extraction for constrained contour-based tracking,”
in Proc. Int. Conf. of Image Processing, Kobe, Japan, 1999.
[13] C. S. Choi, K. Aizawa, H. Harashima, and T. Takebe, “Analysis and synthesis of facial image sequences in model-based image coding,” IEEE Transactions on Cir-
cuits and Systems for Video Technology, vol. 4, pp. 257-275, June 1994.
[14] T. Chen, and R. R. Rao, “Audio-visual integration in multimodal communications,”
Proceedings of the IEEE, vol. 86, no. 5, pp. 837--852, May 1998.
[15] M. M. Cohen and D. W. Massaro, “Modeling coarticulation in synthetic visual
speech,” in Models and Techniques in Computer Animation, N.M. Thalmann and
D. Thalmann, eds. Tokyo: Springer-Verlag, 1993, p. 139-156.
[16] R. A. Cole, D. W. Massaro, J. de Villiers, B. Rundle, K. Shobaki,
J. Wouters, M. M. Cohen, J. E. Beskow, P. Stone, P. Connors,
A. Tarachow, and D. Solcher, “New tools for interactive speech and
language training: Using animated conversational agents in the
classrooms of profoundly deaf children,” in Proceedings of ESCA/SOCRATES
Workshop on Method and Tool Innovations for Speech
Science Education, London, UK, Apr 1999.
[17] T. F. Cootes, C. J. Taylor, et al., “Active shape models – their training and applica-
tion,” Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38-59, Jan.
1995.
[18] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance models,” in H.
Burkhardt and B. Neumann, eds., 5th European Conference on Computer Vision,
vol. 2, 1998, pp. 484-498.
[19] M. Covell and C. Bregler, “Eigen-points”, in Proc. IEEE Int. Conf. on Image Proc-
essing, vol.3, 1996, pp 471-474.
[20] S. Curinga, F. Lavagetto, F. Vignoli, “Lip movements synthesis using time-delay
neural networks”, in Proc. EUSIPCO-96, Trieste, 1996.
[21] D. DeCarlo and D. Mataxas, “Optical flow constraints on deformable models with
applications to face tracking”, Int. Journal of Computer Vision, vol. 38, no. 2, pp.
99-127, 2000.
[22] P. Eisert, T. Wiegand, and B. Girod, “Model-aided coding: A new approach to in-
corporate facial animation into motion-compensated video coding,” IEEE Transac-
tions on Circuits and Systems for Video Technology, vol. 10, no. 3, pp. 344-358,
Apr. 2000.
[23] P. Ekman and W. V. Friesen, “Facial action coding system,” Palo Alto, Calif.: Con-
sulting Psychologists Press, Inc., 1978.
[24] P. Ekman, T. S. Huang, T.J. Sejnowski and J.C. Hager, eds., Final report to NSF
of the planning workshop on facial expression understanding, Human Interaction
Laboratory, University of California, San Francisco, March, 1993.
[25] I. A. Essa and A. Pentland, “Coding Analysis, Interpretation, and Recognition of
Facial Expressions,” IEEE Transaction Pattern Analysis and Machine Intelligence,
vol. 10, no. 7, pp. 757 - 763, Jul. 1997.
[26] T. Ezzat and T. Poggio, “Visual speech synthesis by morphing visemes”, Interna-
tional Journal of Computer Vision 38(1), pp. 45-57, 2000.
[27] O. Faugeras, Three-Dimensional Computer Vision: a Geometric Viewpoint, MIT
Press, 1993.
[28] T. Goto, M. Escher, C. Zanardi, N.M. Thalmann "MPEG-4 based animation with
face feature tracking". CAS '99 (Eurographics Workshop on Animation and Simula-
tion), Milano, Italy, September. 7-8 1999.
[29] B. Guenter et al. “Making faces”, in Proc. SIGGRAPH '98, 1998.
[30] P. Hong, “Facial expressions analysis and synthesis,” MS thesis, Computer Sci-
ence and Technology, Tsinghua University, July, 1997.
[31] P. Hong, T. Huang, and X. Lin, “Mouth motion learning and generating from ob-
servation,” in IEEE Workshop on Multimedia Signal Processing, Dec. 7-9, 1998.
[32] P. Hong, Z. Wen, and T. S. Huang, “iFACE: a 3D synthetic talking face,” Interna-
tional Journal of Image and Graphics, vol. 1, no. 1, pp. 1-8, 2001.
[33] A. Hyvärinen. “Survey on independent component analysis,” Neural Computing
Surveys, vol. 2, pp. 94-128, 1999.
[34] I. T. Jolliffe, Principal Component Analysis, Springer-Verlag, New York, 1986.
[35] P. Kalra, A. Mangili, N. Magnenat Thalmann, D. Thalmann, “Simulation of facial
muscle actions based on rational free form deformations,” in Proc. Eurographics
'92, pp. 59-69.
[36] M. Kass, A. Witkin and D. Terzopoulos, “Snakes: Active contour models,”
International Journal of Computer Vision, vol. 1, no. 4, pp. 321-331, 1988.
[37] R. Kaucic and A. Blake, “Accurate, real-time, unadorned lip tracking,” in Proc.
ICCV’98, pp. 370-375.
[38] M. Kirby and L. Sirovich, "Application of the Karhunen-Loeve procedure for the
characterization of human faces," IEEE Transaction Pattern Analysis and Machine
Intelligence, vol. 12, pp. 103-108, 1990.
[39] S. Kshirsagar and N. Magnenat-Thalmann, “Lip synchronization using linear pre-
dictive analysis,” in Proceedings of IEEE International Conference on Multimedia
and Expo, New York, August 2000.
[40] F. Lavagetto, “Converting speech into lip movements: A multimedia telephone for
hard of hearing people,” IEEE Transactions on Rehabilitation Engineering, vol. 3,
no. 1, March 1995.
[41] Y. C. Lee, D. Terzopoulos and K. Waters, “Realistic modeling for facial anima-
tion,” in SIGGRAPH’95, pp. 55-62.
[42] W. H. Leung, K. Goudeaux, S. Panichpapiboon, S. B. Wang and T. Chen, “Net-
worked intelligent collaborative environment (NetICE),” in Proceeding of IEEE
Intl. Conf. on Multimedia and Expo., New York, 2000.
[43] J. P. Lewis, “Automated lip-sync: Background and techniques,” J. Visualization
and Computer Animation, vol. 2, pp. 118-122, 1991.
[44] H. Li, P. Roivainen and R. Forchheimer, “3-D motion estimation in model-based
facial image coding,” IEEE Trans. On Pattern Analysis and Machine Intelligence,
vol. 15, no. 6 pp. 545-555, 1993.
[45] A. Löfqvist, “Speech as audible gestures,” In W. J. Hardcastle and A. Marchal,
eds., Speech Production and Speech Modeling, Dordrecht: Kluwer Academic Pub-
lishers, pp. 289-322.
[46] B. D. Lucas and T. Kanade, “An iterative image registration technique with an ap-
plication to stereo vision,” in Proceedings of International Joint Conference on Ar-
tificial Intelligence, pp. 674-679, 1981.
[47] J. Mandeville, J. Davidson, D. Campbell, et al., “A shared virtual environment for
architectural design review,” CVE'96 Workshop Proceedings, Nottingham, UK,
1996.
[48] D. W. Massaro, Speech Perception by Ear and Eye: A Paradigm for Psychological
Inquiry, Hillsdale, NJ: Lawrence Erlbaum Associates, 1987.
[49] D. W. Massaro, Perceiving Talking Faces, MIT Press, 1998.
[50] D. W. Massaro, J. Beskow, et al. “Picture my voice: audio to visual speech synthe-
sis using artificial neural networks”, in Proc. AVSP'99, Santa Cruz, USA.
[51] K. Matsuno and S. Tsuji, "Recognizing human facial expressions in a potential
field," in Proc. ICPR, 1994, pp. 44-49.
[52] I. Matthews, T. Cootes, et al., “Lipreading from shape shading and scale,” in Proc.
Auditory-Visual Speech Processing, Terrigal, Australia, 1998, pp.73-78.
[53] S. Morishima, K. Aizawa and H. Harashima, “An intelligent facial image coding
driven by speech and phoneme,” in Proc. IEEE ICASSP, Glasgow, UK, 1989, pp.
1795.
[54] S. Morishima and H. Harashima, “A media conversion from speech to facial image
for intelligent man-machine interface”, IEEE J. Selected Areas in Communications,
vol. 4, pp. 594-599, 1991.
[55] S. Morishima, “Real-time talking head driven by voice and its application to com-
munication and entertainment,” in Proceedings of the International Conference on
Auditory-Visual Speech Processing, 1998, Terrigal, Australia.
[56] J. L. Mundy and A. Zisserman. Geometric Invariance in Computer Vision. MIT
Press, 1992
[57] K. Nagao and A. Takeuchi, “Speech dialogue with facial displays,” in Proc. 32nd
Annual Meeting of the Asso. for Computational Linguistics, 1994, pp. 102-109.
[58] M. Nahas, H. Huitric, and M. Saintourens, “Animation of a B-spline figure,” The
Visual Computer, vol. 3, pp. 272-276, 1988.
[59] G. M. Nielson, “Scattered Data Modeling,” IEEE Computer Graphics and Applica-
tions, vol. 13, no. 1, pp. 60-70, 1993.
[60] NTT Software Corporation Interspace, 3D virtual environment.
[61] I. Pandzic, J. Ostermann, D. Millen, “User evaluation: Synthetic talking faces for
interactive services,” The Visual Computer, vol. 15, issue 7/8, pp. 330-340, No-
vember 1999.
[62] F. I. Parke, “A parametric model of human faces,” Ph.D. thesis, University of Utah,
1974.
[63] F. I. Parke, “A parameterized model for facial animation”, IEEE Computer Graph-
ics and Applications, vol. 2, no. 9, pp. 61-70, 1982.
[64] F. I. Parke and K. Waters. Computer Facial Animation. AKPeters, Wellesley, Mas-
sachusetts, 1996.
[65] A. Pearce, B. Wyvill, G. Wyvill, and D. Hill, “Speech and expression: A computer
solution to face animation,” Graphics Interface 1986.
[66] C. Pelachaud, N. I. Badler, and M. Steedman, “Linguistic issues in facial anima-
tion,” in N. M. Thalmann and D. Thalmann, eds. Computer Animation ’91 Tokyo:
Springer-Verlag.
[67] F. Pighin, et al., “Synthesizing realistic facial expressions from photographs”, in
Proc. SIGGRAPH ’98, 1998.
[68] S. M. Platt and N. I. Badler, “Animating facial expression,” in SIGGRAPH’81, pp.
245-252.
[69] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in
speech recognition,” Proc. of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[70] R. Rao and T. Chen, “Exploiting audio-visual correlation in coding of talking head
sequences,” in Picture Coding Symposium ’96, Melbourne, Australia, March 1996.
[71] L. Reveret and C. Benoit “A new 3D lip model for analysis and synthesis of lip
motion in speech production,” in Proc. of the Second ESCA Workshop on Audio-
Visual Speech Processing, Terrigal, Australia, Dec. 1998.
[72] J. Shi and C. Tomasi, “Good features to track,” in Proceedings of IEEE Conference
on Computer Vision and Pattern Recognition, 1994, pp. 593-600.
[73] T. A. Stephenson, H. Bourlard, S. Bengio and A. C. Morris, “Automatic speech
recognition using dynamic Bayesian Networks with both acoustic and articulatory
variables,” in Proceedings of 6th International Conference on Spoken Language
Processing, 2000.
[74] D. G. Stork and M. E. Hennecke, eds., Speechreading By Humans and Machines,
NATO ASI Series, Springer, 1996.
[75] H. Tao and T. S. Huang, “Explanation-based facial motion tracking using a piece-
wise Bezier volume deformation model,” in Proc. IEEE Computer Vision and Pat-
tern Recognition, 1999.
[76] D. Terzopoulos and K. Waters, “Techniques for realistic facial modeling and ani-
mation,” In M. Magnenat-Thalmann and D. Thalmann, eds., Computer Animation
’91, Tokyo, 1991. Springer-Verlag.
[77] D. Terzopoulos and K. Waters, “Analysis and synthesis of the facial image se-
quences using physical and anatomical models,” IEEE Transaction on Pattern
Analysis and Machine Intelligence, vol. 15, no. 6, pp. 569 - 579, Jun. 1993.
[78] C. Tomasi and T. Kanade, “Detection and tracking of point features,” Carnegie
Mellon University Technical Report CMU-CS-91-132, April 1991.
[79] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of Cognitive Neu-
roscience, pp. 71 - 86, 1991.
[80] M. L. Viaud and H. Yahia, “Facial animation with wrinkles,” in Third Workshop
on Animation, Eurographics ’92, Cambridge, 1992.
[81] F. Vignoli, S. Curinga, and F. Lavagetto, “A neural clustering architecture for esti-
mating visible articulatory trajectories,” in Proc. ICANN96, Bochum, July 1996,
pp. 863-869.
[82] K. Waters, “A muscle model for animating three-dimensional facial expressions,”
Computer Graphics, vol. 21, no. 4, pp. 17-24, July 1987.
[83] K. Waters and T. M. Levergood, “DECface, an automatic lip-synchronization algo-
rithm for synthetic faces,” Digital Equipment Corporation, Cambridge Research
Lab, Technical Report CRL 93-4.
[84] K. Waters, J. M. Rehg, M. Loughlin, et al., “Visual sensing of humans for active
public interfaces,” Cambridge Research Lab, Technical Report CRL 96-5.
[85] J. Westbury, E. J. Severson, and M. Hashi, X-ray microbeam speech production
database user’s handbook, Madison, WI. 1994.
[86] L. Williams, “Performance-driven facial animation”, Computer Graphics, no. 24,
vol. 2, pp. 235-242, Aug. 1990.
[87] H. Yehia, P. Rubin, and E.V. Bateson, “Quantitative association of vocal-tract and
facial behavior”, Speech Communication, vol. 26, pp. 23-43, 1998.
[88] A. Yuille, P. Hallinan, and D. Cohen, “Feature extraction from faces using deform-
able templates,” Int. Journal of Computer Vision, vol. 8, no. 2, pp. 99-111, 1992.
[89] S. Zacks, The Theory of Statistical Inference. Wiley, New York. 1971.
[90] Z. Wen, Tongue and teeth modeling for face modeling and animation, Master The-
sis, Computer Science, University of Illinois at Urbana Champaign, 199.
[91] G. Zweig, “Speech recognition with dynamic Bayesian networks,” Ph.D. thesis,
Computer Science, UC Berkeley, 1998.
[92] “Text for CD 14496-2 Video,” ISO/IEC JTC1/SC29/WG11 N1902, Nov. 1997.
8 VITA
Pengyu Hong was born on May 16, 1973, in Zhangzhou, P. R. China. He received the
Bachelor of Engineering degree and Master of Engineering degree from Tsinghua Uni-
versity, Beijing, China, in 1995 and 1997 respectively. Both degrees are in Computer
Science.
In August 1997, Mr. Hong joined the Ph.D. program at the Department of Computer Sci-
ence of the University of Illinois at Urbana-Champaign, Urbana, Illinois, US. He works
as a research assistant in the Image Formation and Processing Laboratory at the Beckman
Institute for Advanced Science and Technology.
His research interest covers a broad scope in image and video processing, human com-
puter interaction, computer graphics, computer vision and pattern recognition, and ma-
chine learning. He is the senior author of 20 technical papers, and two book chapters.
Mr. Hong's research focuses on pattern recognition, computer vision and computer graph-
ics with their applications in Human Computer Interaction. His work on face modeling,
facial motion analysis, and synthesis results in a face-based multimedia information con-
version interface. His work on unsupervised pattern extraction automatically searches for
temporal and spatial regularities in a large database.