Copyright by Pengyu Hong, 2001
AN INTEGRATED FRAMEWORK FOR FACE MODELING, FACIAL MOTION ANALYSIS AND SYNTHESIS
BY
PENGYU HONG
B.Engr., Tsinghua University, 1995
M.Engr., Tsinghua University, 1997
THESIS
Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Computer Science
in the Graduate College of the University of Illinois at Urbana-Champaign, 2001
Urbana, Illinois
ABSTRACT
This dissertation presents an integrated framework for face modeling, facial motion
analysis, and facial motion synthesis. This framework systematically addresses three
closely related research issues: (1) selecting a quantitative representation of facial defor-
mation for face modeling and animation; (2) automatic facial motion analysis based on
the same visual representation; and (3) speech-to-facial-coarticulation modeling. The
framework provides a guideline for methodically building a face modeling and animation
system. The systematic nature of the framework is reflected in the links among its components, whose details are presented. Based on this framework, a face modeling and animation system, called the iFACE system, is developed. The system provides functionalities for customizing a generic face model for an individual, text-driven face animation, off-line speech-driven face animation, and real-time speech-driven face animation.
ACKNOWLEDGMENTS
I would like to take this opportunity to express my appreciation to all the people who
have guided, supported, and encouraged me.
First, my sincere gratitude goes to my mentor, Professor Thomas S. Huang, for his ad-
vice, support, and encouragement. His kindness and humor made my research environ-
ment much more enjoyable. I thank my committee members, Professor Sylvian R. Ray,
Professor Michael Garland, and Professor David Goldberg for their invaluable comments.
I should particularly thank Dr. Harry Shum at Microsoft Research Lab and Dr. Jialin
Zhong at Bell Labs. In the summer of 1998, I worked for Dr. Zhong as an intern on the
text-to-visual-speech project. I worked for Dr. Shum as a summer intern on the facial mo-
tion modeling and tracking project in 2000. I really enjoyed the inspiring discussions with
Dr. Zhong and Dr. Shum and their invaluable suggestions. My continued efforts in these
directions produced fruitful results that have become important components of this
thesis.
I especially thank Professor Stephen E. Levinson. During my thesis work on speech-
driven face animation, Professor Levinson was always ready to provide generous and
stimulating suggestions based on his expertise in speech processing and recognition.
My lab colleagues have also provided critical help. I especially thank Zhen Wen, You
Zhang, Larry Chen, Steve Chu, Roy Wang, and Ira Cohen. We worked together on the
AVATAR Demo for the Army Research Laboratory Symposium 2000. I specifically
thank Zhen Wen, my closest partner, who has been particularly cooperative and suppor-
tive. I thank other former and current graduate students in our group for the discussions
and support. Thanks also go to Dr. Russell L. Storms and Dr. Larry Shattuck, who
kindly provided their face data for this research.
Many individuals provided additional valuable technical assistance. The constant, prompt
help of our system administrators, Gabriel Lopez-Walle, Rachael Brady, and Hank
Kaczmarski, made the research environment much more comfortable. The secretaries,
Sharon Collins, Wendy Harris, and Kathie Alblinger, were always cheerful and helpful.
Finally, there are others who have influenced this research indirectly, but fundamentally,
through their influence on my life. They are my parents, sister, and brother-in-law, whose
love, patience, and encouragement made this research possible. THANKS!!!
TABLE OF CONTENTS

CHAPTER

1 INTRODUCTION
   1.1 Overview
   1.2 Previous Research
      1.2.1 Face modeling
      1.2.2 Face animation
      1.2.3 Facial motion analysis
   1.3 The Approach – An Integrated Framework for Face Modeling, Facial Motion Analysis, and Synthesis

2 MOTION UNITS AND FACE ANIMATION
   2.1 Collect Training Data for Learning Motion Units
   2.2 Learning Motion Units
   2.3 Use MUs to Animate Face Model
      2.3.1 MU and key frame
      2.3.2 MU and MPEG-4 FAP
   2.4 Discussion

3 MU-BASED FACIAL MOTION TRACKING
   3.1 Model Initialization
   3.2 Tracking as a Weighted Least Square Fitting Problem
      3.2.1 Modelless tracking
      3.2.2 Constrained by MUs
   3.3 Improving the MU-based Facial Motion Tracking Algorithm
   3.4 Experimental Results
   3.5 Discussion
   3.6 3D MU-based Facial Motion Tracking
   3.7 3D MU-based Facial Motion Tracking Using Multiple Cameras
   3.8 3D MU-BSV-based Facial Motion Tracking

4 MU-BASED REAL-TIME SPEECH-DRIVEN FACE ANIMATION
   4.1 Linear Audio-to-Visual Mapping
   4.2 Local Linear Audio-to-Visual Mapping
   4.3 Nonlinear Audio-to-Visual Mapping Using ANN
   4.4 Experimental Results
      4.4.1 Collect training and testing data
      4.4.2 Implementation
      4.4.3 Evaluation
      4.4.4 A speech-driven face animation example

5 THE IFACE SYSTEM
   5.1 Introduction
   5.2 Generic Face Model
   5.3 Customize the Face Model
   5.4 Face Deformation Control Model
   5.5 Text Driven Face Animation
   5.6 Off-line Speech-driven Face Animation
   5.7 Real-Time Speech-driven Face Animation
   5.8 The iFACE System in the Distributed Collaborative Environments

6 CONCLUSIONS AND FUTURE WORK
   6.1 Summary
   6.2 Future Research
      6.2.1 Explore better visual representation
      6.2.2 Improve and evaluate the facial motion tracking algorithm
      6.2.3 Refine audio-to-visual mapping
      6.2.4 Human perception on synthetic talking face
      6.2.5 Improve the tongue models
   6.3 Improving the Key Frames of the iFACE System

REFERENCES

VITA
LIST OF TABLES

Table 2.1 MU, ASM, AAM, eigenlips and eigen-points.
Table 4.1 Real-time speech driven evaluation I.
Table 4.2 Real-time speech driven evaluation II.
Table 4.3 Real-time speech driven evaluation III.
Table 4.4 Real-time speech driven evaluation IV.
Table 5.1 Phoneme and viseme used in the iFACE system.
LIST OF FIGURES
Figure 1.1 An integrated framework for face modeling, facial motion analysis, and synthesis.
Figure 2.1 An example of the labeled data and the mesh model.
Figure 2.2 Facial muscles.
Figure 2.3 MUs.
Figure 2.4 MPEG-4 feature points.
Figure 2.5 The facial animation parameter units.
Figure 3.1 Model initialization for tracking.
Figure 3.2 Comparison of the tracking results on an unmarked face using the MU-based facial motion tracking algorithm, template matching, and the KLT trackers.
Figure 3.3 Comparison of the tracking results on a marked face using the MU-based facial motion tracking algorithm, template matching, and the KLT tracker.
Figure 4.1 Local linear audio-to-visual mapping.
Figure 4.2 MLP for nonlinear audio-to-visual mapping.
Figure 4.3 The estimation results of the global linear mapping.
Figure 4.4 The estimation results of the local linear mapping.
Figure 4.5 The estimation results of the nonlinear mapping using neural networks.
Figure 5.1 The generic geometry face model.
Figure 5.2 An example of the Cyberware™ cyberscanner data.
Figure 5.3 The coarse model in 2D cylindrical coordinate space.
Figure 5.4 The landmarks divide the head surface into many local rectangular regions in the cylindrical coordinate space.
Figure 5.5 Select feature points on the texture map.
Figure 5.6 A semi-finished face model and the model editor.
Figure 5.7 Examples of the customized face model.
Figure 5.8 The control model.
Figure 5.9 Local affine transformation for facial surface deformation.
Figure 5.10 Create facial shape using the model editor.
Figure 5.11 Examples of facial expressions and visemes.
Figure 5.12 The architecture of text driven face animation.
Figure 5.13 The architecture of off-line speech-driven face animation.
Figure 5.14 An example of off-line speech-driven face animation.
Figure 5.15 An example of nonlinear real-time speech-driven face animation.
Figure 5.16 A shoulder model is added to the face model.
Figure 5.17 The iFACE system in a distributed collaborative environment.
CHAPTER 1
1 INTRODUCTION
Synthetic graphic talking faces provide an effective solution for delivering and displaying
communication information. The applications include 3D model-based very low bit rate
video coding for visual telecommunication [1], [55], video conferencing [13], and talking
head representations of computer agents [57], [84]. Research has consistently shown that
the perception of speech is inherently multimodal [48], [49], [74]. In noisy environ-
ments, a synthetic talking face can help users to understand the associated speech [48],
and it helps people react more positively in interactive services [61], for example, for E-
commerce. A synthetic talking face has also been found to help students learn better in
computer-aided education [16].
Graphic avatars have been developed to enhance conversational cues in multiple user
immersive collaboration environments (e.g., DIVE [11], GreenSpace [47], Interspace
[60], and NetICE [42]). An important research issue in developing avatars is how to natu-
rally and realistically animate the faces of the avatars. In many real world situations, such
as field collaboration, participants are mobile. Therefore, stable high bandwidth cannot be
guaranteed. A real-time speech-driven graphic avatar provides an effective solution. Dis-
tant participants can be represented as graphic avatars and displayed in the immersive
environments. The faces of the avatars are driven by speech that only requires very low
bandwidth to transmit.
1.1 Overview
This thesis is organized as follows. In the remainder of Chapter 1, we review previous
research and present an integrated framework for building a face modeling, facial motion
analysis and synthesis system. The framework provides a guideline to systematically de-
velop face modeling and animation systems. The details of the framework are described
in Chapters 2, 3, and 4. First, a quantitative representation of facial de-
formation, called Motion Unit (MU), is introduced in Chapter 2. MU is the core compo-
nent of the framework. It will be shown how to use MU for realistic face animation. In
Chapter 3, MUs are used to develop a robust MU-based facial motion tracking algorithm.
In Chapter 4, the tracking algorithm is used to analyze facial movements and an audio-
visual dataset is collected. Two approaches for training real-time audio-to-visual map-
pings are described. Experimental results of the facial motion tracking and real-time au-
dio-to-visual mappings are shown. Based on this framework, we developed a face model-
ing and animation system, called the iFACE system [32], which will be presented in
Chapter 5. The demos of the iFACE system can be found at the following web page:
http://www.ifp.uiuc.edu/~hong/Research/face.htm. Finally, this thesis closes with some
conclusions and future research directions.
1.2 Previous Research
This section reviews previous research on face modeling, facial motion analysis and syn-
thesis, and speech-driven face animation. There has been a large amount of research on
face modeling and animation [24], [64]. One main goal of face modeling is to investigate
how to deform a facial surface spatially, or develop a facial deformation control model.
The key research issue of face animation is how to deform a facial surface temporally, or
construct a facial coarticulation model. To realistically animate the face model, analysis
of real facial motion is required for modeling the facial coarticulation. It has been shown
that facial coarticulation is highly correlated with the vocal tract [87]. Speech is an im-
portant medium that has been used to drive a face model. Speech-driven face animation
not only needs to deal with face modeling and animation, but also needs to develop a
mapping from audio to facial coarticulation.
1.2.1 Face modeling
Human faces are commonly modeled as free-form geometric mesh models [32], [35],
[58], [75], [86], parameterized geometric mesh models [62], [63], [65], or physics-based
models [41], [76], [82]. Each face model has its deformation control model.
• Free-form face model
Free-form face model approaches explicitly define a control model to deform the face
model. Once the coordinates of the control points are decided, the remaining vertices on
the face model are deformed by interpolation. Popular interpolation functions include
affine functions [32], B-spline functions [58], cardinal splines and springs [80], radial
basis functions [59], [86], the combination of affine functions and radial basis functions
[67], rational functions [35], and the Bezier volume model [75].
One of the main research issues of free-form face modeling is how to design an interpola-
tion mechanism that is faithful to real face deformation. So far, no objective evaluation
experiments have been done for the above interpolation methods. If the density of the
control points is high enough, the above deformation methods can be used to approximate
the facial surface. However, a high density of control points raises another problem:
how to move those control points. Manual adjustment can achieve good results, but is
difficult and labor intensive. Automatic adjustment is itself an open question. Adjusting
control points requires considering the relations among them, because they are not
independent. The above free-form face modeling methods do not address the relations
among control points in a theoretically sound way.
• Parameterized face model
Parameterized mesh models use a set of parameters to decide the shapes of the face mod-
els [62], [63], [65]. The coordinates of some anchor vertices are first calculated using a
set of predefined functions whose variables are those parameters. The coordinates of the
remaining vertices are then calculated by a set of predefined interpolation functions
whose variables are those parameters and the coordinates of those anchor vertices. How-
ever, there is no systematic way or theoretical basis for designing those functions (both
for the anchor vertices and the remaining vertices), deciding the values of the parameters in the
function, and choosing anchor vertices.
• Physics-based face model
Physics-based models simulate facial skin, tissue, and muscles by multilayer dense
meshes [41], [76], [82]. Facial surface deformation is triggered by the contractions of the
synthetic facial muscles. The muscle forces are propagated through the skin layer, and
thereby deform the facial surface. The simulation procedure solves a set of dynamics
equations. However, the sophistication of the physical models of facial muscles, skin and
tissue makes physics-based model approaches computationally intensive. In addition, de-
termining the parameters of the physics-based face models is an art.
1.2.2 Face animation
Once the facial deformation control model is decided, a face model can be animated by
temporally adjusting its parameters according to its facial coarticulation model.
• Function-based facial coarticulation model
Some approaches model facial coarticulation explicitly with certain functional forms [15],
[66]. Pelachaud et al. [66] used a look-ahead model for visual speech synthesis. They use
the Facial Action Coding System (FACS), which was proposed by P. Ekman and W.
Friesen [23], to describe facial deformations. FACS is based on anatomical studies on
facial muscular activity and it enumerates all Action Units (AUs) of a face that cause fa-
cial movements. Currently, FACS is widely used as the underlying visual representation
for facial motion analysis, coding, and animation. In their face animation system [66],
AUs are manually designed and are assumed to be additive. In other words, facial defor-
mation can be calculated by a linear combination of AUs. Phonemes1 are assigned
high or low deformability ranks. A set of forward and backward coarticulation rules is
intuitively designed to link the speech intonation and emotion with facial deformation.
The rules describe a set of functions that are used to compute the intensity of a facial ac-
tion unit in proportion to the speech rate.
1 A phoneme is a member of the set of the smallest units of speech that serve to distinguish one utterance from another in a language or dialect.
Cohen and Massaro [15] used a parameterized geometrical face model, which is a de-
scendant of Parke’s face model [62]. They adopt the Löfqvist gestural production model
[45] as the facial coarticulation model to drive the face model. Although the Löfqvist ges-
tural production model is based on empirical observations, it is explicit form is designed
subjectively. In addition, the Löfqvist gestural production model requires that the pho-
neme sequence should be known.
Neither of the approaches described in [15] and [66] is appropriate for real-time online face
animation. In addition, the coarticulation functions are designed subjectively and may not
well represent real facial dynamics.
• Performance-driven face animation
The philosophy of performance-driven face animation approaches is that wool comes
from the sheep. This kind of approach automatically analyzes real facial movements us-
ing computer vision techniques. The analysis results are used to animate graphic face
models. Therefore, they can achieve natural face animation by using information about
real facial deformation.
Williams [86] and Guenter et al. [29] used simple computer vision techniques to track the
markers on the face of a human subject. The tracking results are used directly to deform
the face models. In [86], each marker corresponds to a control point on the face model. A
set of warping kernels is designed and used to deform the vertices around the control
points. In [29], vertices are moved by a linear combination of the offsets of the nearest
markers. Nonetheless, both Williams [86] and Guenter et al. [29] required that intrusive
markers be put on the face of the subject. As will be discussed in Section 1.2.3, facial
motion analysis is very difficult if the face is not marked.
Other performance-driven face animation systems adopt analysis-based approaches [22],
[25], [44], [75], [77]. Analysis-based approaches extract information from a live video
sequence and use the extracted information for face animation. Such information corre-
sponds to muscle contractions of the physics-based face model in [77], the weights of the
AUs of the FACS in [44], [75], [77], or Moving Picture Experts Group 4 (MPEG-4) face
animation parameters (FAPs) [92] in [22]. The face of the subject in [77] is marked to
guarantee accurate tracking results. The tracking techniques used in [22], [44], [75], [77]
will be discussed in Section 1.2.3.
Although temporal information is correctly extracted to some degree, subjectivity is in-
troduced while deforming the face models. Physics-based face models require manually
deciding the values of a large number of physical parameters [77]. FACS was originally
proposed for psychology research [23]. It is more or less subjective and does not provide
quantitative information about face deformation. Users have to manually design AUs for
their face model. MPEG-4 FAPs provide the movement of only some facial features that
can be thought of as the control points of the face model. The rest of the face model still has
to be deformed by some predesigned warping/interpolation functions, which should be
addressed by facial deformation control models.
Overall, the problems are related to the facial deformation control models, which either
decide how to change the values of control parameters or are used to design AUs. If the
control model is just used for animation, only the animation results will be affected. The
tracking results will be greatly degraded if either inaccurate control models are used by
the tracking step or the animation results are fed back to the tracking step. Of course, cor-
rupted tracking results will further result in bad animation results. Therefore, research on
facial deformation control model, facial motion analysis, and face animation should be
carried out systematically.
• Speech-driven face animation
A problem with the performance-driven face animation approach is the speed and accu-
racy of its facial motion analysis algorithm. It requires high computation power in order
to obtain robust and accurate facial motion analysis results without putting intrusive
markers on the face of the actor/actress. An alternative way to drive the face model is
speech-driven face animation, which is more efficient than performance-driven face ani-
mation. This kind of approach takes advantage of the tight correlation between speech
and facial coarticulation. It takes speech signals as input and outputs a face animation se-
quence.
The audio-to-visual mapping is the main research issue of speech-driven face animation.
The audio information is usually represented as feature vectors of speech, for example,
linear predictive coding (LPC) Cepstrum, Mel-frequency cepstral coefficients (MFCC),
and so on. The visual information is usually represented as the parameters of the facial
deformation control model, for example, the weights of AUs, MPEG-4 FAPs, the coordi-
nates of control vertices of the face model, and so on. The mappings are learned from an
audio-visual training data set, which is collected in the following way. The facial
movements of talking subjects are tracked either manually or automatically. The tracking
results and the associated audio tracks are collected as the audio-visual training data.
Some speech-driven face animation approaches use phonemes or words as intermediate
representations. Lewis [43] used linear prediction to recognize phonemes. The recognized
phonemes are associated with mouth positions to provide keyframes for face animation.
However, the phoneme recognition rate of linear prediction is very low. Video Rewrite
[10] trains hidden Markov models (HMMs) [69] to automatically label phonemes in both
training audio track and new audio track. It models short-term mouth co-articulation us-
ing triphones. The mouth images for a new audio track are generated by reordering the
mouth images in the training footage, which requires a very large database. Video Re-
write is an offline approach and needs large computation resources. Chen and Rao [14]
train HMMs to segment the audio feature vectors of isolated words into state sequences.
Given the trained HMMs, the state probability for each time stamp is evaluated using the
Viterbi algorithm. The estimated visual features of all states can be weighted by the cor-
responding probabilities to obtain the final visual features, which are used for lip anima-
tion.
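As a concrete illustration of this weighting step (the notation below is mine, not taken from [14]): if $\gamma_t(j)$ denotes the probability of HMM state $j$ at time stamp $t$ and $\hat{\vec{v}}_j$ denotes the visual feature estimated for state $j$, then the final visual feature at time $t$ is

$$\hat{\vec{v}}_t = \sum_{j} \gamma_t(j)\, \hat{\vec{v}}_j .$$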
Another kind of HMM approach tries to map audio patterns to facial motion trajectories.
Voice Puppetry [8] uses an entropy minimization algorithm to train HMMs for the audio
to visual mapping. The mapping estimates a probability distribution over the manifold of
possible facial motions from the audio stream. A globally optimal closed-form solution is
derived to determine the most probable series of facial control parameters, given the be-
ginning and ending values of the parameters. An advantage of this approach is that it does
not require automatically recognizing speech into high-level meaningful symbols (e.g.,
phonemes, words), for which it is very difficult to obtain a high recognition rate. However, this
approach is an offline method.
Other approaches attempt to generate instantaneous lip shapes directly from each audio
frame using vector quantization, Gaussian mixture model, or artificial neural networks
(ANN). Vector quantization [53] is a classification-based audio-to-visual conversion ap-
proach. The audio features are classified into one of a number of classes. Each class is
then mapped onto a corresponding visual output. Though it is computationally efficient,
the vector quantization approach often leads to discontinuous mapping results. The Gaus-
sian mixture approach [70] models the joint probability distribution of the audio-visual
vectors as a Gaussian mixture. Each Gaussian mixture component generates an optimal
linear estimation for a visual feature given an audio feature. The estimations are then
nonlinearly weighted to produce the final visual estimation. The Gaussian mixture ap-
proach produces smoother results than the vector quantization approach. However, nei-
ther of the approaches described in [53] and [70] considers phonetic context information, which is
very important for modeling mouth coarticulation during speech. Moreover, they are
linear mappings, while the mapping from audio information to visual information is
nonlinear in nature.
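One common way to write the Gaussian mixture estimator sketched above (again, the notation is mine rather than that of [70]): with mixture components indexed by $k$ and joint audio-visual Gaussians with means $(\vec{\mu}_{a,k}, \vec{\mu}_{v,k})$ and covariance blocks $\Sigma_{aa,k}$ and $\Sigma_{va,k}$, the visual estimate for an audio feature $\vec{a}$ is

$$\hat{\vec{v}}(\vec{a}) = \sum_{k} p(k \mid \vec{a}) \left[ \vec{\mu}_{v,k} + \Sigma_{va,k} \Sigma_{aa,k}^{-1} (\vec{a} - \vec{\mu}_{a,k}) \right],$$

where each bracketed term is the per-component linear estimate and the posterior weights $p(k \mid \vec{a})$ depend nonlinearly on $\vec{a}$.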
Neural network based approaches try to find nonlinear audio-to-visual mappings. Mor-
ishima and Harashima [54] trained a three-layer neural network to map the LPC cepstrum
coefficients of a single time step of speech to mouth-shape parameters for five
vowels. Kshirsagar and Magnenat-Thalmann [39] also trained a three-layer neural net-
work to classify speech segments into vowels. Average energy was then used to modulate
the lip shapes of the recognized vowel. Nonetheless, again, the approaches described in
[39] and [54] do not consider phonetic context information, which is very important for
modeling mouth coarticulation during speech. In addition, they mainly consider the
mouth shapes of vowels and neglect the contribution of consonants during speech.
Massaro et al. [50] trained multilayer perceptrons (MLP) to map LPC cepstral parameters
to face animation parameters. They try to model the coarticulation by considering the
speech context information of five backward and five forward time windows. Another
way to model speech context information is to use the time delay neural network (TDNN)
model, which uses ordinary time delays to perform temporal processing. Lavagetto [40]
and Curinga et al. [20] train TDNN to map LPC cepstral coefficients of speech signal to
lip animation parameters. TDNN is chosen because it can model the temporal coarticula-
tion of lips and is more computationally efficient than HMM. Nevertheless, the artificial
neural networks used in [20], [40], [50] require a large number of hidden units, which
results in high computational complexity during the training phase. Vignoli et al. [81]
use self-organizing maps (SOM) as a classifier to perform vector quantization functions
and feed the classification results to a TDNN. The SOM reduces the input dimension of
the TDNN and hence the number of TDNN parameters. Therefore, the computational com-
plexity of TDNN is reduced. However, SOM is a hard decision classifier and its recogni-
tion results do not encode mouth coarticulation information. In order to reduce the input
dimension, SOM can only have a few nodes, which results in losing important audio in-
formation.
1.2.3 Facial motion analysis
As is shown in Section 1.2.2, facial motion analysis is very important. The analysis re-
sults can be used to directly drive the face model or train audio-to-visual mappings. There
has been a large amount of work done on facial feature tracking. Simple approaches only
utilize low-level image features. Their computational complexity is low, which makes them
suitable for real-time tracking tasks. For example, Goto et al. [28] extract edge information to find
salient facial feature regions (eyes, lips, etc). The extracted low-level image features are
compared with templates to estimate the shapes of the facial features.
However, it is not robust enough to use low-level image features alone. The errors will
quickly accumulate as the number of tracked frames increases. High-level
knowledge has been used to tackle this problem by imposing constraints on the possible
shapes/deformations of facial features. It has been shown that high-level knowledge is
essential for robust facial motion tracking. The tracking algorithm combines information
derived from low-level image processing and the high-level knowledge model to track
facial features. The high-level knowledge is usually explicitly represented as some kind
of shape model for facial features. Different model-based tracking algorithms differenti-
ate themselves by their shape models and their low-level image processing steps. We
summarize different model-based facial feature tracking algorithms according to their
high-level knowledge models below.
• B-spline curve
Blake et al. [6] proposed parametric B-spline curves for contour tracking. The tracking
problem is to estimate the control points of the B-spline curve so that the B-spline curve
matches the contour being tracked as closely as possible. A Kalman filter is incorporated
to track objects with high contrast edges. However, without global constraints, B-spline
curves tend to match contours locally, resulting in wrong matching among contour points,
which is called the sliding effect and is similar to the aperture problem in optical flow
calculation. The robustness of the algorithm could be improved by employing more sto-
chastic motion models [7]. The motion models can be learned from examples to represent
specific motion patterns. The motion model superimposes a constraint on the possible
solution subspace of the contour points. Therefore, it prevents generating physically im-
possible curves.
However, the absence of a sharp jump or sudden change around the lip boundary makes it
difficult to reliably track lip contours. Instead of using grey-level edge information, Kau-
cic and Blake [37] and Chan [12] utilized the characteristics of human skin color. They
proposed using either Bayesian classification or linear discriminant analysis to distin-
guish the lips from other areas of facial skin. Therefore, the contours of the lips can be ex-
tracted more reliably. It is well known that color segmentation is sensitive to lighting
conditions and the effectiveness of color segmentation depends on the subject. This can
be partially solved by training a color classifier for each individual. Nevertheless, the ap-
proaches described in [37] and [12] do not deal with rotation, translation and scaling of
lips.
• Snake
The Snake was first proposed by Kass et al. [36]. It starts from a given initial position and
deforms itself to match with the nearest salient contour. The matching procedure is for-
mulated as an energy minimization process. In basic Snake-based tracking, the function
to be minimized includes two energy terms: (1) internal spline energy caused by stretch-
ing and bending, and (2) measure of the attraction of image features such as contours. B-
spline [6] is a “least squares” style Snake algorithm (a Kalman filter). Snakes rely on
gray-level gradient information when measuring their energy terms. How-
ever, it is well known that gray-level gradients are inadequate for identifying the outer lip
contour [88]. Therefore, the facial features being tracked are highlighted by makeup in
[77]. Otherwise, Snakes very often align onto undesirable local minima.
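For reference, the basic Snake energy of Kass et al. [36] for a contour $\vec{v}(s)$, $s \in [0, 1]$, is commonly written (with constant weights $\alpha$ and $\beta$) as

$$E = \int_0^1 \left[ \frac{1}{2}\left( \alpha \|\vec{v}\,'(s)\|^2 + \beta \|\vec{v}\,''(s)\|^2 \right) + E_{\mathrm{image}}(\vec{v}(s)) \right] ds,$$

where the first two terms are the internal stretching and bending energies and $E_{\mathrm{image}}$ measures the attraction of image features such as edges; tracking deforms the contour toward a local minimum of $E$.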
To improve Snakes, Bregler and Konig [9] propose eigenlips that incorporate a lip shape
manifold into the Snake tracker for lip tracking. The shape manifold is learned from training
sequences of lip shapes. The function of the shape manifold is similar to the stochastic
motion model for B-spline in [7]. It imposes global constraints on the Snake. The local
search for maximum gray-level gradients is guided by the globally learned lip shape
space.
• Deformable template
In [88], a facial feature is defined as a deformable template, which includes a parametric
geometrical model and an imaging model. The deformable template treats tracking as an in-
terpretation by synthesis problem. The geometrical model describes how the shape of the
template can be deformed and is used to measure shape fitness of the template. The imag-
ing model describes how to generate an instance of the template and is used to measure
the intensity fitness of the template. An energy function is designed to link different types
of low-level image features, e.g., intensity, peaks, valleys, and edges, to the correspond-
ing properties of the template. The parameters of the template are calculated by steepest
descent. Both B-spline and Snake can be thought of as special cases of deformable tem-
plate, which only utilize edge information in the image. Nevertheless, the parametric fa-
cial feature models are usually defined subjectively.
• ASM, AAM, and eigen-points
Active Shape Model (ASM) [17], Active Appearance Model (AAM) [52], and eigen-
points [19] utilize both contour and appearance to model the facial features. The motiva-
tions of ASM, AAM, and eigen-points are similar to that of the deformable template [88].
They all treat tracking as an interpretation by synthesis problem. ASM, AAM, and eigen-
points try to achieve robust performance by using the high-level model to constrain solu-
tions to be valid examples of the object being tracked. The appearance of the object is
explained by the high-level model as a compact set of model parameters. The models
used by ASM, AAM, and eigen-points are the eigen-features of the object modeled.
ASM and eigen-points model the shape variation of a set of landmark points and the tex-
ture variation in the areas around landmark points. AAM models the whole shape and the
appearance of the object. All of them require manually labeling training data, which is
labor intensive. In order to handle various lighting conditions, the texture part of the
training data should cover a broad enough range of lighting conditions. Both ASM and AAM rely
on iterative solutions. The eigen-points approach avoids the iterative procedure. Instead,
it estimates the parameters in a sequence of matrix operations: orthogonal projection, scaling,
and orthogonal projection.
Since all three methods model the texture of the object, the user cannot put markers
on the object. The training data need to be carefully labeled so that the correspondences
between the landmarks across training samples are physically correctly established.
• Parametric 3D model
DeCarlo and Metaxas [21] propose an approach that combines a deformable model space
and multiple image cues (optical flow and edge information) to track facial motions. The
edge information used is chosen around certain facial features, such as the boundary of
the lips and eyes, and the top boundary of the eyebrows. To avoid high computational
complexity, optical flow is calculated only for a set of image pixels. Those image pixels
are chosen in the region covered by the face model using the method proposed by Shi and
Tomasi [72]. The deformable model [21] is a parametric geometric mesh model. The pa-
rameters are manually designed based on a system of anthropometric measurements of
the face. By changing the values of the parameters, the user can obtain a different basic
face shape and deform the basic face shape locally.
The deformable model [21] helps to prevent producing unlikely facial shapes during
tracking. However, it has limitations in its coverage: many facial motions cannot be
represented accurately, for example, many of the lip deformations produced during
speech.
• FACS-based 3D model
Some facial motion tracking algorithms design the high-level models based on AUs de-
fined by FACS [25], [44], [75]. The FACS based 3D models impose constraints on the
subspace of plausible facial shapes. The motion parameters are separated into global
face motion parameters (rotation and translation) and local facial deformation parameters,
which correspond to the weights of AUs in [44], [75] and to the FACS-like control pa-
rameters in [25]. First, the movements of the vertices on the model are calculated using
some kind of optical flow technique. The optical flow results are usually noisy. The
model is then used to constrain the optical flow, and the motion parameters are calculated
by a least squares estimator.
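A schematic form of this estimation step (my notation, not taken from [25], [44], or [75]): if $\vec{u}$ stacks the optical flow displacements observed at the model vertices and $L$ is the linearized model that maps the global motion parameters and AU weights $\vec{p}$ to vertex displacements, then

$$\hat{\vec{p}} = \arg\min_{\vec{p}} \|\vec{u} - L\vec{p}\|^2 = (L^T L)^{-1} L^T \vec{u},$$

so the AU (or FACS-like) subspace acts as the constraint that regularizes the noisy flow.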
However, FACS was originally proposed for psychology research and does not provide
quantitative information about facial deformations. To utilize FACS, researchers need to
manually design the parameters of their model to obtain the AUs. In [44], a parametric
geometrical face model, called Candide, was used. The Candide model contains a set of
parameters for controlling facial shape. In [75], Tao and Huang used a piecewise Bezier
volume deformable face model, which can be deformed by changing the coordinates of
the control vertices of the Bezier volumes. In [25], Essa and Pentland extended a mesh
face model, which was developed by Platt and Badler [68], into a topologically invariant
physics-based model by adding anatomy-based "muscles," which are defined by FACS.
Overall, the basic problem still lies in the facial deformation control models, which ei-
ther decide how to change the values of parameters or are used to design AUs. If the con-
trol model is used just for animation, only the animation results will be affected. The
tracking results will be greatly degraded if either inaccurate control models are used by
the tracking step or the animation results are fed back to the tracking step. Of course, cor-
rupted tracking results will further result in bad animation results.
• MPEG-4 FAP-based B-spline surface
Eisert et al. [22] proposed a model-based analysis-synthesis loop to estimate face and fa-
cial motions. The head model is assumed to be a B-spline surface. The shape of the head
model is determined by 231 control points of the B-spline surface. The head model is tex-
ture mapped. They represent the motion and deformation of the 3-D head model by the
facial animation parameters (FAPs) based on the MPEG-4 standard. An intensity gradi-
ent-based approach that exploits the 3-D head model information is used to estimate the
FAPs directly. The problem is also formulated as a least square model fitting problem
given (1) the possible deformations in the head model that can be controlled by the FAPs
and (2) the constraints on the changes in the FAPs between two successive frames. How-
ever, it is an art to design a B-spline surface model so that it can well represent a facial
surface. The problems mainly arise on the number of the control vertices and the loca-
tions of the control points.
Basically, the underlying idea of the approach in [22] is the same as that of the approaches
described in [25], [44], [75]. They all form a model-based analysis-synthesis loop except
that their models appear in different forms. Therefore, Eisert et al.’s approach [22] should
face the same problem confronted by the approaches proposed in [25], [44], [75].
• 3D model learned from real data
Some approaches train their 3D models using a set of labeled real facial deformation data
[3], [71]. The approaches presented in [3] and [71] only deal with lips. The trained 3D
model is able to encode the information of real lip deformations. Color classes for the lips
and face are trained to estimate the class probability of each pixel. The tracking problem
is formulated as finding the lip shape within the subspace that maximizes the posterior
probability of the model given the observed color features of lips and facial skin. The
methods used for estimating the parameters in [3] and [71] are variants of gradient
ascent.
1.3 The Approach – An Integrated Framework for Face Modeling, Facial Motion Analysis, and Synthesis
Given such a significant amount of previous research on face modeling, facial motion
analysis, and synthesis, it remains unclear which aspects are most important for achieving
realistic face animation. This thesis advocates that the research on face modeling, fa-
cial motion analysis, and synthesis should be carried out systematically. Both the facial de-
formation model and the coarticulation model should be based on extensive analysis of
real facial movement data. A framework is needed to guide this research. The major
contribution of this thesis is to present an integrated framework (Figure 1.1) for face model-
ing, facial motion analysis, and synthesis.
A set of MUs is used as the quantitative visual representation of facial deformations. The
same visual representation is used both by face animation and facial motion analysis.
MUs are learned from a set of labeled real facial shapes and are used for face modeling.
Arbitrary facial deformation can be approximated by a linear combination of MUs, which
are weighted by MU parameters (MUPs). We can animate the face model by adjusting
MUPs. It will be shown that the MU-based face animation technique is compatible with
existing popular face animation techniques/standards, such as key frame techniques and
MPEG-4 FAPs.
Within this framework, a MU-based facial motion tracking algorithm is presented. MUs
are used as the high-level knowledge by the tracking algorithm to attain robust facial mo-
tion analysis results. The tracking results are represented as an MUP sequence. The
tracking results can be used directly for face animation or for other training and
recognition purposes.
A set of facial motion tracking results and the corresponding audio tracks are collected as
the audio-visual database. Machine learning techniques are applied to training two real-
time speech-driven face animation algorithms using the collected audio-visual database.
The algorithms map audio features to MUPs, which are used to animate face models via
MU-compatible face animation techniques.
In the following three chapters, the details of each part of the framework are presented
and discussed. Experimental results are shown.
Figure 1.1 An integrated framework for face modeling, facial motion analysis, and synthesis.
[Figure 1.1 block diagram. Block labels: labeled facial deformations; learn facial deformations; Motion Units; face image sequence; MU-based facial motion analysis; MUP sequence; video database; speech stream; train speech-to-MUP mapping; real-time speech-to-MUP mapping; new speech stream; convert speech to MUPs; MU-based face animation; graphic face animation sequence with texture.]
CHAPTER 2
2 MOTION UNITS AND FACE ANIMATION
The framework requires a basic information unit to establish information flow and link
all its active components together.1 This basic information unit is the visual representa-
tion of facial shape and deformation, which has been an important issue since the emer-
gence of computer face animation. The visual representation should be suitable for com-
putation and have sound representation power.
This chapter presents the MU as the quantitative visual representation. MU is inspired by
the AU of the FACS [23]. The main difference is that MUs are learned from real facial
deformation data and encode the characteristics of real facial deformations. Therefore
MUs are more suitable for computing purposes and synthesizing natural facial move-
ments. Currently, most existing facial surface models are mesh models. Therefore, the
appearance of the MU is set as a mesh in this thesis. MUs directly model the facial surface
without using any intermediate control model.
2.1 Collect Training Data for Learning Motion Units
We mark sixty-two points around the subject’s lower face (Figure 2.1(a)). The number of
the markers affects the representation capacity of the MU. More markers will enable MU
to encode more information. Depending on the need and context of the system, the user
can flexibly decide the number of the markers while still following the guidance provided
by this framework. The only guideline is to put more markers in the areas where muscle
distributions are more complicated, such as lips.
1 The appearance of the basic information unit or the way to calculate it could be different as long as it is qualified for the information flow connecting the components of the framework.
Currently, only 2D motion of the lower face is considered. This decision balances
effectiveness against the time required to develop the demonstration of the framework.
Firstly, the lower face accounts for the most complicated part of the facial movements.
This is evidenced by the anatomy of facial muscles (Figure 2.2) and the underlying struc-
ture of the skull. The configuration of the upper facial muscles (forehead and eyelids) is
much simpler than that of the lower facial muscles. Therefore, the upper face can only de-
form in a simpler way. The natural movements of the eyelids include only opening and
closing. The only movable part of the skull is the jaw, whose movements affect only the
deformations of the lower facial surface. Therefore, the lower face can have more com-
plicated deformations than the upper face.
Secondly, as far as facial motion during speech is concerned, the movement of the
lower face and that of the upper face are independent. If the expressions are considered,
the training data should cover the movements of the upper face. If the face model to be
animated is always facing the user without turning around, 3D deformation of the face
model will not be an issue. Even when 3D deformation must be considered, it will be
shown in Section 2.4 that 2D MUs can be used to infer the deformation along the third
axis.
Figure 2.1 An example of the labeled data and the mesh model.
(a) Markers (b) Mesh
Thirdly, the facial motions in the lower face have little influence on the facial motions in
the upper face and vice versa [23]. Hence, we can treat them separately.
Future work will use more markers and will cover 3D motion of the whole face, following
the same framework. It should be emphasized again that the contribution of this framework
is to provide a systematic guideline for building a face modeling and animation system; it
is the spirit of the framework that is meant to be transmitted and propagated.
A mesh model is created according to those markers (Figure 2.1(b)). The lines among
those vertices are just for visualization purposes now. Exploring the adjacent relations
among those points, which are represented by those lines, is an interesting research topic
that can be investigated in the future. This mesh model is further used in a facial motion
tracking algorithm which will be described in Chapter 3. The subject is asked to wear a
pair of glasses on which two additional points are marked. Since the glasses undergo only
rigid motion, those two points on the glasses can be used for data alignment.
Figure 2.2 Facial muscles.2
2 Source: http://predator.pnb.uconn.edu/beta/virtualtemp/muscle/Muscle-Anatomy-Pages/Anatomy-Pages/anatomy-facial.html
We attempt to include as great a variety of facial deformations as possible in the training
data and capture video of facial movements for pronouncing all English phonemes. The
video is digitized at 30 frames per second, which results in over 1000 samples. The
markers are automatically tracked by a zero-mean normalized cross-correlation template
matching technique [27]. A graphic interactive interface is developed for the user to cor-
rect the positions of trackers when the template matching fails due to large face or facial
movements. In that interface, each tracker corresponds to a vertex on the mesh. The user
can use a mouse to drag the vertices of the mesh and consequently change the positions
of the trackers. The tracking results are aligned by rotation, scaling, and translation so
that the two markers on the glasses are coincident for all the data samples.
2.2 Learning Motion Units
Principal component analysis (PCA) [34] has been extensively used to model the signifi-
cant characteristics of the samples [38], [51], [79]. In this work, PCA is also used to learn
a set of MUs that span the facial deformation space. Although lip shapes may differ from
person to person, we hope that the deformation space is more consistent.
A data sample in the training set $S = \{\vec{s}\}$ is represented as a vector $\vec{s} = [x_1, y_1, \ldots, x_n, y_n]^T$ ($n = 62$), which is formed by concatenating the coordinates of the markers after normalization. Let $\vec{s}_0$ be the neutral facial shape. The deformation vector of each data sample is calculated as $\vec{d}_i = \vec{s}_i - \vec{s}_0$, $i = 1, \ldots, P$, where $P$ is the size of the training data set. In this way, we obtain the deformation vector set of the training data set as $D = \{\vec{d}_1, \ldots, \vec{d}_P\}$.

The mean and the covariance matrix of $D$ are calculated by $\vec{m}_0 = E[\vec{d}_i]$ and $\Sigma = E[(\vec{d}_i - \vec{m}_0)(\vec{d}_i - \vec{m}_0)^T]$. The eigenvectors and eigenvalues of $\Sigma$ are calculated. The first $K$ (in our case, $K = 7$) significant eigenvectors $\vec{m}_1, \vec{m}_2, \ldots, \vec{m}_K$, which correspond to the $K$ largest eigenvalues, are selected. They account for 97.56% of the facial deformation variation in the training data set. More eigenvectors could be chosen; however, the representation power of the chosen eigenvectors, in terms of the preserved facial deformation variation, increases little once the number of chosen eigenvectors exceeds 7.

We call $\{\vec{m}_0, \vec{m}_1, \ldots, \vec{m}_K\}$ the MU set. The chosen MUs are illustrated in Figure 2.3. Each mesh in Figure 2.3 is derived by $\vec{s}_0 + \rho \vec{m}_i$ ($\rho = 25$). They respectively represent the mean deformation and local deformations around the lips, mouth corners, and cheeks.

Any facial shape $\vec{s}$ and the corresponding deformation vector $\vec{d}$ can be represented by

$$\vec{d} = \vec{m}_0 + \sum_{i=1}^{K} c_i \vec{m}_i, \qquad \vec{s} = \vec{s}_0 + \vec{d} \qquad (2.1)$$

where $\{c_i\}$ is the MU parameter set and $c_i = (\vec{s} - \vec{s}_0 - \vec{m}_0)^T \vec{m}_i$, $i = 1, \ldots, K$. By adjusting the $c_i$, we can obtain different facial shapes in the space defined by the MUs.
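The computation described above can be summarized in a short numerical sketch (a minimal illustration using NumPy; the array names `shapes` and `s0` are hypothetical and this is not the thesis's actual implementation):

```python
import numpy as np

def learn_motion_units(shapes, s0, K=7):
    """shapes: P x 2n array of aligned marker coordinates, one facial shape per row.
    s0: length-2n neutral facial shape.
    Returns the mean deformation m0 (MU 0) and a K x 2n matrix M of MUs 1..K."""
    D = shapes - s0                          # deformation vectors d_i = s_i - s_0
    m0 = D.mean(axis=0)                      # mean deformation
    cov = np.cov(D, rowvar=False)            # covariance matrix of the deformations
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigen-decomposition (ascending eigenvalues)
    top = np.argsort(eigvals)[::-1][:K]      # indices of the K largest eigenvalues
    return m0, eigvecs[:, top].T             # rows of the returned matrix are the MUs

def shape_from_mups(s0, m0, M, c):
    """Equation (2.1): s = s_0 + m_0 + sum_i c_i m_i."""
    return s0 + m0 + M.T @ c

def mups_from_shape(s, s0, m0, M):
    """c_i = (s - s_0 - m_0)^T m_i, using the orthonormality of the MUs."""
    return M @ (s - s0 - m0)
```

Because the eigenvectors returned by PCA are orthonormal, computing the MUPs of a new shape reduces to a single matrix-vector product, and reconstructing a shape from its MUPs is equally cheap.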
MU is related to the eigen-model in our previous work [30], [31], the ASM [17], AAM
[18], eigenlips [9], and eigen-points [19]. All the above visual representations are learned
using PCA and are applied to modeling facial features. The shared underlying assumption
is that the distribution of facial (or facial feature) deformations/shapes or appearance is
Gaussian. Any instance can be approximated by a linear combination of some bases
learned by PCA. In [30] and [31], the PCA learning results are further used to synthesize
new mouth movement sequences. Table 2.1 lists the properties of MU, ASM, AAM, ei-
genlips, and eigen-points. MU does not model the facial appearance. Modeling appear-
ance is very difficult. In order to handle various lighting conditions and races, the texture
part of the training data should cover a broad enough set of samples, which requires
intensive manual work. Moreover, such a large texture training database would be beyond the
modeling capacity of PCA because PCA assumes the training data are Gaussian.
Figure 2.3 MUs: (a) MU 0, (b) MU 1, (c) MU 2, (d) MU 3, (e) MU 4, (f) MU 5, (g) MU 6, (h) MU 7.
Table 2.1 MU, ASM, AAM, eigenlips and eigen-points.

Learned from real data:
   MU: Yes; eigenlips: Yes; eigen-points: Yes; ASM: Yes; AAM: Yes.
Modeling the shape of the object:
   MU: Yes; eigenlips: No; eigen-points: Yes; ASM: Yes; AAM: Yes.
Geometric appearance of the model:
   MU: triangular mesh; eigenlips: Snake curves (contours of the lips); eigen-points: point cloud; ASM: curves consisting of landmark points around the contours of facial features; AAM: same as ASM.
Modeling the appearance of the object:
   MU: No; eigenlips: models the whole lips; eigen-points: models the texture in small regions around each feature point; ASM: models the texture in small regions around each landmark point; AAM: models the global appearance.
Modeling the joint variation of the appearance and the shape:
   MU: No; eigenlips: No; eigen-points: Yes; ASM: No; AAM: No.
MU is also related to the lip models reported in [2] and [71]. The lip model in [71] has 30 control points; the remaining vertices are generated by cubic interpolation curves, which introduce artifacts. Moreover, the training data in [71] are collected by manually and subjectively adjusting the control points. The training data set of MU and that in [2] provide ground-truth information because they are collected by putting markers on the face. However, [2] uses a complicated physics-based control model that increases the computational complexity. As we will show in the rest of this chapter, it is efficient and sufficient to directly model facial deformation for animation purposes.
In fact, MUs can be used to deform the control points of facial deformation control models by designing MUs to include elements that correspond to those control points. This can easily be achieved by making the marker set on the subject's face cover those control points while the training data of MUs are collected. An advantage of MU is that the way MUs are calculated exploits the correlations among markers. Since MUs will be used in facial motion tracking later, it is better to let MUs have more points than the number of control points, because the number of control points on a face model is usually too small. It is difficult to achieve robust facial motion analysis by tracking only a small number of points, because not all points can be tracked independently and accurately enough. This will be illustrated by the MU-based facial motion tracking algorithm described in Chapter 3.
2.3 Using MUs to Animate the Face Model
MUs have many good properties. First, MUs are learned from real data and encode the characteristics of facial deformations. Second, compared to the number of vertices on the face model, the number of MUs is much smaller. Since any 2D facial deformation can be represented by a linear weighted combination of MUs, we only need to adjust a few parameters in order to animate the face model. This dramatically reduces the complexity of face animation and makes MUs especially suitable for the facial motion tracking algorithm, which will be described in Chapter 3. In addition, MUs are orthogonal to each other. Therefore, it is computationally efficient to calculate MUPs for any facial deformation.
It will be shown below that the MU-based face animation technique is compatible with many existing face animation techniques. This is very important from both the academic research point of view and the industrial point of view. Key-frame techniques3 and the MPEG-4 face animation standard are widely used in existing face animation systems. We will show that only simple matrix operations are required to achieve (1) the conversions between MUPs and key-frame parameters, and (2) the conversions between MUPs and the MPEG-4 FAPs. The techniques that will be presented in Sections 2.3.1 and 2.3.2 enable users to clone facial motion while using different face models.
3 Only those key-frame techniques that use a linear combination of key frames are considered here.
The major advantage of MUs over key frames and MPEG-4 FAPs is that MUs are grounded in real facial movements. More precisely, the spatial and temporal characteristics of real face animation can be encoded in MUs and MUP sequences, which are derived from real facial movements. Key frames do contain detailed spatial facial deformation information; however, there is no theoretical guidance for temporally adjusting the key-frame parameters to achieve natural face animation. The MPEG-4 face animation standard has the same problem.
2.3.1 MU and key frame
Key-frame approaches animate face models by linearly combining a set of key frames, say $\alpha$ key frames. Without loss of generality, we assume that each key frame represents a facial deformation. We can select a set of facial shapes $\{\vec{k}_1, \ldots, \vec{k}_\alpha\}$ in the training data set of MUs, so that there is a correspondence between $\{\vec{k}_1, \ldots, \vec{k}_\alpha\}$ and the key frames. Since those key frames usually correspond to a set of meaningful facial shapes (e.g., laughing, smiling, visemes, and so on), it is easy to choose $\{\vec{k}_1, \ldots, \vec{k}_\alpha\}$ from the training set of MUs. A facial deformation $\vec{d}$ in the animation sequence can be represented as $\sum_{i=1}^{\alpha} a_i \vec{k}_i$, and $\vec{d}$ can also be represented by MUs as $\sum_{i=1}^{K} c_i \vec{m}_i$. The conversion between $c_i$ and $a_i$ can be easily achieved by

$$[c_1 \cdots c_K]^T = [\vec{m}_1 \cdots \vec{m}_K]^T [\vec{k}_1 \cdots \vec{k}_\alpha] [a_1 \cdots a_\alpha]^T \tag{2.2}$$

and

$$[a_1 \cdots a_\alpha]^T = (\Gamma^T \Gamma)^{-1} \Gamma^T [\vec{m}_1 \cdots \vec{m}_K] [c_1 \cdots c_K]^T, \quad \text{where } \Gamma = [\vec{k}_1 \cdots \vec{k}_\alpha]. \tag{2.3}$$
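A minimal sketch of the conversions in Eqs. (2.2) and (2.3), assuming the MUs and the key-frame deformations are stored as matrix columns; it is not the iFACE implementation, only an illustration of the matrix operations involved.

```python
import numpy as np

def keyframe_to_mup(a, M, Gamma):
    """Eq. (2.2): c = M^T Gamma a.

    M     : (2n, K) columns are the MUs m_1..m_K.
    Gamma : (2n, alpha) columns are key-frame deformations k_1..k_alpha.
    a     : (alpha,) key-frame weights.
    """
    return M.T @ (Gamma @ a)

def mup_to_keyframe(c, M, Gamma):
    """Eq. (2.3): a = (Gamma^T Gamma)^(-1) Gamma^T M c, i.e. a least-squares fit
    of the MU-generated deformation by the key frames."""
    d = M @ c                                   # deformation expressed by the MUs
    a, *_ = np.linalg.lstsq(Gamma, d, rcond=None)
    return a
```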
Our iFACE system [32], which will be presented in Chapter 5, was first developed to use the key-frame technique to animate the face model because of the simplicity of the key-frame technique. Equation (2.3) enables us to easily modify the iFACE system to adopt MUs for face animation.
2.3.2 MU and MPEG-4 FAP
The MPEG-4 standard defines 68 FAPs. Among them, two FAPs are high-level parame-
ters (viseme and expression), and the others are low-level parameters that describe the
movements of facial features (see Figure 2.4) defined over jaw, lips, eyes, mouth, nose,
cheek, ears, and so on [92].
Figure 2.4 MPEG-4 feature points.
The movement represented by each FAP is defined with respect to a neutral face and is
expressed in terms of the FAP units (FAPUs) (Figure 2.5). The FAPUs correspond to
fractions of the distances between a set of salient face features, such as eye separation, mouth-nose separation, etc. These units are defined in order to allow a consistent interpretation of FAPs on any face model.
Figure 2.5 The facial animation parameter units.
The high-level parameters of MPEG-4 FAP describe visemes and expressions, but do not
provide temporal information. The low-level parameters of MPEG-4 FAP can represent
the temporal information by the change of values of the parameters. However, they only
describe the movements of 66 facial features and lack detailed spatial information to
animate the whole face model. Most systems use some form of interpolation function to animate the rest of the face model.
MUs are learned from real facial movements. The advantages of MUs over MPEG-4 FAP
include the following: (1) MUs encode detailed spatial information for animating a face
model, and (2) real facial movements can be easily encoded as MUPs using Eq. (2.1) for
animating face models.
If the values of MUPs are known, the facial deformation can be calculated. Consequently,
the movements of facial features used by MPEG-4 FAPs can be calculated. It is then
straightforward to calculate the values of MPEG-4 FAPs. On the other hand, if the values
of MPEG-4 FAPs are known, we can calculate MUPs in the following way. First, the movements of the facial features are calculated. The concatenation of the facial feature movements forms a vector $\vec{p}$. Then, we form a set of vectors $\{\vec{f}_1, \vec{f}_2, \ldots, \vec{f}_K\}$ by extracting the elements corresponding to those facial features from the MUs $\{\vec{m}_1, \ldots, \vec{m}_K\}$; let $\vec{f}_0$ denote the elements extracted in the same way from $\vec{m}_0$. The vector elements of $\{\vec{f}_1, \vec{f}_2, \ldots, \vec{f}_K\}$ and those of $\vec{p}$ are arranged so that the information of the facial features is represented in the same order in the vectors. The MUPs can then be calculated by

$$[c_1 \cdots c_K]^T = (F^T F)^{-1} F^T (\vec{p} - \vec{f}_0), \quad \text{where } F = [\vec{f}_1 \cdots \vec{f}_K]. \tag{2.4}$$
The markers must include those facial feature points used by MPEG-4 FAP to enable the
conversion between MUPs and FAPs. The facial movement defined by the MPEG-4 FAPs must be a valid, natural facial deformation, because MUs are learned from natural facial movements. If the user intends to use MPEG-4 FAPs to describe exaggerated facial deformations, MUPs can still be obtained in the least-squares sense; however, the facial shape reconstructed using MUs and MUPs may exhibit undesired artifacts.
2.4 Discussions
Equations (2.2), (2.3), and (2.4) are independent of the dimension of the MUs. They can be applied to 2D or 3D MUs. Interestingly, 2D MUs can be used to infer 3D facial deformation using Eq. (2.3). The conversion between MUPs and key-frame parameters is based on high-level concept correspondences; more exactly, the correspondence is established at the level of the whole face rather than at the level of individual facial points. Though only 2D information is used in Eq. (2.3), the results enable 3D facial deformation if the key frames are 3D, because the facial deformation is finally expressed as a weighted combination of the key frames. In other words, Eq. (2.3) provides a way to infer 3D facial deformation information from 2D information.
Note that the conversions defined by Eqs. (2.2), (2.3), and (2.4) are not lossless. Instead, they try to preserve as much information as possible with respect to the mean square error between the facial shape under the original representation and the one under the new representation to which the original representation is converted. Since MUs are designed to preserve the variances of real facial deformations as much as possible, the overall information preserved by using MUs as the representation should be the closest to the real facial deformations. This is also the reason MUs, rather than key frames or MPEG-4 FAPs, are used as the quantitative visual representation for facial deformations in this thesis. However, not all the shapes of MUs are visually meaningful to human beings; therefore, it might be difficult for some users to use MUs directly. Such users can start with key frames, which are more visually meaningful, and use the above conversion techniques while building the face modeling and animation system.
CHAPTER 3
3 MU-BASED FACIAL MOTION TRACKING
In this chapter, we present and discuss the MU-based facial motion tracking algorithm.
As stated earlier in this thesis, the tracking results can be used directly for face animation or for training speech-driven face animation. The MU-based facial motion tracking algorithm covers the cheeks, which is very important for synthesizing visual speech.
3.1 Model Initialization
The MU-based tracking algorithm requires that the face be in its neutral position in the
first image frame. The tracking algorithm uses a generic mesh, which is the one shown in Figure 2.1(b) but without the two points on the glasses. Choosing this mesh makes it possible for the tracking algorithm to utilize the MUs learned in Chapter 2. The generic mesh has two vertices representing the two mouth corners. The user manually selects the two mouth corners in the image, and the generic mesh model is fitted to the face by scaling and rotation so that the mouth corner vertices of the mesh coincide with the selected points. An example of mouth corner selection and mesh model fitting is shown in Figure 3.1.
Figure 3.1 Model initialization for tracking.
(a) Select two mouth corners (b) Fitting results
3.2 Tracking as a Weighted Least Square Fitting Problem
The MU-based facial motion tracking algorithm consists of two steps. The first step is a
low-level image processing step, which conducts modelless tracking for the vertices of
the mesh. The potential locations of the feature points in the next image are calculated
separately by zero-mean normalized cross-correlation template matching [27]. Template
matching techniques can handle gradual changes of lighting. However, the template
matching results are usually noisy. In the second step, the high-level knowledge encoded
in MUs is added to constrain the results calculated in the first step. The template match-
ing results are converted into MUPs and global face motion parameters, which are high-
level control information and can be used directly for MU-compatible face animation.
The MU-based tracking algorithm is a 2D facial motion tracking algorithm because it
uses a 2D MU model. The algorithm further assumes that the geometric imaging model of the face in the camera is an affine model. This assumption holds when the following two conditions are fulfilled: (a) the distance between the camera and the face is much larger than the depth of the face (the size of the face along the viewing direction of the camera); and (b) the face does not undergo large rotation out of the image plane.
3.2.1 Modelless tracking
First, the template matching step tracks vertex i of the mesh by tracking its corresponding facial point according to the coordinates of vertex i. The facial point is also denoted as point i for convenience. The term $\Gamma(s_i^{t-1}, s_{i,j}^{t})$ is defined as the zero-mean normalized cross-correlation operator, where $s_i^{t-1}$ is the template of facial point i in the image frame at time t-1, facial point j is one of the candidate points of point i in the image frame at time t, and $s_{i,j}^{t}$ is the template of point j. According to the definition of zero-mean normalized cross-correlation, we have $-1 \le \Gamma(s_i^{t-1}, s_{i,j}^{t}) \le 1$. The similarity between point i and point j is defined as

$$\varpi(i, j) = \exp\bigl(\Gamma(s_i^{t-1}, s_{i,j}^{t}) - 1\bigr) \tag{3.1}$$
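A small sketch of the similarity measure of Eq. (3.1), assuming the templates are grayscale patches stored as NumPy arrays; the helper names are hypothetical.

```python
import numpy as np

def zncc(patch_a, patch_b):
    """Zero-mean normalized cross-correlation of two equally sized patches."""
    a = patch_a.astype(float) - patch_a.mean()
    b = patch_b.astype(float) - patch_b.mean()
    denom = np.sqrt((a * a).sum() * (b * b).sum())
    return float((a * b).sum() / denom) if denom > 0 else 0.0

def similarity(patch_a, patch_b):
    """Eq. (3.1): w = exp(Gamma(.,.) - 1), so w = 1 only for a perfect match."""
    return np.exp(zncc(patch_a, patch_b) - 1.0)
```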
The template matching step searches locally around point i in the next image frame. Usually, the search range is a w × h window centered at point i. The point i* that best matches point i under the criterion defined by Eq. (3.1) is selected, i.e., $i^* = \arg\max_j \varpi(i, j)$. The coordinate vector of point i* in the image plane at time t is denoted as $\vec{v}_i^{(t)} = [x_i^{(t)}, y_i^{(t)}]^T$. Correspondingly, the similarity between i* and i is denoted by $w_i^{(t)}$. The information that is fed into the next step consists of $\vec{v}_i^{(t)}$ and $w_i^{(t)}$.
3.2.2 Constrained by MUs
MUs are used to constrain the information obtained in the first step. Mathematically, the tracking problem can then be formulated as the minimization problem

$$(\vec{T}^{*(t)}, \vec{C}^{*(t)}) = \arg\min_{\vec{T},\vec{C}} \sum_{i=1}^{n} w_i^{(t)} \left\| \Psi(M\vec{C} + \vec{\zeta})_i - \vec{v}_i^{(t)} \right\|^2 = \arg\min_{\vec{T},\vec{C}} \sum_{i=1}^{n} w_i^{(t)} \left\| \begin{bmatrix} t_1 & t_2 \\ t_4 & t_5 \end{bmatrix} \begin{bmatrix} x_i^{(0)} + \sum_{p=0}^{K} c_p m_{p,i1} \\ y_i^{(0)} + \sum_{p=0}^{K} c_p m_{p,i2} \end{bmatrix} + \begin{bmatrix} t_3 \\ t_6 \end{bmatrix} - \begin{bmatrix} x_i^{(t)} \\ y_i^{(t)} \end{bmatrix} \right\|^2 \tag{3.2}$$
where:
(a) n is the number of vertices on the mesh model.
(b) $\Psi(\bullet)$ is the affine transformation function, whose parameter set $\vec{T} = [t_1, t_2, t_3, t_4, t_5, t_6]^T$ describes the global 2D rotation, scaling, and translation transformations of the face. $\Psi(\bullet)_i$ denotes the image coordinate vector of the ith vertex after being transformed by $\Psi(\bullet)$.
(c) $M = [\vec{m}_0\ \vec{m}_1 \cdots \vec{m}_K]$ and $\vec{m}_p = [m_{p,11}\ m_{p,12} \cdots m_{p,n1}\ m_{p,n2}]^T$ (p = 0, …, K). $[m_{p,i1}\ m_{p,i2}]^T$ represents the deformation characteristics of vertex i encoded in $\vec{m}_p$.
(d) $\vec{C} = [c_0\ c_1 \cdots c_K]^T$ is the MUP vector and $c_0, c_1, \ldots, c_K$ are the MUPs. Since $\vec{m}_0$ is the mean deformation, $c_0$ is a constant and is always equal to 1.
(e) $\vec{\zeta} = [x_1^{(0)}\ y_1^{(0)} \cdots x_n^{(0)}\ y_n^{(0)}]^T$ represents the concatenation of the coordinates of the vertices at their initial positions (the neutral position) in the image plane.
(f) $\Psi(M\vec{C} + \vec{\zeta})_i$ represents the plausible coordinate of vertex i in the manifold defined by the MUs with respect to $\vec{T}$ and $\vec{C}$.
The unknown parameter set consists of $\vec{T} = [t_1, t_2, t_3, t_4, t_5, t_6]^T$ and $\vec{C} = [c_1, \ldots, c_K]^T$. The intuition of Eq. (3.2) is to find a set of motion parameters that minimizes the weighted mean square error between the template matching results and the instance generated by the high-level knowledge model using the parameters $\vec{T}$ and $\vec{C}$.
After rearranging matrix elements, Eq. (3.2) can be rewritten as

$$\vec{q}^* = \arg\min_{\vec{q}} \left\| A\vec{q} - \vec{b} \right\|^2 = \arg\min_{\vec{q}} \left\| [A_0\ A_1 \cdots A_K\ W]\,\vec{q} - \vec{b} \right\|^2 \tag{3.3}$$

where the unknowns are collected into

$$\vec{q} = [t_1\ t_2\ t_4\ t_5\ \ c_1 t_1\ c_1 t_2\ c_1 t_4\ c_1 t_5\ \cdots\ c_K t_1\ c_K t_2\ c_K t_4\ c_K t_5\ \ t_3\ t_6]^T,$$

and, for each vertex i (i = 1, …, n), the two rows of the blocks are

$$A_0:\ \begin{bmatrix} w_i^{(t)}(x_i^{(0)}+m_{0,i1}) & w_i^{(t)}(y_i^{(0)}+m_{0,i2}) & 0 & 0 \\ 0 & 0 & w_i^{(t)}(x_i^{(0)}+m_{0,i1}) & w_i^{(t)}(y_i^{(0)}+m_{0,i2}) \end{bmatrix},$$

$$A_p:\ \begin{bmatrix} w_i^{(t)} m_{p,i1} & w_i^{(t)} m_{p,i2} & 0 & 0 \\ 0 & 0 & w_i^{(t)} m_{p,i1} & w_i^{(t)} m_{p,i2} \end{bmatrix} \quad (K \ge p \ge 1),$$

$$W:\ \begin{bmatrix} w_i^{(t)} & 0 \\ 0 & w_i^{(t)} \end{bmatrix}, \qquad \vec{b} = [w_1^{(t)} x_1^{(t)}\ \ w_1^{(t)} y_1^{(t)}\ \cdots\ w_n^{(t)} x_n^{(t)}\ \ w_n^{(t)} y_n^{(t)}]^T.$$

The least squares estimator can be used to solve for $\vec{q}$ in Eq. (3.3). It is easy to first recover $t_1, t_2, t_3, t_4, t_5, t_6$ from $\vec{q}$, and then calculate $c_1, \ldots, c_K$.
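The following sketch illustrates one update of the constrained tracking step, assembling the matrix of Eq. (3.3) as reconstructed above and solving it with a least squares estimator. It is a simplified illustration, not the thesis code; in particular, the way the MUPs are recovered from the bilinear terms of $\vec{q}$ (projection onto the recovered affine parameters) is one reasonable choice among several.

```python
import numpy as np

def track_step(v, w, s0, m0, mus):
    """One MU-constrained tracking update (Eqs. (3.2)-(3.3)).

    v   : (n, 2) template-matching results (x_i, y_i) at time t.
    w   : (n,)  similarity weights w_i.
    s0  : (n, 2) neutral vertex positions in the image plane.
    m0  : (n, 2) mean deformation, reshaped per vertex.
    mus : (K, n, 2) motion units, reshaped per vertex.
    Returns the affine parameters (t1..t6) and the MUPs c_1..c_K.
    """
    n, K = v.shape[0], mus.shape[0]
    base = s0 + m0                                # x_i^(0)+m_{0,i1}, y_i^(0)+m_{0,i2}
    A = np.zeros((2 * n, 4 * (K + 1) + 2))
    b = np.zeros(2 * n)
    for i in range(n):
        wx, wy = w[i] * base[i, 0], w[i] * base[i, 1]
        A[2 * i, 0:2] = [wx, wy]                  # block A0, x-row
        A[2 * i + 1, 2:4] = [wx, wy]              # block A0, y-row
        for p in range(K):                        # blocks A1..AK
            col = 4 * (p + 1)
            mx, my = w[i] * mus[p, i, 0], w[i] * mus[p, i, 1]
            A[2 * i, col:col + 2] = [mx, my]
            A[2 * i + 1, col + 2:col + 4] = [mx, my]
        A[2 * i, -2] = w[i]                       # block W multiplies [t3, t6]
        A[2 * i + 1, -1] = w[i]
        b[2 * i], b[2 * i + 1] = w[i] * v[i, 0], w[i] * v[i, 1]
    q, *_ = np.linalg.lstsq(A, b, rcond=None)
    t1, t2, t4, t5 = q[0:4]
    t3, t6 = q[-2:]
    # recover each c_p from its bilinear block, e.g. by projecting onto [t1,t2,t4,t5]
    tvec = np.array([t1, t2, t4, t5])
    c = np.array([q[4 * (p + 1):4 * (p + 2)] @ tvec / (tvec @ tvec)
                  for p in range(K)])
    return (t1, t2, t3, t4, t5, t6), c
```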
3.3 Improving the MU-based Facial Motion Tracking Algorithm
The facial points that correspond to the vertices of the mesh may not have good texture
properties for tracking. This kind of point is called a bad feature point. It is very difficult
to accurately track those bad points. If there are many bad points being tracked, the tem-
plate-matching step will generate highly corrupted information. High-quality tracking
results cannot be guaranteed if the highly corrupted information is fed into the second
step and is combined with the information provided by the high-level knowledge model.
The error will accumulate very fast and lead to losing track quickly.
A heuristic method is used to improve the algorithm. The purpose is to make the information calculated in the low-level image processing step more reliable. For each facial point being tracked, an image pixel with good texture properties is selected from a 3 × 3 window centered at the mesh vertex. The selected good image pixels are tracked across two consecutive frames. Since the selected good image pixel is very close to its vertex in the image (the maximum distance is one pixel), we can assume that the spatial relation between them remains unchanged across two consecutive frames. Thus, we assign the displacement of the good image pixel to its corresponding mesh vertex. The remainder of the calculation is exactly the same as in Eq. (3.2).
The Kanade-Lucas-Tomasi (KLT) feature tracker is used to select and track good image
pixels. KLT was originally proposed by Lucas and Kanade [46] and was further devel-
oped by Tomasi and Kanade [78]. Readers should refer to [72] for details. Good features
are selected by examining the minimum eigenvalue of each 2 × 2 gradient matrix. Image
pixels are tracked using a Newton-Raphson procedure to minimize the difference be-
tween two images. Multiresolution tracking allows for large displacements between im-
ages. The accuracy of the tracking is up to the subpixel level. Intel’s Microprocessor Re-
search Lab implements the KLT tracker and makes it publicly available in OpenCV.1
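A sketch of this heuristic using the modern OpenCV Python bindings rather than the original Intel OpenCV release referenced above: the pixel with the largest minimum eigenvalue of the gradient matrix is chosen near each vertex and tracked with pyramidal Lucas-Kanade, and its displacement is assigned to the vertex. The function names and window handling here are illustrative assumptions, not the thesis implementation.

```python
import cv2
import numpy as np

def pick_good_pixels(gray, vertices, win=3):
    """For each mesh vertex, pick the pixel inside a win x win window whose
    2x2 gradient matrix has the largest minimum eigenvalue (Shi-Tomasi score).
    Assumes gray is an 8-bit image and vertices lie away from the borders."""
    eig = cv2.cornerMinEigenVal(gray, 3)          # min-eigenvalue image
    half = win // 2
    picked = []
    for (x, y) in np.round(vertices).astype(int):
        region = eig[y - half:y + half + 1, x - half:x + half + 1]
        dy, dx = np.unravel_index(np.argmax(region), region.shape)
        picked.append((x - half + dx, y - half + dy))
    return np.float32(picked)

def vertex_displacements(prev_gray, next_gray, vertices):
    """Track the selected good pixels with pyramidal Lucas-Kanade and assign
    each pixel's displacement to its mesh vertex."""
    pts = pick_good_pixels(prev_gray, vertices).reshape(-1, 1, 2)
    nxt, status, err = cv2.calcOpticalFlowPyrLK(prev_gray, next_gray, pts, None)
    return (nxt - pts).reshape(-1, 2)             # fed into Eq. (3.2) as before
```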
3.4 Experimental Results2
In Figure 3.2, given the same initialization and image sequence, the performances of
three methods – the MU-based facial motion tracking algorithm, template matching using
zero-mean normalized correlation, and KLT tracker – are compared. The testing video is
captured using a Panasonic AG-7450 portable video cassette recorder and is digitized at
30 frames per second (fps). All three methods are implemented in C++ on Windows 2000
and run at 30+ fps. The tracking results are shown as the white mesh overlapping on the
face. The images are shown from left to right as time increases.
The images at the top, middle, and bottom rows of Figure 3.2 are the tracking results of
template matching, the KLT tracker, and MU-based tracking algorithms respectively. As
is illustrated, the error of template matching accumulates quickly and eventually makes
tracking fail. The KLT tracker works better while still losing track of some points, such
1 The software is found at http://www.intel.com/research/mrl/research/opencv/.
2 The video clips showing the tracking results can be found at: http://www.ifp.uiuc.edu/~hong/Research/mouth_tracking.htm.
as some points in the cheeks. The MU-based tracking algorithm works best because MUs
provide good constraints and make tracking more robust.
(a) The 60th frame (b) The 160th frame (c) The 226th frame (d) The 280th frame
Figure 3.2 Comparison of the tracking results on an unmarked face using the MU-based facial motion tracking algorithm, template matching, and the KLT tracker. Only the tracking results of some typical mouth shapes in the test sequence are shown. The top row shows the tracking results using template matching alone. The middle row shows the tracking results using the KLT tracker alone. The bottom row shows the tracking results of the MU-based facial motion tracking algorithm.
In the sequence shown in Figure 3.3, the face has global 3D motion, which can be noticed by using the glasses as a reference. The face also has small motion into and out of the image plane. Because the MU-based tracking algorithm assumes affine projection, it can handle this kind of case. However, handling large global 3D motion requires 3D MUs and a perspective projection assumption.
All three tracking methods are applied to a face image sequence, in which the face is
marked. The markers provide ground truth that can be used for comparison. The initializations of all three methods are the same. The vertices are manually and carefully placed at
the centers of the markers. The tracking results are compared in Figure 3.3 (see page 38).
The images in the top, middle and bottom rows of Figure 3.3 are the tracking results of
template matching, the KLT tracker, and the MU-based tracking algorithm respectively.
As the results show, even on the face with markers, which provide salient features, tem-
plate matching does not work well. This is because some templates change greatly and
suddenly while the facial surface deforms. The KLT tracker works much better, but still
loses track of some points and results in irregular local structures. This can be observed
by looking at the tracking results on the upper lip. Again, the MU-based tracking algo-
rithm works best because it uses MUs to adjust bad tracking results.
Besides being robust, the MU-based tracking algorithm decomposes the observed motion into global affine transformation parameters and MUPs, which can be used directly for face animation; this is beyond the capability of the template matching technique and the KLT tracker.
3.5 Discussions
So far, we have focused only on the lower face because the MUs are currently designed to cover the lower face only. The same method can be extended to track the whole face by expanding the MUs to cover the whole face. Moreover, the algorithm described in Section 3.2 and the proposed algorithms that will be described in Sections 3.6, 3.7, and 3.8 are very general and can be applied to other objects, for example, the human body. The MUs are learned from the training data of an individual. To achieve better generalization performance, MUs should be learned from the training data of multiple subjects of different age ranges and races.
If 3D MUs are available, we can modify the 2D MU-based tracking algorithm described
in Section 3.2 and get a 3D MU-based tracking algorithm which has the same calculation
procedure and similar forms of the equations. The new 3D MU-based facial motion track-
ing algorithm is theoretically described in Section 3.6. It can handle both the global 3D
motion of the face and the local 3D facial motions. The theory of a 3D MU-based track-
ing algorithm using multiple cameras is also developed and described in Section 3.7. Us-
ing multiple cameras will capture more information, which can be used to make the
tracking algorithm more robust.
Figure 3.3 Comparison of the tracking results on a marked face using the MU-based facial motion tracking algorithm, template matching, and the KLT tracker. Only some typical tracking results in the test sequence are shown. The top row shows the tracking results using the MU-based facial motion tracking algorithm. The middle row shows the tracking results using template matching alone. The bot-tom row shows the tracking results using the KLT tracker alone.
(a) The 36th frame (b) The 67th frame (c) The 108th frame (d) The 170th frame
The low-level image processing of our MU-based facial motion tracking is different from
B-spline based approaches [6], [12], [37], Snake based approaches [9], [36], [77], or de-
formable template approach [88]. While the approaches in [6], [9], [12], [36], [37], [77], and [88] rely on color segmentation, gradients, or edges, which are sensitive to lighting conditions and depend on the color properties of the subjects, we select good feature points for reliable tracking. Good feature points enable us to track the movements of
cheeks, where edges can hardly be found and color segmentation will fail. The high-level
knowledge of our approach is also different from theirs. While the approaches described
in [9] and [36] learn high-level knowledge from real data, other approaches [6], [12],
[37], [77], [88] define high-level knowledge subjectively.
The MU-based facial motion tracking algorithm only models shape information. ASM
[17], AAM [52], and eigen-points [19] model both shape and appearance. Appearance, as
another image cue, provides extra information. The question is how to model and use it.
To handle various lighting conditions, the texture part of the training data should cover
broad enough conditions. To collect training data for texture, the face of the subject can-
not be marked. Thus extra care has to be taken while collecting the training data for
ASM, AAM, and eigen-points, so that the landmark points selected in different image
frames are physically the same. If the correspondences cannot be guaranteed, the training
data is biased.
The low-level image processing method used in our approach is similar to those used in
[21], [22], [25], [44], and [75]. The approaches described in [21], [22], [25], [44], and [75] are 3D facial motion tracking algorithms, whereas the MU-based facial motion tracking algorithm is a 2D approach. However, the high-level knowledge used in [21], [22], [25], [44], and [75] is either hand-tuned or subjectively defined in the form of some functions, while MUs are learned from real facial deformation data. If 3D real facial deformation training data are available, we can extend our approach to 3D facial motion tracking.
The high-level knowledge used by Basu et al. [3] and that used in our approach are
learned from real data. There are four main differences between these two approaches.
The first difference is the facial area covered by the tracking algorithm. The MU-based tracking algorithm covers the cheeks, which are very important for both face animation and human speech perception, whereas the approach described in [3] only deals with the lips. Second, the low-level image processing steps are different: Basu et al. [3] use color information to distinguish the lips from other areas of facial skin, while we track feature points. Third, the approach proposed by Basu and Pentland tracks 3D lip motion; currently, MU is 2D, so MU-based facial motion tracking can only track 2D facial motion under the affine projection assumption. Finally, Basu et al. [3] use a complicated physics-based lip model, whereas we use a simple geometric mesh to directly model the face surface without any control model and thus have lower computational complexity.
3.6 3D MU-based Facial Motion Tracking
Assuming 3D MUs and an affine camera model [56], the mathematical representation of the tracking problem using one camera can be written as

$$(\vec{T}^{*(t)}, \vec{C}^{*(t)}) = \arg\min_{\vec{T},\vec{C}} \sum_{i=1}^{n} w_i^{(t)} \left\| \Psi(M\vec{C} + \vec{\zeta})_i - \vec{v}_i^{(t)} \right\|^2 = \arg\min_{\vec{T},\vec{C}} \sum_{i=1}^{n} w_i^{(t)} \left\| \begin{bmatrix} t_{11} & t_{12} & t_{13} \\ t_{21} & t_{22} & t_{23} \end{bmatrix} \begin{bmatrix} \varphi_{i1} + \sum_{p=0}^{K} c_p m_{p,i1} \\ \varphi_{i2} + \sum_{p=0}^{K} c_p m_{p,i2} \\ \varphi_{i3} + \sum_{p=0}^{K} c_p m_{p,i3} \end{bmatrix} + \begin{bmatrix} t_{14} \\ t_{24} \end{bmatrix} - \begin{bmatrix} x_i^{(t)} \\ y_i^{(t)} \end{bmatrix} \right\|^2 \tag{3.4}$$
where:
(a) n is the number of vertices on the mesh model.
(b) $\Psi(\bullet)$ is the projection function of the affine camera. Its parameters $t_{11}, t_{12}, t_{13}, t_{14}, t_{21}, t_{22}, t_{23}, t_{24}$ describe the global 3D rotation, scaling, and translation transformations of the face. $\Psi(\bullet)_i$ denotes the coordinate vector of the ith vertex after being transformed by $\Psi(\bullet)$.
(c) $M = [\vec{m}_0\ \vec{m}_1 \cdots \vec{m}_K]$ and $\vec{m}_p = [m_{p,11}\ m_{p,12}\ m_{p,13} \cdots m_{p,n1}\ m_{p,n2}\ m_{p,n3}]^T$ (p = 0, …, K). $[m_{p,i1}\ m_{p,i2}\ m_{p,i3}]^T$ represents the deformation characteristics of vertex i encoded in $\vec{m}_p$.
(d) $\vec{C} = [c_0\ c_1 \cdots c_K]^T$ is the MUP vector and $c_0, c_1, \ldots, c_K$ are the MUPs. Since $\vec{m}_0$ is the mean deformation, $c_0$ is a constant and is always equal to 1.
(e) $\vec{\zeta} = [\varphi_{11}\ \varphi_{12}\ \varphi_{13} \cdots \varphi_{n1}\ \varphi_{n2}\ \varphi_{n3}]^T$ represents the concatenation of the coordinates of the vertices at their initial positions (the neutral position) relative to the camera, and $[\varphi_{i1}\ \varphi_{i2}\ \varphi_{i3}]^T$ is the coordinate vector of vertex i at its neutral position. In contrast to the 2D MU-based facial motion tracking algorithm, 3D MU-based tracking requires that the initialization of the face model be done in 3D instead of in the 2D image plane. This can be done if the camera is calibrated and the actual size of the face is known.
(f) $\Psi(M\vec{C} + \vec{\zeta})_i$ represents the plausible coordinate of vertex i in the manifold defined by the MUs with respect to $\vec{T}$ and $\vec{C}$.
The unknown parameter set consists of $\vec{T} = [t_{11}\ t_{12}\ t_{13}\ t_{14}\ t_{21}\ t_{22}\ t_{23}\ t_{24}]^T$ and $\vec{C} = [c_1, \ldots, c_K]^T$.
Eq. (3.4) can be rewritten as

$$\vec{q}^* = \arg\min_{\vec{q}} \left\| A\vec{q} - \vec{b} \right\|^2 = \arg\min_{\vec{q}} \left\| [A_{01}\ A_{02}\ A_{11}\ A_{12} \cdots A_{K1}\ A_{K2}\ W]\,\vec{q} - \vec{b} \right\|^2 \tag{3.5}$$

where the unknowns are collected into $\vec{q} = [\vec{q}_0^T\ \vec{q}_1^T \cdots \vec{q}_K^T\ \vec{q}_{K+1}^T]^T$ with

$$\vec{q}_0 = [t_{11}\ t_{12}\ t_{13}\ t_{21}\ t_{22}\ t_{23}]^T, \quad \vec{q}_p = [c_p t_{11}\ c_p t_{12}\ c_p t_{13}\ c_p t_{21}\ c_p t_{22}\ c_p t_{23}]^T \ (K \ge p \ge 1), \quad \vec{q}_{K+1} = [t_{14}\ t_{24}]^T,$$

and, for each vertex i (i = 1, …, n), the two rows of the blocks are

$$A_{01}:\ \begin{bmatrix} w_i^{(t)}(\varphi_{i1}+m_{0,i1}) & w_i^{(t)}(\varphi_{i2}+m_{0,i2}) & w_i^{(t)}(\varphi_{i3}+m_{0,i3}) \\ 0 & 0 & 0 \end{bmatrix}, \quad A_{02}:\ \begin{bmatrix} 0 & 0 & 0 \\ w_i^{(t)}(\varphi_{i1}+m_{0,i1}) & w_i^{(t)}(\varphi_{i2}+m_{0,i2}) & w_i^{(t)}(\varphi_{i3}+m_{0,i3}) \end{bmatrix},$$

$$A_{p1}:\ \begin{bmatrix} w_i^{(t)} m_{p,i1} & w_i^{(t)} m_{p,i2} & w_i^{(t)} m_{p,i3} \\ 0 & 0 & 0 \end{bmatrix}, \quad A_{p2}:\ \begin{bmatrix} 0 & 0 & 0 \\ w_i^{(t)} m_{p,i1} & w_i^{(t)} m_{p,i2} & w_i^{(t)} m_{p,i3} \end{bmatrix} \quad (K \ge p \ge 1),$$

$$W:\ \begin{bmatrix} w_i^{(t)} & 0 \\ 0 & w_i^{(t)} \end{bmatrix}, \qquad \vec{b} = [w_1^{(t)} x_1^{(t)}\ \ w_1^{(t)} y_1^{(t)}\ \cdots\ w_n^{(t)} x_n^{(t)}\ \ w_n^{(t)} y_n^{(t)}]^T.$$
A least squares estimator can be used to solve Eq. (3.5) for $\vec{q}$. It is then easy to obtain $\vec{T} = [t_{11}\ t_{12}\ t_{13}\ t_{14}\ t_{21}\ t_{22}\ t_{23}\ t_{24}]^T$ and $\vec{C} = [c_1, \ldots, c_K]^T$ from $\vec{q}$. Given a calibrated camera, the pose of the face can be calculated from $\vec{T}$.
The forms of Eqs. (3.4) and (3.5) are similar to those of Eqs. (3.2) and (3.3). Therefore, the program written for the 2D MU-based tracking algorithm can be easily modified for the 3D MU-based tracking algorithm.
Though perspective projections give accurate models for a wide range of existing cam-
eras, the mapping from an object point to the image point is nonlinear. In order to make
the projection model more mathematically tractable, affine cameras are used. The affine
camera is a first-order approximation obtained from the Taylor expansion of the perspec-
tive camera model. If the affine camera model is used, the mapping from an object point to the image point is linear. The affine camera assumption works well when the size of the face is much smaller than the distance between the head and the camera. If the affine camera is calibrated, we can recover the true 3D facial motions.
3.7 3D MU-based Facial Motion Tracking Using Multiple Cameras
If multiple synchronized cameras are used, Eq. (3.4) can be easily modified to take ad-
vantage of the information captured by those cameras. The details are as follows.

$$(\vec{T}^{*(t)}, \vec{C}^{*(t)}) = \arg\min_{\vec{T},\vec{C}} \sum_{j=1}^{Q} \sum_{i=1}^{n} w_{j,i}^{(t)} \left\| \Psi_j(M\vec{C} + \vec{\zeta}_j)_i - \vec{v}_{j,i}^{(t)} \right\|^2 = \arg\min_{\vec{T},\vec{C}} \sum_{j=1}^{Q} \sum_{i=1}^{n} w_{j,i}^{(t)} \left\| \begin{bmatrix} t_{j,11} & t_{j,12} & t_{j,13} \\ t_{j,21} & t_{j,22} & t_{j,23} \end{bmatrix} \begin{bmatrix} \varphi_{j,i1} + \sum_{p=0}^{K} c_p m_{p,i1} \\ \varphi_{j,i2} + \sum_{p=0}^{K} c_p m_{p,i2} \\ \varphi_{j,i3} + \sum_{p=0}^{K} c_p m_{p,i3} \end{bmatrix} + \begin{bmatrix} t_{j,14} \\ t_{j,24} \end{bmatrix} - \begin{bmatrix} x_{j,i}^{(t)} \\ y_{j,i}^{(t)} \end{bmatrix} \right\|^2 \tag{3.6}$$
where:
(a) Q is the number of cameras, n is the number of vertices on the mesh model, and the index j denotes the camera.
(b) $\Psi_j(\bullet)$ is the projection function of affine camera j. Its parameters $t_{j,11}, t_{j,12}, t_{j,13}, t_{j,14}, t_{j,21}, t_{j,22}, t_{j,23}, t_{j,24}$ describe the global 3D rotation, scaling, and translation transformations of the face with respect to camera j. $\Psi_j(\bullet)_i$ denotes the coordinate vector of the ith vertex after being transformed by $\Psi_j(\bullet)$ with respect to camera j.
(c) $M = [\vec{m}_0\ \vec{m}_1 \cdots \vec{m}_K]$ and $\vec{m}_p = [m_{p,11}\ m_{p,12}\ m_{p,13} \cdots m_{p,n1}\ m_{p,n2}\ m_{p,n3}]^T$ (p = 0, …, K). $[m_{p,i1}\ m_{p,i2}\ m_{p,i3}]^T$ represents the deformation characteristics of vertex i encoded in $\vec{m}_p$.
(d) $\vec{C} = [c_0\ c_1 \cdots c_K]^T$ is the MUP vector and $c_0, c_1, \ldots, c_K$ are the MUPs. Since $\vec{m}_0$ is the mean deformation, $c_0$ is a constant and is always equal to 1.
(e) $\vec{\zeta}_j = [\varphi_{j,11}\ \varphi_{j,12}\ \varphi_{j,13} \cdots \varphi_{j,n1}\ \varphi_{j,n2}\ \varphi_{j,n3}]^T$ represents the concatenation of the coordinates of the vertices at their initial positions (the neutral position) relative to camera j, and $[\varphi_{j,i1}\ \varphi_{j,i2}\ \varphi_{j,i3}]^T$ is the coordinate vector of vertex i at its neutral position relative to camera j.
(f) $\Psi_j(M\vec{C} + \vec{\zeta}_j)_i$ represents the plausible coordinate of vertex i in the manifold defined by the MUs with respect to $\vec{T}$, $\vec{C}$, and camera j.
The unknown parameter set consists of $\vec{T} = [t_{1,11}\ t_{1,12}\ t_{1,13}\ t_{1,14}\ t_{1,21}\ t_{1,22}\ t_{1,23}\ t_{1,24} \cdots t_{Q,11}\ t_{Q,12}\ t_{Q,13}\ t_{Q,14}\ t_{Q,21}\ t_{Q,22}\ t_{Q,23}\ t_{Q,24}]^T$ and $\vec{C} = [c_1, \ldots, c_K]^T$.
The same rearrangement used for Eqs. (3.2) and (3.4) can be applied to Eq. (3.6), which can be rewritten as

$$\vec{q}^* = \arg\min_{\vec{q}} \left\| A\vec{q} - \vec{b} \right\|^2 = \arg\min_{\vec{q}} \left\| [A_0\ A_1 \cdots A_K\ W]\,\vec{q} - \vec{b} \right\|^2 \tag{3.7}$$

where:
(a) $A_0 = [A_{0,11}\ A_{0,12} \cdots A_{0,Q1}\ A_{0,Q2}]$ and $A_p = [A_{p,11}\ A_{p,12} \cdots A_{p,Q1}\ A_{p,Q2}]$ ($K \ge p \ge 1$). For camera j ($Q \ge j \ge 1$) and vertex i, the two rows of the blocks are

$$A_{0,j1}:\ \begin{bmatrix} w_{j,i}^{(t)}(\varphi_{j,i1}+m_{0,i1}) & w_{j,i}^{(t)}(\varphi_{j,i2}+m_{0,i2}) & w_{j,i}^{(t)}(\varphi_{j,i3}+m_{0,i3}) \\ 0 & 0 & 0 \end{bmatrix}, \quad A_{0,j2}:\ \begin{bmatrix} 0 & 0 & 0 \\ w_{j,i}^{(t)}(\varphi_{j,i1}+m_{0,i1}) & w_{j,i}^{(t)}(\varphi_{j,i2}+m_{0,i2}) & w_{j,i}^{(t)}(\varphi_{j,i3}+m_{0,i3}) \end{bmatrix},$$

$$A_{p,j1}:\ \begin{bmatrix} w_{j,i}^{(t)} m_{p,i1} & w_{j,i}^{(t)} m_{p,i2} & w_{j,i}^{(t)} m_{p,i3} \\ 0 & 0 & 0 \end{bmatrix}, \quad A_{p,j2}:\ \begin{bmatrix} 0 & 0 & 0 \\ w_{j,i}^{(t)} m_{p,i1} & w_{j,i}^{(t)} m_{p,i2} & w_{j,i}^{(t)} m_{p,i3} \end{bmatrix},$$

and the rows contributed by camera j multiply only the parameters of camera j (the remaining entries are zero).
(b) $W$ collects the weights that multiply the translation parameters; its two rows for camera j and vertex i are $[w_{j,i}^{(t)}\ \ 0]$ and $[0\ \ w_{j,i}^{(t)}]$ in the columns of $[t_{j,14}\ t_{j,24}]$ and zero elsewhere.
(c) $\vec{b} = [w_{1,1}^{(t)} x_{1,1}^{(t)}\ \ w_{1,1}^{(t)} y_{1,1}^{(t)} \cdots w_{Q,n}^{(t)} x_{Q,n}^{(t)}\ \ w_{Q,n}^{(t)} y_{Q,n}^{(t)}]^T$ is the concatenation of the weighted template matching results over all cameras and vertices.
(d) $\vec{q} = [\vec{q}_0^T\ \vec{q}_1^T \cdots \vec{q}_K^T\ \vec{q}_{K+1}^T]^T$ with
$$\vec{q}_0 = [t_{1,11}\ t_{1,12}\ t_{1,13}\ t_{1,21}\ t_{1,22}\ t_{1,23} \cdots t_{Q,11}\ t_{Q,12}\ t_{Q,13}\ t_{Q,21}\ t_{Q,22}\ t_{Q,23}]^T,$$
$$\vec{q}_p = [c_p t_{1,11}\ c_p t_{1,12}\ c_p t_{1,13}\ c_p t_{1,21}\ c_p t_{1,22}\ c_p t_{1,23} \cdots c_p t_{Q,11}\ c_p t_{Q,12}\ c_p t_{Q,13}\ c_p t_{Q,21}\ c_p t_{Q,22}\ c_p t_{Q,23}]^T \quad (K \ge p \ge 1),$$
$$\vec{q}_{K+1} = [t_{1,14}\ t_{1,24} \cdots t_{Q,14}\ t_{Q,24}]^T.$$

A least squares estimator can be used to solve Eq. (3.7) for $\vec{q}$; it is then easy to obtain $\vec{T}$ and $\vec{C}$ from $\vec{q}$.
Again, the forms of Eqs. (3.6) and (3.7) are similar to those of Eqs. (3.2) and (3.3). Therefore, the program written for the 2D MU-based tracking algorithm can also be modified for the 3D MU-based tracking algorithm using multiple cameras without major changes to its structure.
3.8 3D MU-BSV-based Facial Motion Tracking
The tracking algorithms presented in Sections 3.2, 3.6, and 3.7 require an accurate face model (or the facial shape in its neutral state). Here, a new algorithm with looser constraints is presented. It assumes that any face model $\vec{\varsigma}$ can be obtained by

$$\vec{\varsigma} = \vec{\zeta} + \sum_{e=1}^{E} h_e \vec{\gamma}_e \tag{3.8}$$

where:
(a) $\vec{\zeta} = [\varphi_{11}\ \varphi_{12}\ \varphi_{13} \cdots \varphi_{n1}\ \varphi_{n2}\ \varphi_{n3}]^T$ represents the concatenation of the coordinates of the vertices at their initial positions (the neutral position) relative to the camera, and $[\varphi_{i1}\ \varphi_{i2}\ \varphi_{i3}]^T$ is the coordinate vector of vertex i at its neutral position. In contrast to the previous two 3D MU-based facial motion tracking algorithms, $\vec{\zeta}$ in Eq. (3.8) is an initial guess obtained by warping a generic face model.
(b) The warped generic face model may not suit the face of the subject well. However, its shape can be adjusted using a set of basic shape variances (BSVs) of the face through $\sum_{e=1}^{E} h_e \vec{\gamma}_e$. $\{\vec{\gamma}_e\}_{e=1}^{E}$ is the set of BSVs, which can be learned from real 3D facial shape data, for example, by applying PCA to a set of 3D neutral face shapes (faces without deformations).
(c) The parameter set $\{h_e\}_{e=1}^{E}$ is unknown and is adjusted during tracking. Let $\vec{\gamma}_e = [r_{e,11}\ r_{e,12}\ r_{e,13} \cdots r_{e,n1}\ r_{e,n2}\ r_{e,n3}]^T$.
The new algorithm will use MUs and the BSVs. Therefore, it is called the 3D MU-BSV-
based facial motion tracking algorithm.
Eq. (3.4) can be modified to use the basic face shapes as follows:

$$(\vec{T}^{*(t)}, \vec{C}^{*(t)}, \vec{H}^{*(t)}) = \arg\min_{\vec{T},\vec{C},\vec{H}} \sum_{i=1}^{n} w_i^{(t)} \left\| \Psi\Bigl(M\vec{C} + \vec{\zeta} + \sum_{e=1}^{E} h_e\vec{\gamma}_e\Bigr)_i - \vec{v}_i^{(t)} \right\|^2 = \arg\min_{\vec{T},\vec{C},\vec{H}} \sum_{i=1}^{n} w_i^{(t)} \left\| \begin{bmatrix} t_{11} & t_{12} & t_{13} \\ t_{21} & t_{22} & t_{23} \end{bmatrix} \begin{bmatrix} \varphi_{i1} + \sum_{p=0}^{K} c_p m_{p,i1} + \sum_{e=1}^{E} h_e r_{e,i1} \\ \varphi_{i2} + \sum_{p=0}^{K} c_p m_{p,i2} + \sum_{e=1}^{E} h_e r_{e,i2} \\ \varphi_{i3} + \sum_{p=0}^{K} c_p m_{p,i3} + \sum_{e=1}^{E} h_e r_{e,i3} \end{bmatrix} + \begin{bmatrix} t_{14} \\ t_{24} \end{bmatrix} - \begin{bmatrix} x_i^{(t)} \\ y_i^{(t)} \end{bmatrix} \right\|^2 \tag{3.9}$$
The unknown parameter set consists of $\vec{C} = [c_1, \ldots, c_K]^T$, $\vec{H} = [h_1 \cdots h_E]^T$, and $\vec{T} = [t_{11}\ t_{12}\ t_{13}\ t_{14}\ t_{21}\ t_{22}\ t_{23}\ t_{24}]^T$.
Eq. (3.9) can be rewritten as

$$\vec{q}^* = \arg\min_{\vec{q}} \left\| A\vec{q} - \vec{b} \right\|^2 = \arg\min_{\vec{q}} \left\| [A_{01}\ A_{02}\ A_{11}\ A_{12} \cdots A_{K1}\ A_{K2}\ B_1 \cdots B_E\ W]\,\vec{q} - \vec{b} \right\|^2 \tag{3.10}$$

where the blocks $A_{01}$, $A_{02}$, $A_{p1}$, $A_{p2}$ ($K \ge p \ge 1$), $W$, and the vector $\vec{b}$ are defined exactly as in Eq. (3.5). The block $B_e$ ($E \ge e \ge 1$) collects the coefficients of the BSV terms; its two rows for vertex i are

$$B_e:\ \begin{bmatrix} w_i^{(t)} r_{e,i1} & w_i^{(t)} r_{e,i2} & w_i^{(t)} r_{e,i3} & 0 & 0 & 0 \\ 0 & 0 & 0 & w_i^{(t)} r_{e,i1} & w_i^{(t)} r_{e,i2} & w_i^{(t)} r_{e,i3} \end{bmatrix}.$$

The unknowns are collected into $\vec{q} = [\vec{q}_0^T\ \vec{q}_1^T \cdots \vec{q}_K^T\ \vec{q}_{K+1}^T \cdots \vec{q}_{K+E}^T\ \vec{q}_{K+E+1}^T]^T$ with

$$\vec{q}_0 = [t_{11}\ t_{12}\ t_{13}\ t_{21}\ t_{22}\ t_{23}]^T, \quad \vec{q}_p = [c_p t_{11}\ c_p t_{12}\ c_p t_{13}\ c_p t_{21}\ c_p t_{22}\ c_p t_{23}]^T \ (K \ge p \ge 1),$$
$$\vec{q}_{K+e} = [h_e t_{11}\ h_e t_{12}\ h_e t_{13}\ h_e t_{21}\ h_e t_{22}\ h_e t_{23}]^T \ (E \ge e \ge 1), \quad \vec{q}_{K+E+1} = [t_{14}\ t_{24}]^T.$$
A least squares estimator can be used to solve Eq. (3.10) for $\vec{q}$. It is then easy to obtain $\vec{T} = [t_{11}\ t_{12}\ t_{13}\ t_{14}\ t_{21}\ t_{22}\ t_{23}\ t_{24}]^T$, $\vec{H} = [h_1 \cdots h_E]^T$, and $\vec{C} = [c_1, \ldots, c_K]^T$ from $\vec{q}$. Given a calibrated camera, the pose of the face can be calculated from $\vec{T}$.
The 3D MU-BSV-based facial motion tracking algorithm can be easily generalized to use
multiple cameras by using the same method described in Section 3.7.
CHAPTER 4
4 MU-BASED REAL-TIME SPEECH-DRIVEN
FACE ANIMATION
The facial motion analysis and synthesis techniques described in previous chapters pave
the way to achieve MU-based real-time speech-driven face animation. The MU-based
facial tracking algorithm is used to analyze the facial motions of a speaking subject. The
analysis results and the synchronized soundtrack can be collected to train audio-to-visual
mappings. In this chapter, two audio-to-visual mappings are presented and evaluated.
One of them is a local linear mapping. The other is a nonlinear mapping using neural
networks. Both methods consider a certain length of speech context and have a constant, short time delay.
4.1 Linear Audio-to-Visual Mapping
Linear mapping [87] assumes that the information in one channel can be computed from that in another channel by an affine transformation. In the case of audio-to-visual mapping, it can be written as

$$\vec{v}_n - \vec{\mu}_v = T_{va}(\vec{a}_n - \vec{\mu}_a) + \vec{e} \tag{4.1}$$

where $\vec{v}_n$ is the visual feature vector at time n, $\vec{a}_n$ is the audio feature vector at time n, $\vec{\mu}_v$ and $\vec{\mu}_a$ are the mean vectors of the visual and audio features respectively, $T_{va}$ is the affine transformation, and $\vec{e}$ is the error term, which represents the part of $(\vec{v}_n - \vec{\mu}_v)$ that is not correlated with $(\vec{a}_n - \vec{\mu}_a)$. The transformation can be estimated as

$$T_{va} = E[(\vec{v}_n - \vec{\mu}_v)(\vec{a}_n - \vec{\mu}_a)^T]\,\bigl(E[(\vec{a}_n - \vec{\mu}_a)(\vec{a}_n - \vec{\mu}_a)^T]\bigr)^{-1} \tag{4.2}$$

The resulting estimate of $(\vec{v}_n - \vec{\mu}_v)$ given by Eq. (4.2) is a minimum variance unbiased estimator [89].
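A compact sketch of Eqs. (4.1) and (4.2) in Python/NumPy, assuming the audio and visual feature vectors are stacked as rows of two matrices; names are illustrative.

```python
import numpy as np

def fit_linear_mapping(A, V):
    """Estimate T_va of Eqs. (4.1)-(4.2).

    A : (N, da) audio feature vectors (rows).
    V : (N, dv) visual feature vectors (rows).
    Returns (T_va, mu_a, mu_v) so that v is approximated by mu_v + T_va (a - mu_a).
    """
    mu_a, mu_v = A.mean(axis=0), V.mean(axis=0)
    Ac, Vc = A - mu_a, V - mu_v
    cov_va = Vc.T @ Ac / len(A)                 # E[(v - mu_v)(a - mu_a)^T]
    cov_aa = Ac.T @ Ac / len(A)                 # E[(a - mu_a)(a - mu_a)^T]
    T_va = cov_va @ np.linalg.pinv(cov_aa)      # Eq. (4.2)
    return T_va, mu_a, mu_v

def predict_visual(a, T_va, mu_a, mu_v):
    """Eq. (4.1) without the error term."""
    return mu_v + T_va @ (a - mu_a)
```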
4.2 Local Linear Audio-to-Visual Mapping
Eq. (4.1) does not consider the contextual information of the audio. However, mouth coarticulation depends on the audio context. The length of the relevant context depends on the content of the audio and on the subject who speaks it, so it is difficult to decide on a particular value for the context length. A practical way to take contextual information into account is to replace $\vec{a}_n$ with

$$\vec{a}\,'_n = [\vec{a}_{n-\alpha}^T \cdots \vec{a}_{n-1}^T\ \ \vec{a}_n^T\ \ \vec{a}_{n+1}^T \cdots \vec{a}_{n+\beta}^T]^T \tag{4.3}$$

where $\vec{a}_n$ is the audio feature vector at time n, $\vec{a}_{n-\alpha}, \ldots, \vec{a}_{n-1}$ represent the audio history from time n-$\alpha$ to n-1, and $\vec{a}_{n+1}, \ldots, \vec{a}_{n+\beta}$ represent the audio look-ahead from time n+1 to n+$\beta$. $\alpha$ and $\beta$ are adjustable parameters.
Linear estimation is very computationally efficient. It is ideal for a system with limited
computational resources. However, audio-to-visual mapping is nonlinear in nature. The
performance of the global linear mapping defined by Eq. (4.1) is therefore limited. One way to improve it is to approximate the true audio-to-visual mapping by a set of linear mappings, i.e., a set of local linear mappings, each defined for a particular class of audio context.
As illustrated in Figure 4.1, the audio-visual training data is divided into 44 subsets ac-
cording to the audio feature of each sample. The audio features of each subset are mod-
eled by a Gaussian model. To divide the audio-visual training set, each audio-visual data
pair is classified into one of the 44 training subsets whose Gaussian model gives the
highest score for the audio component of the audio-visual data pair. Then, a linear audio-
to-visual mapping is calculated for each training subset using Eq. (4.1) and Eq. (4.3). The
reason that we choose 44 is based on a practical issue: our iFACE system uses a symbol
set that consists of 44 phonemes.
Figure 4.1 Local linear audio-to-visual mapping.
Since the variance of the data in each group is smaller than that of the whole training data set,
the complexity of the audio-to-visual mapping problem is dramatically reduced. Given a
new audio feature, we classify it into one of the classes using the trained Gaussian models
and select the corresponding linear mapping to estimate the visual feature vector.
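The following sketch outlines the local linear mapping, assuming class labels for the training samples are available (e.g., one of the 44 phoneme-like classes, obtained from a phoneme alignment); the Gaussian scoring is simplified to a Mahalanobis distance, and the names are illustrative rather than taken from the original implementation.

```python
import numpy as np

class LocalLinearMapping:
    """Sketch of the local linear audio-to-visual mapping: the training set is
    split into classes (44 in the thesis), each class gets a Gaussian model of
    its audio features and its own linear mapping of the form of Eq. (4.1)."""

    def fit(self, A, V, labels):
        # A: (N, da) contextual audio features, V: (N, dv) visual features (MUPs),
        # labels: (N,) class index of each training sample (assumed given).
        self.models = {}
        for k in np.unique(labels):
            Ak, Vk = A[labels == k], V[labels == k]
            mu_a, mu_v = Ak.mean(axis=0), Vk.mean(axis=0)
            Ac, Vc = Ak - mu_a, Vk - mu_v
            T = (Vc.T @ Ac) @ np.linalg.pinv(Ac.T @ Ac)      # Eq. (4.2)
            prec = np.linalg.pinv(np.cov(Ak, rowvar=False))  # Gaussian precision
            self.models[k] = (mu_a, mu_v, T, prec)
        return self

    def predict(self, a):
        # Pick the class whose Gaussian scores the audio feature best
        # (simplified here to the smallest Mahalanobis distance).
        def dist(k):
            mu_a, _, _, prec = self.models[k]
            return (a - mu_a) @ prec @ (a - mu_a)
        k = min(self.models, key=dist)
        mu_a, mu_v, T, _ = self.models[k]
        return mu_v + T @ (a - mu_a)                         # Eq. (4.1)
```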
4.3 Nonlinear Audio-to-Visual Mapping Using ANN
The mapping from audio features to visual features is nonlinear by nature. To achieve better estimation results, a nonlinear mapping should be used when enough computational resources are available. The nonlinear relation between audio features and visual features is complicated, and there is no known analytic expression for it. Multilayer perceptrons (MLPs), as universal nonlinear function approximators, are used to learn the nonlinear audio-to-visual mapping. In contrast to the approach in [50], the training data is divided into 44 subsets in the way described in Section 4.2, and a three-layer perceptron is trained on each training subset.
The structure of the MLP is shown in Figure 4.2. The input of the MLP is the audio feature vector taken at α+β+1 consecutive time frames (α backward, the current, and β forward time windows). The output of the MLP is a visual feature vector. The estimation procedure is similar to that of the local linear mapping, except that an MLP is selected instead of a linear mapping.
Figure 4.2 MLP for nonlinear audio-to-visual mapping.
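As an illustration only, the following sketch trains one small MLP per audio class using scikit-learn's MLPRegressor as a stand-in for the MATLAB Neural Network Toolbox used in the thesis; the hyperparameter values echo those reported in Section 4.4.2, and the function names are assumptions.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

def train_local_mlps(A_ctx, V, labels, hidden_units=25, max_iter=150):
    """Train one small MLP per audio class (44 in the thesis).

    A_ctx  : (N, (alpha+beta+1)*da) contextual audio features, Eq. (4.3).
    V      : (N, K) target visual features (MUPs).
    labels : (N,) class index of each sample.
    """
    mlps = {}
    for k in sorted(set(labels)):
        mlp = MLPRegressor(hidden_layer_sizes=(hidden_units,),
                           max_iter=max_iter, tol=1e-4)
        mlps[k] = mlp.fit(A_ctx[labels == k], V[labels == k])
    return mlps

def predict_mups(a_ctx, label, mlps):
    """Estimate the visual feature vector for one contextual audio frame."""
    return mlps[label].predict(a_ctx.reshape(1, -1))[0]
```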
4.4 Experimental Results
4.4.1 Collect training and testing data
A set of raw video data is collected by recording the front view of a speaking subject. A
set of markers that is the same as that shown in Figure 2.1 (see page 18) is put on the face
of the subject. Therefore, the ground truth can be extracted. The video is captured using a
Panasonic AG-7450 portable video cassette recorder. One hundred sentences are selected
from the text corpus of the DARPA TIMIT speech database. Both the audio and video are
digitized at 30 fps using Final Cut Pro software for Macintosh from Apple Computer In-
corporated. Overall, there are 19433 audio-visual training data samples. Eighty percent of
the data is used for training. The rest is used for testing.
For each speech segment, twelve LPC coefficients are calculated as the audio features.
MUs are used to explain facial deformations. Correspondingly, MUPs are used as the
visual features. The MU-based facial motion tracking algorithm, which is described in
Chapter 3, is used to analyze facial motions.1 This data set is used for training both the local linear audio-to-visual mapping and the nonlinear audio-to-visual mapping using ANN. The first five MUs are used.2 The normalized mean square error of the data reconstructed using the first five MUs is 0.0200.3
1 Although the face is marked, we cannot simply use the template matching technique or the KLT tracker to track the markers; the reason is shown by the experimental results in Section 3.4. The MU-based facial motion tracking algorithm is still required.
4.4.2 Implementation
A triangular averaging window is used to smooth the jerky mapping results of both the local linear audio-to-visual mapping and the nonlinear audio-to-visual mapping using ANN. The implementation of the MLP is provided by the Neural Network Toolbox of MATLAB 5.0 from The MathWorks. In the experiments, the maximum number of hidden units used in the MLPs is only 25; therefore, both training and estimation have very low computational complexity. The training of each MLP stops when either the maximum number of iterations (150) is reached or a preset mean square error threshold (0.005) is met.
The author also tried using a single MLP to handle all the training data. However, the training process took too long: it ran for three weeks on an SGI machine with 12 processors and 2 GB of memory, and the quality of the intermediate results was far from acceptable.
4.4.3 Evaluation
We reconstruct the displacement of the mesh vertices using MUs and the estimated
MUPs. The evaluations are performed on the ground truth of the displacements and the
reconstructed displacements. Two evaluation parameters, Pearson product-moment corre-
lation coefficient and the normalized mean square error, are calculated.
• Pearson product-moment correlation coefficient
The Pearson product-moment correlation coefficient between the ground truth and the estimated data is calculated by

$$R = \frac{tr\bigl(E[(\vec{d}_n - \vec{\mu}_1)(\vec{d}\,'_n - \vec{\mu}_2)^T]\bigr)}{\sqrt{tr\bigl(E[(\vec{d}_n - \vec{\mu}_1)(\vec{d}_n - \vec{\mu}_1)^T]\bigr)\; tr\bigl(E[(\vec{d}\,'_n - \vec{\mu}_2)(\vec{d}\,'_n - \vec{\mu}_2)^T]\bigr)}} \tag{4.4}$$

where $\vec{d}_n$ is the ground truth, $\vec{\mu}_1 = E(\vec{d}_n)$, $\vec{d}\,'_n$ is the estimation result, and $\vec{\mu}_2 = E(\vec{d}\,'_n)$.
2 More MUs can be used to achieve better results. However, more MUs cause higher computational complexity, especially for the nonlinear mapping using neural networks.
3 The displacement of each vertex is scaled to [-1.0, 1.0] by dividing it by the maximum displacement of the vertex.
The Pearson product-moment correlation measures how well the shapes of two signal sequences match globally. The larger the Pearson correlation coefficient, the better the estimated signal sequence matches the original signal sequence.
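A short sketch of the two evaluation measures of this section, assuming the ground-truth and estimated displacement sequences are stored as rows of two matrices; it follows the reconstruction of Eq. (4.4) above.

```python
import numpy as np

def pearson_and_mse(D_true, D_est):
    """Evaluation measures of Section 4.4.3 for displacement sequences.

    D_true, D_est : (N, 2n) ground-truth and estimated displacement vectors,
    already scaled per vertex to [-1, 1] as described in the text.
    """
    Ct = D_true - D_true.mean(axis=0)
    Ce = D_est - D_est.mean(axis=0)
    num = np.trace(Ct.T @ Ce / len(D_true))
    den = np.sqrt(np.trace(Ct.T @ Ct / len(D_true)) *
                  np.trace(Ce.T @ Ce / len(D_true)))
    pearson = num / den                          # Eq. (4.4)
    mse = np.mean((D_true - D_est) ** 2)         # normalized MSE
    return pearson, mse
```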
Table 4.1 shows the performance of the global linear mapping, the local linear mapping, and the local nonlinear mapping in terms of the Pearson coefficient. As shown, the local nonlinear mapping works best and the global linear mapping works worst.
Table 4.1 Real-time speech-driven evaluation I: Pearson correlation coefficient with respect to the ground truth.
Training Data Testing Data
Global linear mapping 0.5171 0.4834
Local linear mapping 0.7022 0.6904
Nonlinear mapping using ANN 0.9140 0.8902
However, the training data is not exactly the ground truth; only the information contributed by the selected MUs is used. The Pearson coefficients are therefore recalculated by replacing $\vec{d}_n$ in Eq. (4.4) with the information preserved by the selected MUs. More exactly, $\vec{d}_n$ is replaced by $\vec{m}_0 + \sum_{i=1}^{4} (\vec{d}_n^T \vec{m}_i)\,\vec{m}_i$, which is called the biased ground truth.
The results are shown in Table 4.2. As is shown, the local linear audio-to-visual mapping
performs better than the global linear audio-to-visual mapping. The nonlinear mapping
using artificial neural networks works best.
Table 4.2 Real-time speech-driven evaluation II: Pearson correlation coefficient with respect to the biased ground truth.
Training Data Testing Data
Global linear mapping 0.6137 0.6056
Local linear mapping 0.8081 0.7896
Nonlinear mapping using ANN 0.9797 0.9685
• Normalized MSE
The displacement vector $\vec{d}_n$ is normalized in the following way: the displacement of each vertex is scaled to [-1.0, 1.0] by dividing it by the maximum displacement of that vertex. The MSE of every audio-visual mapping with respect to the ground truth is shown in Table 4.3.
Table 4.3 Real-time speech-driven evaluation III: normalized MSE with respect to the ground truth.
Training Data Testing Data
Global linear mapping 0.0411 0.0456
Local linear mapping 0.0321 0.0342
Nonlinear mapping using ANN 0.0218 0.0221
The normalized MSE of each audio-to-visual mapping method with respect to the biased
ground truth is also calculated and shown in Table 4.4. The evaluation results based on
MSE index also show that the local linear audio-to-visual mapping performs better than
the global linear audio-to-visual mapping. The nonlinear audio-to-visual mapping using
neural networks works best.
Table 4.4 Real-time speech-driven evaluation IV: normalized MSE with respect to the biased ground truth.
Training Data Testing Data
Global linear mapping 0.0295 0.0310
Local linear mapping 0.0166 0.0183
Nonlinear mapping using ANN 0.0025 0.0029
4.4.4 A speech-driven face animation example
Section 4.4.3 reports the experimental results in aggregate. In this section, to give the reader a direct visual impression of the estimation results, a typical example is selected and shown in detail.
The text of the selected audio track is “Don’t ask me to carry an oil rag like that.” The
global linear mapping, the local linear mapping and the nonlinear mapping using neural
networks are used to estimate the visual feature sequence for the audio track. The results
are shown in Figures 4.3, 4.4, and 4.5, respectively. In those figures, the values of the estimated MUPs are shown as trajectories versus time. Four trajectories are shown in each figure; they correspond to the coefficient trajectories of the MUs $\vec{m}_1$, $\vec{m}_2$, $\vec{m}_3$, and $\vec{m}_4$. The horizontal axes of the figures represent time, and the vertical axes represent the magnitude of the MUPs. The solid red lines represent the goal trajectories. The
The trajectories estimated by the local linear mapping are closer to the goal trajectories
than those estimated by the global linear mapping. The trajectories estimated by the
nonlinear mapping using neural networks are the closest to the goal trajectories.
Figure 4.3 The estimation results of the global linear mapping (trajectories of c1-c4).
Figure 4.4 The estimation results of the local linear mapping (trajectories of c1-c4).
Figure 4.5 The estimation results of the nonlinear mapping using neural networks (trajectories of c1-c4).
CHAPTER 5
5 THE IFACE SYSTEM
This chapter describes the iFACE system [32]. The system provides functionalities for face modeling and face animation, and it serves as a research platform for the integrated framework (Figure 1.1, page 16). The system is also being used by other researchers to carry out research on human perception of synthetic talking faces. Based on the iFACE system, we won fourth place in the V. Dale Cozad Business Plan Competition 2000.
5.1 Introduction
The iFACE system takes the CyberwareTM scanner data of a subject’s head as input and
allows the user to interactively fit a generic face model to the CyberwareTM scanner data.
The iFACE system uses the key frame technique for text driven face animation and off-
line speech-driven face animation. The real-time speech driven function of the iFACE
system is based on the techniques described in Chapters 2 and 4.
5.2 Generic Face Model
The generic face model (Figure 5.1) used in the iFACE system was originally purchased from Viewpoint Corporation1 and was later modified by adding a tongue model and a teeth model [90]. The head model includes nearly all the head components, such as the face, eyes, teeth, ears, and so on, and consists of 2240 vertices and 2946 triangles. The surfaces of the components are approximated by triangular meshes. The tongue component is modeled by a Non-Uniform Rational B-Splines (NURBS) model with 63 control points.
1 Source: http://www.viewpoint.com.
One advantage of the polygonal model is that the surface deformation can be computed much faster than with physics-based models.
Figure 5.1 The generic geometric face model: (a) shown as wire-frame; (b) shown as shaded.
5.3 Customize the Face Model
The iFACE system enables the user to customize the generic face model for an individ-
ual. The iFACE system adopts an approach similar to that of [41]. Both methods take the
CyberwareTM cyberscanner data of a subject as the input, ask the user to manually select
some feature points, and warp the generic model to fit the CyberwareTM cyberscanner
data. The differences between them include the definitions of the feature point set and the
ways to warp the generic model.
The laser head and the laser sensor of the CyberwareTM scanner rotate 360 degrees around the subject, who must keep still for a few seconds. The sensor captures the laser light reflected from the surface of the head and measures both the 3D range information and the texture information of the surface. The range data is a map that records the distance from the laser sensor to points on the head surface. The texture data is a reflectance image of the laser beam from the head surface. Both the range map and the texture data are represented in cylindrical coordinates with a resolution of 512 in longitude (representing 0-360 degrees) and 512 in latitude. Figure 5.2 shows a pair of range data and texture data, which are unfolded as 2D images.2
Figure 5.2 An example of the CyberwareTM cyberscanner data: (a) texture data; (b) range data.
A coarse-to-fine approach is used to fit the generic face model to the range data. A coarse model (Figure 5.3) is built by manually selecting 101 vertices from the generic face
model. The coarse model is a triangular mesh and consists of 164 triangles. The fitting
procedure asks the user to manually select 31 landmarks on the texture map of the Cy-
berwareTM cyberscanner data. The coarse model is first warped to fit the range data. The
whole generic face model is then warped to fit the range data.
Figure 5.3 The coarse model in 2D cylindrical coordinate space.
2 This cyberscanner data is the head of Dr. Russell L. Storms from the Federal Army Research Laboratory.
Thirty-one vertices are defined among those 101 vertices of the coarse model. Those ver-
tices correspond to the facial landmarks, such as nose tip, eye corners, mouth corners,
chin, upper line of the neck, bottom line of the neck, and so on. These landmarks imply
the structure information of the head, such as the height and width of the head, the posi-
tions of the ears and eyebrows, the position of the neck, and so on. In the cylindrical co-
ordinate space, those feature points divide the facial surface into many local rectangular
regions (Figure 5.4).
Figure 5.4 The landmarks divide the head surface into many local rectangular regions in the cylindrical coordinate space. (a) The boundary of the outer rectangle represents the boundary of the range map; the range map is divided into rectangular regions whose corners are the landmarks and some points on the boundary. (b) The coarse model is drawn overlapping the regions in the cylindrical coordinate space.
The user manually selects those thirty-one feature points on the texture data. An example
is shown in Figure 5.5. The selected feature points have their correspondences in the
coarse model and provide the coordinates for the landmarks. Once the feature points are
selected, 2D local scaling in both the vertical and horizontal directions within each rectangular region is used to deform the coarse model in the cylindrical coordinate space.
Figure 5.5 Select feature points on the texture map.
In the cylindrical coordinate space, the coarse model triangulates the facial surface into
many local triangle patches. Each local triangle patch defines a local affine system. After
the coarse model is fitted, 2D local affine transformations are applied to warp the generic
model in the cylindrical coordinate space. The range values of the vertices are picked up
from the range map. The Cartesian coordinates of each vertex are then calculated from its
longitude, latitude, and range value.
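A minimal sketch of this conversion is given below, assuming the cylinder axis is the vertical y axis and that the latitude coordinate is already expressed as a height along that axis; the exact axis conventions and units of the CyberwareTM data are assumptions made for the example.

```python
import numpy as np

def cylindrical_to_cartesian(theta, height, rng):
    """Convert a cylindrical sample (longitude theta in radians, height along
    the cylinder axis, range = distance from the axis) to Cartesian (x, y, z).
    The cylinder axis is assumed to be the y axis."""
    x = rng * np.cos(theta)
    z = rng * np.sin(theta)
    y = height
    return np.array([x, y, z])

# One column of a 512 x 512 range map corresponds to one longitude value;
# column index 128 maps to a quarter turn around the head.
theta = 2.0 * np.pi * (128 / 512.0)
print(cylindrical_to_cartesian(theta, height=0.12, rng=0.095))
```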
The remaining head components (e.g., eyes, hair, ears, tongue, and teeth) are automati-
cally adjusted by shifting, rotating, and scaling. For example, the teeth model is shifted
according to the position of the feature point that represents the middle point of the lower
contour of the upper lip, and is scaled according to the width of the mouth decided by the
distance between the two mouth corners. Manual adjustments of the fitted model are required where range data are missing, since missing data lead to miscalculation of the sizes and positions of the head components. Figure 5.6 shows an example of a semi-finished face model after the automatic calculations. An interactive interface is developed for adjusting the model; the adjustment usually takes a few hours. Figure 5.7 shows some
examples of the face models after manual adjustment.
Figure 5.6 A semi-finished face model and the model editor.
Figure 5.7 Examples of the customized face model.
5.4 Face Deformation Control Model
A triangular control model (Figure 5.8) is defined by selecting a subset of vertices from
the generic face model. The control model consists of 115 vertices and 180 triangles. The
factors that govern the selection of control vertices include physiology (the distribution of
facial muscles) as well as various practical considerations relating to the topology of the
generic face model. Using the same interface (Figure 5.6) for adjusting the semi-finished
model, the user can move the control points.
Figure 5.8 The control model.
The face model is deformed in an affine-transformation-like manner. In the cylindrical coor-
dinate space, the control model triangulates the facial surface into many local triangle
patches, called control triangles. The vertices of the face model are distributed into those
triangle patches. When the shapes of the control triangles are changed, the coordinates of
other vertices are changed as shown in Figure 5.9.
Assume a control triangle <P1, P2, P3> is deformed to <P′1, P′2, P′3> in the cylindrical coordinate space (see Figure 5.9), where P′1, P′2, P′3 are the correspondences of P1, P2, P3, respectively. We calculate the cylindrical coordinates of the point P′ corresponding to a point P inside triangle <P1, P2, P3> by
$$
\begin{aligned}
t' &= \lambda_1 t'_1 + \lambda_2 t'_2 + \lambda_3 t'_3 \\
g' &= \lambda_1 g'_1 + \lambda_2 g'_2 + \lambda_3 g'_3 \\
r' &= r + \lambda_1 (r'_1 - r_1) + \lambda_2 (r'_2 - r_2) + \lambda_3 (r'_3 - r_3)
\end{aligned}
\qquad (5.1)
$$

where

$$
W = \begin{vmatrix} t_1 & t_2 & t_3 \\ g_1 & g_2 & g_3 \\ 1 & 1 & 1 \end{vmatrix}, \quad
\lambda_1 = \frac{1}{W}\begin{vmatrix} t & t_2 & t_3 \\ g & g_2 & g_3 \\ 1 & 1 & 1 \end{vmatrix}, \quad
\lambda_2 = \frac{1}{W}\begin{vmatrix} t_1 & t & t_3 \\ g_1 & g & g_3 \\ 1 & 1 & 1 \end{vmatrix}, \quad
\lambda_3 = \frac{1}{W}\begin{vmatrix} t_1 & t_2 & t \\ g_1 & g_2 & g \\ 1 & 1 & 1 \end{vmatrix}.
$$
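The sketch below transcribes Equation (5.1) directly, computing the barycentric coordinates in the (t, g) plane; the function and variable names are illustrative and assume the enclosing control triangle has already been found.

```python
import numpy as np

def warp_by_control_triangle(P, tri, tri_def):
    """Warp point P = (t, g, r) by the deformation of its enclosing control
    triangle, following Equation (5.1).  tri and tri_def are 3x3 arrays whose
    rows are (t_i, g_i, r_i) and (t'_i, g'_i, r'_i), respectively."""
    def det3(a, b, c):
        # Determinant of [[a_t, b_t, c_t], [a_g, b_g, c_g], [1, 1, 1]].
        return np.linalg.det(np.array([[a[0], b[0], c[0]],
                                       [a[1], b[1], c[1]],
                                       [1.0, 1.0, 1.0]]))
    p1, p2, p3 = tri
    W = det3(p1, p2, p3)
    lam = np.array([det3(P, p2, p3), det3(p1, P, p3), det3(p1, p2, P)]) / W
    t_new = lam @ tri_def[:, 0]                        # t' = sum(lambda_i t'_i)
    g_new = lam @ tri_def[:, 1]                        # g' = sum(lambda_i g'_i)
    r_new = P[2] + lam @ (tri_def[:, 2] - tri[:, 2])   # range updated by offsets
    return np.array([t_new, g_new, r_new])

# Example: the centroid of a control triangle whose range is lifted by 0.2
# moves up by the same amount.
tri     = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
tri_def = np.array([[0.0, 0.0, 1.2], [1.0, 0.0, 1.2], [0.0, 1.0, 1.2]])
print(warp_by_control_triangle(np.array([1.0 / 3, 1.0 / 3, 1.0]), tri, tri_def))
```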
Figure 5.9 Local affine transformation for facial surface deformation.
The model editor, shown in Figure 5.6, can be used to manually adjust the coordinates of
the control points. Figure 5.10 shows an example that uses the model editor to create a
facial expression.
Figure 5.10 Create facial shape using the model editor.
A library of facial shapes, which consists of expressions and visemes3 (Figure 5.11), is
created manually by an artist.
Figure 5.11 Examples of facial expressions and visemes. (a) smile, (b) disgust, (c) surprise, (d) laugh, (e) viseme ‘f’, (f) viseme ‘i’, (g) viseme ‘o’, and (h) viseme ‘e’.
5.5 Text Driven Face Animation
When text is used in communication, e.g., in the context of text-based electronic chatting
over the Internet or visual email, visual speech synthesized from text will greatly help
deliver information. Recent work on text driven face animation includes the work of
Cohen and Massaro [15], Ezzat and Poggio [26], and Waters and Levergood [83].
As far as face modeling and animation are concerned, these works differ from each other mainly in their face models and interpolation functions.
3 A viseme is a generic facial shape that serves to describe a particular sound. A viseme is the visual equivalent of a phoneme.
Cohen and Massaro [15] use a parametric geometric face model and Löfqvist's facial articulatory gesture model to calculate the parameters of the face model.
Ezzat and Poggio [26] use facial images directly. They collect a set of facial images that
correspond to visemes. Those images are used as key frames. The pixel correspondences
between two key frames are calculated using the optical flow technique developed by
Bergen and Hingorani [5]. The face animation is achieved by morphing between key
frames based on the correspondences calculated. They adopted the morphing technique
proposed by Beier and Neely [4].
Waters and Levergood [83] use a geometric face model. A set of facial shapes is manu-
ally edited. Those facial shapes correspond to visemes and are used as key frames during
the animation procedure. The facial shapes between two key frames are calculated by
morphing between two key frames. The morphing parameters, or the weights of the key
frames, are calculated by a linear or nonlinear transformation of time t. A physics-based
technique for calculating vertex displacements is also described.
Similar to the work of Ezzat and Poggio [26] and that of Waters and Levergood [83], the
iFACE system adopts the key frame based face animation technique for text driven face
animation. The procedure of the text driven face animation is illustrated in Figure 5.12.
The iFACE system uses the Microsoft Text-to-Speech (TTS) engine4 for text analysis and
speech synthesis. First, the text stream is fed into the TTS engine. TTS parses the text and
generates the corresponding phoneme sequence, the timing information of phonemes, and
the synthesized speech stream. Each phoneme is mapped to a viseme based on a lookup
table. Each viseme serves as a key frame; therefore, the text is translated into a key frame sequence. Face animation is done by the morphing technique described in [83].
4 Microsoft TTS is publicly available from the download page of Microsoft Corporation. The URL of the download page changes from time to time and is therefore not provided here. The user can search for it at http://www.microsoft.com.
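The following sketch illustrates the phoneme-to-viseme step, assuming the TTS engine returns (phoneme, start time, end time) triples; the lookup dictionary shows only a few of the 44 entries of Table 5.1, and each key frame is placed at one third of the phoneme's duration as described in the next paragraph.

```python
# A few entries of the phoneme-to-viseme lookup (see Table 5.1); the full
# table maps 44 phonemes to 17 viseme groups.
PHONEME_TO_VISEME = {"SIL": 17, "AE": 13, "N": 5, "M": 1, "EY": 8, "SH": 4}

def phonemes_to_keyframes(phonemes):
    """Map (phoneme, start, end) triples to (viseme, key_time) pairs, placing
    each key frame at one third of the phoneme's duration."""
    keyframes = []
    for phoneme, start, end in phonemes:
        viseme = PHONEME_TO_VISEME[phoneme]
        keyframes.append((viseme, start + (end - start) / 3.0))
    return keyframes

print(phonemes_to_keyframes([("SIL", 0.00, 0.10),
                             ("AE", 0.10, 0.22),
                             ("N", 0.22, 0.30)]))
```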
Each key frame is located at one third of the corresponding phoneme's duration. The facial deformations between two consecutive key frames are obtained by interpolation. The weights of the key frames are calculated by
$$\vec{f}_t = (1-\alpha)\,\vec{k}_{t_0} + \alpha\,\vec{k}_{t_1} \qquad (t_0 < t < t_1) \qquad (5.2)$$

where $\vec{f}_t$ is the facial deformation at time $t$, $\vec{k}_{t_0}$ and $\vec{k}_{t_1}$ are the two key frames represented as facial deformations, and $\alpha$ is calculated by

$$\alpha = \frac{1 - \cos\left(\pi \frac{t - t_0}{t_1 - t_0}\right)}{2} \qquad (5.3)$$
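A direct transcription of Equations (5.2) and (5.3), assuming key frames are stored as deformation vectors, might look as follows; the names are illustrative.

```python
import numpy as np

def interpolate_keyframes(k0, k1, t, t0, t1):
    """Blend two key-frame deformation vectors k0 (at time t0) and k1 (at time
    t1) at time t, using the cosine weight of Equation (5.3)."""
    alpha = (1.0 - np.cos(np.pi * (t - t0) / (t1 - t0))) / 2.0   # Eq. (5.3)
    return (1.0 - alpha) * k0 + alpha * k1                       # Eq. (5.2)

# Halfway between the key frames alpha = 0.5, so the two deformations mix evenly.
k0 = np.zeros(3)                # e.g., the neutral deformation
k1 = np.array([0.0, 0.4, 0.1])  # e.g., the deformation of viseme 'o'
print(interpolate_keyframes(k0, k1, t=0.5, t0=0.0, t1=1.0))
```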
Figure 5.12 The architecture of text driven face animation.
The facial deformations are added to the neutral facial shape to obtain the final facial shapes. Combined with a script of an expression sequence, we can synthesize an expressive talking head, for example a head that talks while nodding, blinking its eyes, raising its eyebrows,
and so on. The facial deformation of each facial expression is treated as an additive key
frame. This method is reasonable when the expression only involves upper face deforma-
tion. When the expression involves lower face deformation, such as smiling, disgust, and
so on, this method might cause artifacts.
The iFACE system uses a label set of 44 phonemes and 17 visemes. The phonemes and their viseme groups are shown in Table 5.1.
Table 5.1 Phonemes and visemes used in the iFACE system.
Phoneme  Word     Viseme group | Phoneme  Word      Viseme group
AA       cot      14           | IX       kisses    5
AE       bat      13           | JH       judge     4
AY       buy      8            | K        kit       5
AW       down     8            | LL       led       11
AO       bought   15           | EL       bottle    11
OY       boy      5            | M        mom       1
EH       bet      8            | N        nun       5
EY       bait     8            | NX       sing      8
AX       bird     7            | EN       button    5
IH       bit      8            | P        pop       1
IY       beat     10           | R        red       9
OW       boat     16           | YU       cute      12
UH       book     7            | S        sister    5
AH       but      8            | SH       shoe      4
UW       lute     6            | T        butter    5
B        bob      1            | TH       thief     3
CH       church   4            | V        verve     2
D        dad      5            | W        wet       6
DH       they     3            | Y        yet       5
F        fin      2            | Z        zoo       5
G        gag      5            | ZH       measure   4
HX       hay      8            | SIL      SILENCE   17
5.6 Off-Line Speech-Driven Face Animation
When human speech is used in one-way communication (e.g., news broadcasting over networks), an off-line speech-driven talking face is needed. The process of off-line speech-driven face animation in the iFACE system is illustrated in Figure 5.13. A speech stream
is first recognized into a phoneme sequence. The timing information of the phoneme se-
quence is also recorded. Once the phoneme sequence and the timing information are
given, the iFACE system animates the face model using the key frame technique, which
is the same as text driven face animation.
Figure 5.13 The architecture of off-line speech-driven face animation.
Recognizing phonemes from the speech signal alone requires a complicated continuous speech recognizer. Moreover, the phoneme recognition rate and the timing information of the phonemes may not be accurate enough. The text script associated with the speech provides
the accurate word-level transcription, which can be used to reduce the complexity of the
phoneme recognition problem and improve the recognition rate. The iFACE system uses
a phoneme recognition and alignment tool that comes with HTK 2.0 for UNIX.
An example of off-line speech-driven face animation sequence is shown in Figure 5.14.
The image sequence corresponds to the word “animation.” There are 70 frames in total.
Only the mouth region of every other frame is shown in Figure 5.14.
Figure 5.14 An example of off-line speech-driven face animation. The images are shown in order according to time. The time increases from left to right and from top to bottom.
5.7 Real-Time Speech-Driven Face Animation
So far, the iFACE system has used the key frame technique to animate the face model. Using the method described in Section 2.3.1, only a slight modification is required to enable the iFACE system to adopt MUs for face animation. In Chapter 4, two new real-time audio-to-
MUP mappings are presented. Combining the MU-based face animation and real-time
audio-to-MUP mapping, we can add real-time speech-driven face animation functionality
to the iFACE system. Figure 5.15 shows the synthesized image sequence of the word
“animation” using nonlinear real-time audio-to-MUP mapping. The speech segment of
the image sequence in Figure 5.14 is used.
Figure 5.15 An example of nonlinear real-time speech-driven face animation. The images are shown in order according to time. The time increases from left to right and from top to bottom. Only the mouth region of every other frame is shown.
Comparing Figures 5.14 and 5.15, it can be seen that the animation results of off-line
speech-driven face animation are smoother than those of real-time speech-driven face
animation with a constant short time delay. This can be noticed quickly by looking at
the last row of images in the two figures. This result is expected, because off-line speech-driven face animation has access to the whole speech context, while real-time speech-driven face animation uses only fixed-length contextual speech information.
5.8 The iFACE System in Distributed Collaborative Environments
The iFACE system was demonstrated on site at the Army Research Lab Symposium 2000 and the Army Research Lab Symposium 2001. Recently, a shoulder model
was added to the face model (Figure 5.16).
Figure 5.16 A shoulder model is added to the face model.
The iFACE system is used to support collaboration in a distributed environment, where
users are in different types of environments and use heterogeneous hardware platforms.
The collaborators are connected via wireless networks. Remote participants are repre-
sented as avatars in the system. The faces of the avatars are driven by speech.
Personnel at the central base are in charge of processing information from field personnel, reasoning, and planning. They use desktop PCs and see-through head-mounted displays (Figure 5.17(a)). The field personnel are mobile units responsible for providing the latest field information and executing plans. They use a vehicle-based mobile computing station called MIC3E (Figure 5.17(b)(c)).5 MIC3E has space for two persons. It is equipped with two Pentium III 500 MHz PCs with 128 MB of memory, running Windows NT 4.0. MIC3E has three displays: a 50 in Pioneer main screen and two 17 in desk screens. The users can switch the materials being displayed to
any of the three screens. Other mobile individuals are equipped with lightweight portable
devices such as laptops.
Figure 5.17 The iFACE system in a distributed collaborative environment. (a) Ava-tar in the head mounted display, (b) avatar in the desk screen of MIC3E, (c) avatar in the main screen of MIC3E.
In this distributed collaborative environment, our avatar system supports collaboration among users in heterogeneous conditions by providing an alternative to traditional video-based face-to-face interaction. The bandwidth saved by the avatar system can be used for transmitting other data.
5 MIC3E is built by Sytronics Inc.
CHAPTER 6
6 CONCLUSIONS AND FUTURE WORK
6.1 Summary
This dissertation describes an integrated framework for face modeling, facial motion
analysis and synthesis. The framework provides a systematic guideline for research on
face modeling and animation. The guideline contains the following steps.
The starting point is to select a quantitative visual representation for facial deformations. The visual representation should provide enough information for deforming face models and be suitable for explaining real facial deformations. In this thesis, the MU is adopted for modeling facial deformations. MUs are learned from a set of labeled real facial deformations; therefore, they encode the characteristics of facial deformations and are suitable for realistic face animation. An arbitrary facial deformation can be approximated by a linear combination of MUs weighted by MUPs. It is shown that the MU-based face animation technique is compatible with the key frame based animation technique and the MPEG-4 face animation standard.
Then, the visual representation is used in facial motion analysis. The analysis results can
be used directly for face animation. A real-time robust MU-based facial motion tracking
algorithm is presented. The tracking algorithm integrates low-level information, which is
obtained by optical flow calculation techniques, and high-level knowledge, which is rep-
resented by MUs. The tracking results are represented as an MUP sequence, which can be
immediately used for MU-compatible face animation techniques.
Human activities typically cohere across modalities, and this is certainly true for facial deformations and speech: the audio channel (speech) and the visual channel (the facial deformation sequence) are highly correlated. Given the facial deformation control model and the facial motion analysis tool, it is now possible to explore the quantitative association between the audio track and facial behavior. A set of videos of a speaking subject is collected. The
visual part of the video is processed by the MU-based facial motion tracking algorithm.
The results are represented as MUP sequences. The features of the audio tracks are calcu-
lated. Two real-time audio-to-visual mappings with constant short time delay are exam-
ined. One is a local linear mapping. The other is a local nonlinear mapping using MLP.
The framework is used to guide the development of a face modeling and animation sys-
tem, called iFACE [32]. The system provides functionalities for building a face model for any individual, text-driven face animation, and off-line and real-time speech-driven
face animation.
6.2 Future Research
Future research should be conducted to improve the framework to develop a highly lip-
readable synthetic talking face for human auditory-visual speech perception studies and
human face-to-face communication in noisy environments.
6.2.1 Explore better visual representation
Continuous endeavor is required to investigate the best visual representation of facial
movements. Currently, PCA is used to learn MUs. PCA is a second-order technique that
assumes the data has a Gaussian distribution. One of its advantages is that it requires only
classical matrix manipulations and thus is computationally and conceptually simple.
However, the second-order information of facial movements is not enough for developing a highly lip-readable synthetic talking head.
One possible improvement is to use independent component analysis (ICA) [33] for MU
learning. ICA is a higher-order technique and assumes non-Gaussianity of the data. ICA
tries to find a representation that minimizes the statistical dependence of the components
of the representation. ICA may better capture the structure of facial motion than PCA.
We shall evaluate the goodness-of-fit results based on both the mean-squared errors be-
tween the approximation and the ground truth and subjective tests of human perceivers.
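As an illustration of this proposed direction (not a component of the current iFACE system), the sketch below contrasts PCA and ICA bases learned from a hypothetical matrix of tracked facial deformations, using scikit-learn's FastICA; the data shape, the number of units, and the use of scikit-learn are all assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA, FastICA

# Hypothetical training data: each row is one tracked facial deformation
# (concatenated displacements of the feature points) for one video frame.
deformations = np.random.randn(2000, 120)   # placeholder for real tracking data

n_units = 7
pca_mus = PCA(n_components=n_units).fit(deformations).components_
ica_mus = FastICA(n_components=n_units, random_state=0).fit(deformations).components_

# Both yield n_units basis vectors ("MUs"); a facial deformation is then
# approximated by a linear combination of these vectors.  The ICA basis
# minimizes statistical dependence between components rather than merely
# decorrelating them, which is the property argued for above.
print(pca_mus.shape, ica_mus.shape)
```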
Currently, MUs only cover the lower part of the face. Future work can extend MUs to
encode the 3D information of the whole face. The 3D facial deformation data for training
3D MUs can be synchronously captured by multiple camera systems, for example the Vi-
sion-1 system.1
6.2.2 Improve and evaluate the facial motion tracking algorithm
New MUs will require updating the MU-based facial motion tracking algorithm. If 3D MUs become available, a 3D MU-based facial motion tracking algorithm can be developed and implemented; the theory of such an algorithm is presented in Section 3.6.
Facial motion tracking using multiple cameras will help to increase the robustness of the
tracking algorithm. Multiple cameras are especially helpful for handling occlusions.
Those cameras should work synchronously. The theory of the 3D MU-based facial mo-
tion tracking algorithm is developed and presented in Section 3.7.
The tracking algorithms presented in Sections 3.2, 3.6, and 3.7 require an accurate face
model (or the facial shape at its neutral state). This makes them difficult to generalize be-
cause it is not easy to obtain an accurate enough model for any individual. A new track-
ing algorithm, called 3D MU-BSV-based facial motion tracking algorithm, is presented
in Section 3.8. In contrast to the previous two 3D MU-based facial motion tracking algo-
rithms, the face model of an individual is first guessed by warping a generic face model
based on some manually selected facial feature points. The warped generic face model
may not fit the subject's face well. However, its shape can be adjusted using a set
of BSVs of the face, which can be learned from real 3D face shape data by, for example,
PCA. The parameters of those basic facial shapes are unknown and will be adjusted dur-
ing tracking. Therefore, the tracking algorithm eventually estimates both the face model
and the parameters of face and facial motions.
Sections 3.6, 3.7, and 3.8 develop the theory for the above three tracking algorithms. No
experimental results are provided due to the following two practical issues: (1) 3D MUs
are not currently available because the equipment for collecting 3D facial motion training
1 http://www.vision1.com/.
data is not available in the lab; and (2) a working multiple-camera system is not accessible. Once those conditions are fulfilled, evaluating the tracking algorithms will be straightforward.
Note that the tracking algorithms developed in this thesis can be used to track not only the face and facial motion but also the motion of other non-rigid and highly articulated objects (e.g., human hands and bodies).
6.2.3 Refine audio-to-visual mapping
In the future, personal digital assistants (PDAs) will be very popular in wireless communication and will enable face-to-face style communication. However, in many situations, limited bandwidth permits only audio (but not video) transmission. Using the audio speech to animate a synthetic talking face provides an effective solution. It is therefore important to develop a real-time, speech-driven, highly lip-readable synthetic talking head.
Research can be conducted to investigate how to incorporate dynamic Bayesian networks
(DBNs) [91] into the audio-to-visual mapping. There has been increasing interest in ap-
plying DBNs to speech recognition in recent years [73], [91]. DBNs use factored state
representation, which requires exponentially fewer parameters than HMMs. Factored
state representation also enables DBNs to explicitly represent many phenomena that can-
not be directly modeled by HMMs, for example articulator positions, speaker-gender, and
speaking rate. Therefore, DBNs are more interpretable and computationally efficient
than HMMs. DBNs will enable us to train the audio-to-visual mapping in a more pur-
poseful way.
6.2.4 Human perception on synthetic talking face
It is human beings who will finally enjoy face animation. Human perceptual experiments
should be designed to develop and test hypotheses about stimulus characteristics of the
auditory visual speech signal related to enhanced human speech perception. Experimental
results should be fed back to guide the engineering research and improve the model of the
synthetic talking face. Future versions of the integrated framework for face modeling, facial motion analysis, and synthesis research should include humans in the loop.
6.2.5 Improve the tongue models
Besides facial animation, tongue animation contributes to enhanced speech understand-
ing. It is important to incorporate tongue motion in a version of our synthetic talking face
to obtain comparison perceptual data. Similar to the face model and animation research,
the visual representation of the tongue deformation should be first decided and learned
from real data. There is a publicly available X-ray Microbeam Speech Production Data-
base collected by the University of Wisconsin [85]. It consists of simultaneous acoustic
and kinematic recordings for speech collected from more than 50 normal American Eng-
lish speakers. To track kinematic signals, small gold pellets were used as markers and
glued to the following locations: along the midline length of the tongue, incisors, one mo-
lar tooth of the mandible, and in the midline at the vermillion border of each lip.
6.3 Improving the Key Frames of the iFACE System
Currently, the key frames of the iFACE system are created manually. Some regions of the
key frames are not well done, which degrades the animation results. When the experi-
mental conditions allow, those key frames can be refined by using real data. For example,
a subject can be carefully selected. His/her 3D facial deformations can be captured using
multiple camera systems while a set of markers is put on his/her face.
7 REFERENCES
[1] K. Aizawa and T. S. Huang, “Model-based image coding,” Proc. IEEE, vol. 83, pp.
259-271, Aug. 1995.
[2] S. Basu and A. Pentland, “A three-dimensional model of human lip motions trained
from video,” in Proc. IEEE Non-Rigid and Articulated Motion Workshop at
CVPR’97, San Juan, June 1997, pp. 46-53.
[3] S. Basu, N. Oliver, and A. Pentland, “3D modeling and tracking of human lip mo-
tions,” in Proc. ICCV’98, Bombay, India, January 1998.
[4] T. Beier and S. Neely, “Feature-based image metamorphosis,” in SIGGRAPH’92,
Chicago, IL, 1992, pp. 35-42.
[5] J. R. Bergen and R. Hingorani, “Hierarchical motion-based frame rate conversion,”
Technical Report, David Sarnoff Research Center, Princeton, New Jersey, April
1990.
[6] A. Blake, R. Cuiwen, and A. Zisserman, “Affine-invariant contour tracking with
automatic control of spatiotemporal scale,” in Proc. ICCV’93, Berlin Germany,
May 1993, pp. 66-75.
[7] A. Blake, M. A. Isard and D. Reynard, “Learning to track the visual motion of con-
tours,” Artificial Intelligence, vol. 78, pp. 101-134, 1995.
[8] M. Brand, “Voice puppetry,” in SIGGRAPH’99, 1999.
[9] C. Bregler and Y. Konig, “Eigenlips for robust speech recognition,” In Proc. Int.
Conference on Acoustic, Speech, Signal Processing, Adelaide, 1994, pp. 669-672.
[10] C. Bregler, M. Covell, and M. Slancy, “Video rewrite: Driving visual speech with
audio,” in SIGGRAPH’ 97, 1997.
[11] C. Carlson and O. Hagsand, “DIVE - A platform for multi-user virtual environ-
ments,” Computer and Graphics, vol. 17, no. 6, pp. 663-669, 1993.
[12] M. Chan, “Automatic lip model extraction for constrained contour-based tracking,”
in Proc. Int. Conf. of Image Processing, Kobe, Japan, 1999.
[13] C. S. Choi, K. Aizawa, H. Harashima, and T. Takebe, “Analysis and synthesis of facial image sequences in model-based image coding,” IEEE Transactions on Cir-
cuits and Systems for Video Technology, vol. 4, pp. 257-275, June 1994.
[14] T. Chen, and R. R. Rao, “Audio-visual integration in multimodal communications,”
Proceedings of the IEEE, vol. 86, no. 5, pp. 837--852, May 1998.
[15] M. M. Cohen and D. W. Massaro, “Modeling coarticulation in synthetic visual
speech,” in Models and Techniques in Computer Animation, N.M. Thalmann and
D. Thalmann, eds. Tokyo: Springer-Verlag, 1993, p. 139-156.
[16] R. A. Cole, D. W. Massaro, J. de Villiers, B. Rundle, K. Shobaki,
J. Wouters, M. M. Cohen, J. E. Beskow, P. Stone, P. Connors,
A. Tarachow, and D. Solcher, “New tools for interactive speech and
language training: Using animated conversational agents in the
classrooms of profoundly deaf children,” in Proceedings of ESCA/SOCRATES
Workshop on Method and Tool Innovations for Speech
Science Education, London, UK, Apr 1999.
[17] T. F. Cootes, C. J. Taylor, et al., “Active shape models – their training and applica-
tion,” Computer Vision and Image Understanding, vol. 61, no. 1, pp. 38-59, Jan.
1995.
[18] T. F. Cootes, G. J. Edwards, and C. J. Taylor, “Active appearance models,” in H.
Burkhardt and B. Neumann, eds., 5th European Conference on Computer Vision,
vol. 2, 1998, pp. 484-498.
[19] M. Covell and C. Bregler, “Eigen-points”, in Proc. IEEE Int. Conf. on Image Proc-
essing, vol.3, 1996, pp 471-474.
[20] S. Curinga, F. Lavagetto, F. Vignoli, “Lip movements synthesis using time-delay
neural networks”, in Proc. EUSIPCO-96, Trieste, 1996.
[21] D. DeCarlo and D. Mataxas, “Optical flow constraints on deformable models with
applications to face tracking”, Int. Journal of Computer Vision, vol. 38, no. 2, pp.
99-127, 2000.
[22] P. Eisert, T. Wiegand, and B. Girod, “Model-aided coding: A new approach to in-
corporate facial animation into motion-compensated video coding,” IEEE Transac-
tions on Circuits and Systems for Video Technology, vol. 10, no. 3, pp. 344-358,
Apr. 2000.
[23] P. Ekman and W. V. Friesen, “Facial action coding system,” Palo Alto, Calif.: Con-
sulting Psychologists Press, Inc., 1978.
[24] P. Ekman, T. S. Huang, T.J. Sejnowski and J.C. Hager, eds., Final report to NSF
of the planning workshop on facial expression understanding, Human Interaction
Laboratory, University of California, San Francisco, March, 1993.
[25] I. A. Essa and A. Pentland, “Coding Analysis, Interpretation, and Recognition of
Facial Expressions,” IEEE Transaction Pattern Analysis and Machine Intelligence,
vol. 10, no. 7, pp. 757 - 763, Jul. 1997.
[26] T. Ezzat and T. Poggio, “Visual speech synthesis by morphing visemes”, Interna-
tional Journal of Computer Vision 38(1), pp. 45-57, 2000.
[27] O. Faugeras, Three-Dimensional Computer Vision: a Geometric Viewpoint, MIT
Press, 1993.
[28] T. Goto, M. Escher, C. Zanardi, N.M. Thalmann "MPEG-4 based animation with
face feature tracking". CAS '99 (Eurographics Workshop on Animation and Simula-
tion), Milano, Italy, September. 7-8 1999.
[29] B. Guenter et al. “Making faces”, in Proc. SIGGRAPH '98, 1998.
[30] P. Hong, “Facial expressions analysis and synthesis,” MS thesis, Computer Sci-
ence and Technology, Tsinghua University, July, 1997.
[31] P. Hong, T. Huang, and X. Lin, “Mouth motion learning and generating from ob-
servation,” in IEEE Workshop on Multimedia Signal Processing, Dec. 7-9, 1998.
[32] P. Hong, Z. Wen, and T. S. Huang, “iFACE: a 3D synthetic talking face,” Interna-
tional Journal of Image and Graphics, vol. 1, no. 1, pp. 1-8, 2001.
[33] A. Hyvärinen. “Survey on independent component analysis,” Neural Computing
Surveys, vol. 2, pp. 94-128, 1999.
[34] I. T. Jolliffe, Principal Component Analysis, Springer-Verlag, New York, 1986.
[35] P. Kalra, A. Mangili, N. Magnenat Thalmann, D. Thalmann, “Simulation of facial
muscle actions based on rational free form deformations,” in Proc. Eurographics
'92, pp. 59-69.
[36] M. Kass, A. Witkin and D. Terzopoulos, “Snakes: Active contour models,”
International Journal of Computer Vision, vol. 1, no. 4, pp. 321-331, 1988.
[37] R. Kaucic and A. Blake, “Accurate, real-time, unadorned lip tracking,” in Proc.
ICCV’98, pp. 370-375.
[38] M. Kirby and L. Sirovich, "Application of the Karhunen-Loeve procedure for the
characterization of human faces," IEEE Transaction Pattern Analysis and Machine
Intelligence, vol. 12, pp. 103-108, 1990.
[39] S. Kshirsagar and N. Magnenat-Thalmann, “Lip synchronization using linear pre-
dictive analysis,” in Proceedings of IEEE International Conference on Multimedia
and Expo, New York, August 2000.
[40] F. Lavagetto, “Converting speech into lip movements: A multimedia telephone for
hard of hearing people,” IEEE Transactions on Rehabilitation Engineering, vol. 3,
no. 1, March 1995.
[41] Y. C. Lee, D. Terzopoulos and K. Waters, “Realistic modeling for facial anima-
tion,” in SIGGRAPH’95, pp. 55-62.
[42] W. H. Leung, K. Goudeaux, S. Panichpapiboon, S. B. Wang and T. Chen, “Net-
worked intelligent collaborative environment (NetICE),” in Proceeding of IEEE
Intl. Conf. on Multimedia and Expo., New York, 2000.
[43] J. P. Lewis, “Automated lip-sync: Background and techniques,” J. Visualization
and Computer Animation, vol. 2, pp. 118-122, 1991.
[44] H. Li, P. Roivainen and R. Forchheimer, “3-D motion estimation in model-based
facial image coding,” IEEE Trans. On Pattern Analysis and Machine Intelligence,
vol. 15, no. 6 pp. 545-555, 1993.
[45] A. Löfqvist, “Speech as audible gestures,” In W. J. Hardcastle and A. Marchal,
eds., Speech Production and Speech Modeling, Dordrecht: Kluwer Academic Pub-
lishers, pp. 289-322.
[46] B. D. Lucas and T. Kanade, “An iterative image registration technique with an ap-
plication to stereo vision,” in Proceedings of International Joint Conference on Ar-
tificial Intelligence, pp. 674-679, 1981.
[47] J. Mandeville, J. Davidson, D. Campbell, et al., “A shared virtual environment for
architectural design review,” CVE'96 Workshop Proceedings, Nottingham, UK,
1996.
[48] D. W. Massaro, Speech Perception by Ear and Eye: A Paradigm for Psychological
Inquiry, Hillsdale, NJ: Lawrence Erlbaum Associates, 1987.
[49] D. W. Massaro, Perceiving Talking Faces, MIT Press, 1998.
[50] D. W. Massaro, J. Beskow, et al. “Picture my voice: audio to visual speech synthe-
sis using artificial neural networks”, in Proc. AVSP'99, Santa Cruz, USA.
[51] K. Matsuno and S. Tsuji, "Recognizing human facial expressions in a potential
field," in Proc. ICPR, 1994, pp. 44-49.
[52] I. Matthews, T. Cootes, et al., “Lipreading from shape shading and scale,” in Proc.
Auditory-Visual Speech Processing, Terrigal, Australia, 1998, pp.73-78.
[53] S. Morishima, K. Aizawa and H. Harashima, “An intelligent facial image coding
driven by speech and phoneme,” in Proc. IEEE ICASSP, Glasgow, UK, 1989, pp.
1795.
[54] S. Morishima and H. Harashima, “A media conversion from speech to facial image
for intelligent man-machine interface”, IEEE J. Selected Areas in Communications,
vol. 4, pp. 594-599, 1991.
[55] S. Morishima, “Real-time talking head driven by voice and its application to com-
munication and entertainment,” in Proceedings of the International Conference on
Auditory-Visual Speech Processing, 1998, Terrigal, Australia.
[56] J. L. Mundy and A. Zisserman. Geometric Invariance in Computer Vision. MIT
Press, 1992
[57] K. Nagao and A. Takeuchi, “Speech dialogue with facial displays,” in Proc. 32nd
Annual Meeting of the Asso. for Computational Linguistics, 1994, pp. 102-109.
[58] M. Nahas, H. Huitric, and M. Saintourens, “Animation of a B-spline figure,” The
Visual Computer, vol. 3, pp. 272-276, 1988.
[59] G. M. Nielson, “Scattered Data Modeling,” IEEE Computer Graphics and Applica-
tions, vol. 13, no. 1, pp. 60-70, 1993.
[60] NTT Software Corporation Interspace, 3D virtual environment.
[61] I. Pandzic, J. Ostermann, D. Millen, “User evaluation: Synthetic talking faces for
interactive services,” The Visual Computer, vol. 15, issue 7/8, pp. 330-340, No-
vember 1999.
[62] F. I. Parke, “A parametric model of human faces,” Ph.D. thesis, University of Utah,
1974.
[63] F. I. Parke, “A parameterized model for facial animation”, IEEE Computer Graph-
ics and Applications, vol. 2, no. 9, pp. 61-70, 1982.
[64] F. I. Parke and K. Waters. Computer Facial Animation. AKPeters, Wellesley, Mas-
sachusetts, 1996.
[65] A. Pearce, B. Wyvill, G. Wyvill, and D. Hill, “Speech and expression: A computer
solution to face animation,” Graphics Interface 1986.
[66] C. Pelachaud, N. I. Badler, and M. Steedman, “Linguistic issues in facial anima-
tion,” in N. M. Thalmann and D. Thalmann, eds. Computer Animation ’91 Tokyo:
Springer-Verlag.
[67] F. Pighin, et al., “Synthesizing realistic facial expressions from photographs”, in
Proc. SIGGRAPH ’98, 1998.
[68] S. M. Platt and N. I. Badler, “Animating facial expression,” in SIGGRAPH’81, pp.
245-252.
[69] L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in
speech recognition,” Proc. of the IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[70] R. Rao and T. Chen, “Exploiting audio-visual correlation in coding of talking head
sequences,” in Picture Coding Symposium ’96, Melbourne, Australia, March 1996.
[71] L. Reveret and C. Benoit “A new 3D lip model for analysis and synthesis of lip
motion in speech production,” in Proc. of the Second ESCA Workshop on Audio-
Visual Speech Processing, Terrigal, Australia, Dec. 1998.
[72] J. Shi and C. Tomasi, “Good features to track,” in Proceedings of IEEE Conference
on Computer Vision and Pattern Recognition, 1994, pp. 593-600.
[73] T. A. Stephenson, H. Bourlard, S. Bengio and A. C. Morris, “Automatic speech
recognition using dynamic Bayesian Networks with both acoustic and articulatory
variables,” in Proceedings of 6th International Conference on Spoken Language
Processing, 2000.
[74] D. G. Stork and M. E. Hennecke, eds., Speechreading By Humans and Machines,
NATO ASI Series, Springer, 1996.
[75] H. Tao and T. S. Huang, “Explanation-based facial motion tracking using a piece-
wise Bezier volume deformation model,” in Proc. IEEE Computer Vision and Pat-
tern Recognition, 1999.
[76] D. Terzopoulos and K. Waters, “Techniques for realistic facial modeling and ani-
mation,” In M. Magnenat-Thalmann and D. Thalmann, eds., Computer Animation
’91, Tokyo, 1991. Springer-Verlag.
[77] D. Terzopoulos and K. Waters, “Analysis and synthesis of the facial image se-
quences using physical and anatomical models,” IEEE Transaction on Pattern
Analysis and Machine Intelligence, vol. 15, no. 6, pp. 569 - 579, Jun. 1993.
[78] C. Tomasi and T. Kanade, “Detection and tracking of point features,” Carnegie
Mellon University Technical Report CMU-CS-91-132, April 1991.
[79] M. Turk and A. Pentland, “Eigenfaces for recognition,” Journal of Cognitive Neu-
roscience, pp. 71 - 86, 1991.
[80] M. L. Viaud and H. Yahia, “Facial animation with wrinkles,” in Third Workshop
on Animation, Eurographics ’92, Cambridge, 1992.
[81] F. Vignoli, S. Curinga, and F. Lavagetto, “A neural clustering architecture for esti-
mating visible articulatory trajectories,” in Proc. ICANN96, Bochum, July 1996,
pp. 863-869.
[82] K. Waters, “A muscle model for animating three-dimensional facial expressions,”
Computer Graphics, vol. 21, no. 4, pp. 17-24, July 1987.
[83] K. Waters and T. M. Levergood, “DECface, an automatic lip-synchronization algo-
rithm for synthetic faces,” Digital Equipment Corporation, Cambridge Research
Lab, Technical Report CRL 93-4.
[84] K. Waters, J. M. Rehg, M. Loughlin, et al., “Visual sensing of humans for active
public interfaces,” Cambridge Research Lab, Technical Report CRL 96-5.
[85] J. Westbury, E. J. Severson, and M. Hashi, X-ray microbeam speech production
database user’s handbook, Madison, WI. 1994.
[86] L. Williams, “Performance-driven facial animation”, Computer Graphics, no. 24,
vol. 2, pp. 235-242, Aug. 1990.
[87] H. Yehia, P. Rubin, and E.V. Bateson, “Quantitative association of vocal-tract and
facial behavior”, Speech Communication, vol. 26, pp. 23-43, 1998.
[88] A. Yuille, P. Hallinan, and D. Cohen, “Feature extraction from faces using deform-
able templates,” Int. Journal of Computer Vision, vol. 8, no. 2, pp. 99-111, 1992.
[89] S. Zacks, The Theory of Statistical Inference. Wiley, New York. 1971.
[90] Z. Wen, Tongue and teeth modeling for face modeling and animation, Master The-
sis, Computer Science, University of Illinois at Urbana Champaign, 199.
[91] G. Zweig, “Speech recognition with dynamic Bayesian networks,” Ph.D. thesis,
Computer Science, UC Berkeley, 1998.
[92] “Text for CD 14496-2 Video,” ISO/IEC JTC1/SC29/WG11 N1902, Nov. 1997.
8 VITA
Pengyu Hong was born on May 16, 1973, in Zhangzhou, P. R. China. He received the
Bachelor of Engineering degree and Master of Engineering degree from Tsinghua Uni-
versity, Beijing, China, in 1995 and 1997 respectively. Both degrees are in Computer
Science.
In August 1997, Mr. Hong joined the Ph.D. program at the Department of Computer Sci-
ence of the University of Illinois at Urbana-Champaign, Urbana, Illinois, US. He works
as a research assistant in the Image Formation and Processing Laboratory at the Beckman
Institute for Advanced Science and Technology.
His research interest covers a broad scope in image and video processing, human com-
puter interaction, computer graphics, computer vision and pattern recognition, and ma-
chine learning. He is the senior author of 20 technical papers, and two book chapters.
Mr. Hong's research focuses on pattern recognition, computer vision and computer graph-
ics with their applications in Human Computer Interaction. His work on face modeling,
facial motion analysis, and synthesis results in a face-based multimedia information con-
version interface. His work on unsupervised pattern extraction automatically searches for
temporal and spatial regularities in a large database.