APGV04
Towards Perceptually Realistic Talking Heads: Models, Methods and McGurk
David Marshall, Darren Cosker and Paul Rosin
Cardiff School of Computer Science
Susan Paddock and Simon Rushton
Cardiff School of Psychology
Cardiff University
Context: A Talking Head
• Development of a Video-Realistic Talking Head
• Animation from Continuous Speech
• Perceptual Analysis -> Realism
Contribution of this Paper: Perceptual Realism Test
• Perceptual Analysis via McGurk Test
• Perceptual Test with no prior bias
• Used to improve talking head synthesis
Outline of Talk
• Video Realistic Talking Head (Overview)
• Perceptual Analysis and Testing
• The McGurk Effect + McGurk Test
• Results: Implications of McGurk
• Conclusions + Future Work
Our Talking Head
• Image-based synthesis
• Continuous Speech
• Flexible framework – emotion, behaviour
BASIC IDEA:
• Train on input video and audio
  • Extracting only low-level image and audio features
  • No phonetic labelling
• Synthesise new video using only input audio
  • Unseen utterances
  • Speaker Independent
Hierarchical Facial Model
• Active Appearance Models – Control of shape and texture using single ‘appearance parameter’
• Based on Principal Component Analysis (PCA)
• Non-linear Hierarchical PCA (developed at Cardiff)
• Greater Separation of Variation
• High Degree of Control – Sub-Facial variation not orthogonal in standard PCA model
• Coupling of Speech Model (Cardiff Idea)
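As a rough illustration of the PCA step that underlies an appearance model, the sketch below fits a plain linear PCA to landmark shape vectors and reconstructs one frame from its parameters. It is not the non-linear hierarchical PCA described on this slide. The toy data, the function names (`fit_pca`, `project`, `reconstruct`) and the choice of two components are all assumptions made for the example.

```python
import numpy as np

def fit_pca(samples, n_components):
    """Fit a linear PCA model to row-vector samples (one row per video frame)."""
    mean = samples.mean(axis=0)
    centred = samples - mean
    # SVD of the centred data: the rows of vt are the principal directions.
    _, singular_values, vt = np.linalg.svd(centred, full_matrices=False)
    basis = vt[:n_components]
    variances = singular_values[:n_components] ** 2 / (len(samples) - 1)
    return mean, basis, variances

def project(sample, mean, basis):
    """Map a shape vector to its low-dimensional model parameters."""
    return basis @ (sample - mean)

def reconstruct(params, mean, basis):
    """Map model parameters back to a full shape vector."""
    return mean + basis.T @ params

# Hypothetical toy data: 50 frames of 10 (x, y) landmarks that vary along
# two underlying modes, standing in for tracked mouth shapes.
rng = np.random.default_rng(0)
n_frames, n_landmarks = 50, 10
true_modes = rng.normal(size=(2, 2 * n_landmarks))
weights = rng.normal(size=(n_frames, 2))
shapes = 5.0 + weights @ true_modes

mean, basis, variances = fit_pca(shapes, n_components=2)
params = project(shapes[0], mean, basis)
recon = reconstruct(params, mean, basis)
max_error = float(np.max(np.abs(recon - shapes[0])))
```

Because the toy shapes vary along only two modes, two components reconstruct them to within floating-point error. Real landmark data would need more components and would leave a residual.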
Building A Talking Head - Initialisation
For Each Video Frame Extract:
• Shape – Key Landmark Points (Tracker Helps)
• Textures – Colour Pixel Values Normalised to Shape
• Speech Features – Mel-Cepstral, Linear Predictive Coding (LPC)
Building A Talking Head - Tracking
Semi-Automated:
• Hand Place Few Frames
• Build Interim Shape Model
• Track Other Frames
• Build Final Shape Model
Building A Talking Head - Learning/Model Building
Active Appearance Model (AAM) -> Shape (PCA) and Texture (PCA)
Speech/Appearance Model (SAAM NEW) -> Speech (PCA) and AAM
Nonlinear PCA:
• Gaussian Mixture Model (GMM)
Model of Dynamics:
• Hidden Markov Model (HMM)
Building A Talking Head - Synthesis + Reconstruction
Input Speech -> Extract Speech Features + Find Best Clusters
Bottom up reconstruction: Mouth Driven
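The "find best clusters" step can be pictured with a much simpler stand-in than the GMM and HMM models named in this talk. The sketch below assigns each speech feature frame to the nearest cluster centroid by Euclidean distance, then reads off the mouth appearance parameters stored with that cluster. The cluster table, the two-dimensional features and the function names are hypothetical, and no temporal model is applied.

```python
import math

# Hypothetical clusters: each pairs a centroid in speech-feature space with
# the mean mouth appearance parameters of the training frames in that cluster.
CLUSTERS = [
    {"speech_centroid": (0.0, 0.0), "appearance": (-1.0, 0.5)},
    {"speech_centroid": (1.0, 1.0), "appearance": (0.2, 0.0)},
    {"speech_centroid": (3.0, 0.5), "appearance": (1.5, -0.7)},
]

def best_cluster(speech_features, clusters):
    """Return the index of the cluster whose centroid is nearest (Euclidean)."""
    return min(
        range(len(clusters)),
        key=lambda i: math.dist(speech_features, clusters[i]["speech_centroid"]),
    )

def synthesise_parameters(speech_frames, clusters):
    """Map each speech frame to the appearance parameters of its best cluster."""
    return [clusters[best_cluster(frame, clusters)]["appearance"]
            for frame in speech_frames]

# Three made-up speech feature frames from an unseen utterance.
frames = [(0.1, -0.2), (0.9, 1.2), (2.8, 0.4)]
trajectory = synthesise_parameters(frames, CLUSTERS)
```

A per-frame nearest-centroid lookup like this ignores dynamics, which is presumably the gap the HMM fills in the actual system.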
Talking Head Examples
Talking Head Example: Independent Speaker
Perceptual Analysis of Talking Heads
Current Talking Head Analysis Methods
• Subjective Evaluation
• Analyse and Compare Trajectories
• Improved Perception in Noisy Environments
• Forced Choice Testing
Perceptual Analysis of Talking Heads: Subjective and Trajectory Evaluation
• Analyse and Compare Trajectories
  • Ground truth quantitative assessment
  • Comparison to “seen” data
  • No perceptual quality measurement
• Subjective Evaluation
  • Does it “look good”?
  • No formative comparison
  • No feedback to improve model
Perceptual Analysis of Talking Heads: Noisy Environment Evaluation
• Noisy Environment Evaluation
  • Perceptual Evaluation
  • Compare Performance of Synthetic v Real Talking Head in realistic situations
  • Good overall test of talking head
    • Lip-syncing, realism
  • No Quantitative Measure of Performance
Perceptual Analysis of Talking Heads: Forced Choice Testing
Forced Choice Testing:
• Users Asked if Video is Real or Synthetic
  • Only says if it looks realistic + lip sync is good
• Big Prior Introduced
  • Users look for artefacts
• Randomness Bias in User selection
  • Bored/Uninterested User
• No Quantitative Feedback for Model Improvement
  • What makes it real/synthetic?
Perceptual Analysis of Talking Heads: A New McGurk Test
• McGurk Test for Perceptual Analysis
• Subject doesn’t develop a prior
• Helps address strengths and weaknesses
• Suggests improvements based on these
• Complements other tests
Perceptual Analysis of Talking Heads: The McGurk Effect
McGurk and MacDonald (1976):
• Auditory Syllable Dubbed onto Videotape of Different Syllables Gives Perception of an Entirely Different Syllable, e.g.:
  • Audio ‘Ba’
  • Visual ‘Ga’
  • Perception ‘Da’
• “Close Eyes – Illusion Vanishes”
• Raises Psychological Audio-Visual questions:
  • How are Auditory and Visual Stimuli combined?
  • Why combine when audio is enough?
Perceptual Analysis of Talking Heads: Some More McGurk Effect Examples
Perceptual Analysis of Talking Heads: McGurk Effect Examples (REAL)
Perceptual Analysis of Talking Heads: McGurk Effect Examples (ANSWERS)
Tuple: Bent/Vest/Vent Tuple: Mat/Dead/Gnat
Perceptual Analysis of Talking Heads: McGurk Effect Examples (Synthetic)
Perceptual Analysis of Talking Heads: McGurk Effect Examples (ANSWERS)
Synthetic Examples
Tuple: Fame/Face/Feign Tuple: Mat/Dead/Gnat
Perceptual Analysis of Talking Heads: Our McGurk Test
McGurk Perceptual Evaluation Test:
• Mix Real and Synthetic tuples.
• What word do you perceive?
• Users asked to note any differences
• NO PRIORS as to real/synthetic forced choice
  • User only asked about what they hear/perceive
• Best Viewing resolution
  • Tested different resolutions (72x75, 36x289, 720x576 pixels)
Perceptual Analysis of Talking Heads: Our McGurk Experimental Procedure
• Mix of Real and Synthetic McGurk Examples
  • Real examples are a control
• Users Presented with a series of 60 (30 real, 30 synthetic) random examples
• Users asked only to focus on the mouth area
• Two initial example “training” sequences (not in trial)
• Soundproofed booths with adjustable volume and artificial lighting
• Replay option for all examples
• Users simply record the word they perceive
• Users asked three questions after viewing all clips
  • “Did you notice anything about the videos that you can comment on?”
  • “Could you tell that some of the videos were computer generated?”
  • “Did you use the replay button at all?”
• 20 psychology undergrad test subjects (4 male/16 female) with normal hearing/vision
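One way to produce the randomly ordered series of 60 clips is sketched below. The slides do not say how the ordering was generated, nor whether it differed between participants, so the clip identifiers, the `seed` parameter and the function name `build_trial_order` are all assumptions.

```python
import random

def build_trial_order(real_clips, synthetic_clips, seed=None):
    """Return all clips in random order, each tagged with its hidden source.

    The source tag is kept for later analysis only; it would not be shown
    to the participant, who just records the word they perceive.
    """
    rng = random.Random(seed)  # a fixed seed makes the order reproducible
    trials = [(clip, "real") for clip in real_clips]
    trials += [(clip, "synthetic") for clip in synthetic_clips]
    rng.shuffle(trials)
    return trials

# Hypothetical clip identifiers: 30 real and 30 synthetic McGurk examples.
real = [f"real_{i:02d}" for i in range(30)]
synthetic = [f"synthetic_{i:02d}" for i in range(30)]
order = build_trial_order(real, synthetic, seed=1)
```

Seeding per participant would let each person see a different but reproducible order. Whether the original test software did this is not stated.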
Perceptual Analysis of Talking Heads: How is Our McGurk Test a Test?
• How is this a test?
  • Correct Lip Synch = McGurk Effect
  • Incorrect Lip Synch = Audio/Other
    • Audio should be dominant
• Questions Assess Behaviour/Output
  • After the test procedure, participants are asked whether they noticed anything unnatural
Perceptual Analysis of Talking Heads: Results
Four Types of Analysis of Results:
• Standard McGurk Response
  • From the tuples, form the accepted audio and accepted McGurk responses
  • Original McGurk observation
• Enhanced McGurk Response
  • Assemble a List of All Participants’ McGurk Responses
  • Allows for greater variability in accents/articulation
  • Allows for greater analysis and Improvement of Head Models
• Effects of Resolution on McGurk Effect
• End of Test Questions Analysis
  • General overall response, qualitative analysis
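The difference between the standard and enhanced scoring can be sketched as follows. The example assumes each tuple is ordered audio/visual/McGurk, by analogy with the Ba/Ga/Da example. The responses are made up for illustration and are not data from the study, and the extra accepted word "Nat" is hypothetical.

```python
from collections import Counter

def classify(response, audio, mcgurk_words):
    """Label one recorded word as 'audio', 'mcgurk' or 'other'."""
    word = response.strip().lower()
    if word == audio.lower():
        return "audio"
    if word in {w.lower() for w in mcgurk_words}:
        return "mcgurk"
    return "other"

def tally(responses, audio, mcgurk_words):
    """Count how many responses fall into each category for one tuple."""
    counts = Counter(classify(r, audio, mcgurk_words) for r in responses)
    return {label: counts.get(label, 0) for label in ("audio", "mcgurk", "other")}

# Tuple Mat/Dead/Gnat, read as audio/visual/McGurk (assumed ordering).
audio, _visual, mcgurk = "Mat", "Dead", "Gnat"
responses = ["gnat", "Mat", "Nat", "gnat", "dead"]  # illustrative, not study data

# Standard scoring accepts only the tuple's own McGurk word.
standard = tally(responses, audio, [mcgurk])
# Enhanced scoring also accepts variants gathered from participants;
# "Nat" is a hypothetical entry in that list.
enhanced = tally(responses, audio, [mcgurk, "Nat"])
```

With these made-up responses, the enhanced list moves one response from "other" to "mcgurk". That shift is the kind of accent or articulation variability the enhanced analysis is meant to absorb.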
Perceptual Analysis of Talking Heads: Standard McGurk Response
Perceptual Analysis of Talking Heads: Enhanced McGurk Response
Perceptual Analysis of Talking Heads: Image Resolution
Perceptual Analysis of Talking Heads: End of Test Questions Results
• “Notice anything to comment on?” Some audio didn’t match video
• “Could you tell some synthetic?” No, 1 participant = some unnatural?
• “Did you use replay?” Few = once, One = twice
Perceptual Analysis of Talking Heads: Overall Results Analysis
• Realistic behaviour
  • Most users were unaware of synthetic output
• More McGurk effects in real output
  • Points to some weakness in model
  • Good Synthesis of /F/, /D/, /S/, /A/ and /E/
  • Poor Synthesis of /V/
• Some weak real and synthetic McGurk responses
  • Beige-Gaze-Deige -> 2X Audio v McGurk
  • Mock-Dock-Knock -> 50:50 Audio:McGurk
• Resolution has effect on real only
  • Due to overall lower synthetic McGurk response
Conclusions
• Suggested a perceptual approach to analysis and development of a Talking Head
• Unbiased by prior forced choice making
• Insight into performance of algorithms
• Complements other tests
Perceptual Analysis of Talking Heads: Future Work
• Talking Head
  • Full Emotion
  • Performance Driven Animation
  • 3D Modelling
  • Full 3D appearance modelling
• Other perceptual tests
  • Longer videos – McGurk sentences
  • Real/Synthesised correct lip synch: McGurk = bad synch?
  • Emotion – A McGurk emotion test?
Web Links
• Paper Downloads
www.cs.cf.ac.uk/user/D.P.Cosker/publications.html
www.cs.cf.ac.uk/Dave/Publications.html
• McGurk Video Clips and McGurk Test Software (Macromedia Director)
www.cs.cf.ac.uk/user/D.P.Cosker/McGurk/