
Automatic Lip-Synchronization Using Linear Prediction of Speech
Christopher Kohnert, SK Semwal
University of Colorado, Colorado Springs

Transcript
Page 1: Automatic Lip-Synchronization Using Linear Prediction of Speech. Christopher Kohnert, SK Semwal, University of Colorado, Colorado Springs.

Automatic Lip-Synchronization Using Linear Prediction of Speech

Christopher Kohnert
SK Semwal

University of Colorado, Colorado Springs

Page 2:

Topics of Presentation

- Introduction and Background
- Linear Prediction Theory
- Sound Signatures
- Viseme Scoring
- Rendering System
- Results
- Conclusions

Page 3:

Justification

Need:
- Existing methods are labor intensive
- Poor results
- Expensive

Solution:
- Automatic method
- "Decent" results

Page 4:

Applications of an Automatic System

Typical applications benefiting from an automatic method:
- Real-time video communication
- Synthetic computer agents
- Low-budget animation scenarios:
  - Video game industry

Page 5:

Automatic Is Possible

- Spoken word is broken into phonemes
- Phonemes are comprehensive
- Visemes are visual correlates
  - Used in lip-reading and traditional animation

Page 6:

Existing Methods of Synchronization

- Text based
  - Analyze text to extract phonemes
- Speech based
  - Volume tracking
  - Speech-recognition front end
  - Linear Prediction
- Hybrids
  - Text & speech
  - Image & speech

Page 7:

Speech Based Is Best

- Doesn't need a script
- Fully automatic
- Can use the original sound sample (best quality)
- Can use the source-filter model

Page 8:

Source-Filter Model

- Models a sound signal as a source passed through a filter
  - Source: lungs & vocal cords
  - Filter: vocal tract
- Implemented using Linear Prediction
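The source-filter idea can be sketched in a few lines of Python: an excitation signal (a pulse train for voiced sounds, white noise for unvoiced ones) is pushed through an all-pole filter standing in for the vocal tract. The two-coefficient filter and the pulse period below are illustrative assumptions, not parameters from the paper.

```python
import random

def pulse_train(n_samples, period):
    """Impulse train excitation: models the voiced source (vocal-cord pulses)."""
    return [1.0 if i % period == 0 else 0.0 for i in range(n_samples)]

def white_noise(n_samples, seed=0):
    """Random excitation: models the unvoiced source (turbulent airflow)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]

def all_pole_filter(excitation, a):
    """Vocal-tract filter: s[t] = e[t] + sum_k a[k] * s[t - 1 - k]."""
    s = []
    for t, e in enumerate(excitation):
        acc = e
        for k, ak in enumerate(a):
            if t - 1 - k >= 0:
                acc += ak * s[t - 1 - k]
        s.append(acc)
    return s

# A voiced sound: pulse train through a toy 2-pole filter.
voiced = all_pole_filter(pulse_train(80, period=20), a=[1.2, -0.6])
```

Swapping `pulse_train` for `white_noise` with the same filter gives the unvoiced version of the same "vocal tract" shape.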

Page 9:

Speech-Related Topics

- Phoneme recognition
  - How many to use?
- Mapping phonemes to visemes
  - Use visually distinctive ones (e.g. vowel sounds)
- Coarticulation effect

Page 10:

The Coarticulation Effect

- The blending of sounds based on adjacent phonemes (common in everyday speech)
- An artifact of discrete phoneme recognition
- Causes poor visual synchronization (transitions are jerky and unnatural)

Page 11:

Speech Encoding Methods

- Pulse Code Modulation (PCM)
- Vocoding
- Linear Prediction

Page 12:

Pulse Code Modulation

- Raw digital sampling
- High-quality sound
- Very high bandwidth requirements

Page 13:

Vocoding

- Stands for VOice enCODing
- Origins in military applications
- Models physical entities (tongue, vocal cords, jaw, etc.)
- Poor sound quality ("tin can" voices)
- Very low bandwidth requirements

Page 14:

Linear Prediction

- Hybrid method (between PCM and vocoding)
- Models sound source and filter separately
- Uses the original sound sample to calculate recreation parameters (minimum error)
- Low bandwidth requirements
- Pitch and intonation independence
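The bandwidth gap can be made concrete with rough arithmetic. The figures below are illustrative assumptions (CD-quality PCM; 16 filter parameters at 16 bits per 20 ms window); the slides themselves only state the 10-30 ms window range and that the requirements are "low".

```python
# PCM: every sample is transmitted directly.
sample_rate = 44_100      # Hz (CD-quality assumption)
bits_per_sample = 16
pcm_bps = sample_rate * bits_per_sample          # 705,600 bits/s

# Linear Prediction: only filter parameters per analysis window.
window_s = 0.020          # 20 ms window (within the 10-30 ms range)
params_per_window = 16    # filter coefficients (pitch/gain ignored here)
bits_per_param = 16
lpc_bps = (params_per_window * bits_per_param) / window_s

print(pcm_bps, lpc_bps)   # LPC needs roughly 2% of the PCM bit rate
```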

Page 15:

Linear Prediction Theory

- Source-filter model
- P filter coefficients are calculated

[Diagram: source signal passed through a filter]

Page 16:

Linear Prediction Theory (cont.)

- The a_k coefficients are found by minimizing the error between the original sound (s_t) and the reconstructed sound (ŝ_t)
- Can be solved using Levinson-Durbin recursion
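A standard route to the coefficients: autocorrelate the analysis window, then run the Levinson-Durbin recursion on the resulting Toeplitz system. This sketch uses the convention s[t] + Σ a_k·s[t-k] = e[t] (sign conventions vary between texts) and is a minimal illustration, not the paper's implementation.

```python
def autocorrelation(signal, max_lag):
    """R[k] = sum_t s[t] * s[t + k] over one analysis window."""
    n = len(signal)
    return [sum(signal[t] * signal[t + k] for t in range(n - k))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the Toeplitz normal equations for the prediction coefficients."""
    a = [0.0] * (order + 1)
    a[0] = 1.0
    err = r[0]                              # prediction-error energy
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err                      # reflection coefficient
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= (1.0 - k * k)
    return a[1:], err                       # coefficients, residual energy
```

On a decaying signal s[t] = 0.5·s[t-1], an order-1 fit recovers a coefficient close to -0.5, as expected under this sign convention.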

Page 17:

Linear Prediction Theory (cont.)

- The coefficients represent the filter part
- The filter is assumed constant within small "windows" of the original sample (10-30 ms windows)
- Each window has its own coefficients
- The sound source is either a pulse train (voiced) or white noise (unvoiced)
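The windowing step above can be sketched directly. Splitting into overlapping windows with a 10 ms hop is my assumption for the sketch; the slides only give the 10-30 ms window size.

```python
def frames(samples, sample_rate, window_ms=20, hop_ms=10):
    """Split a signal into short overlapping windows; the vocal-tract
    filter is assumed fixed within each one."""
    win = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * hop_ms / 1000)
    return [samples[i:i + win]
            for i in range(0, len(samples) - win + 1, hop)]

# One second at 8 kHz, 20 ms windows, 10 ms hop.
chunks = frames([0.0] * 8000, 8000, 20, 10)
```

Each element of `chunks` would then get its own autocorrelation and Levinson-Durbin pass, yielding per-window coefficients.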

Page 18:

Linear Prediction for Recognition

- Recognition on the raw coefficients is poor
- Better to FFT the values
- Take only the first "half" of the FFT'd values
- This is the "signature" of the sound
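The signature step can be sketched as follows. A naive DFT is used here for clarity rather than an FFT library, and feeding in 32 coefficients so the half-spectrum has 16 values is my assumption about how the counts line up.

```python
import cmath
import math

def sound_signature(coeffs):
    """DFT of one window's LPC coefficients; the magnitudes of the first
    half of the bins serve as the recognition 'signature'."""
    n = len(coeffs)
    spectrum = [sum(c * cmath.exp(-2j * math.pi * k * t / n)
                    for t, c in enumerate(coeffs))
                for k in range(n)]
    return [abs(x) for x in spectrum[: n // 2]]

sig = sound_signature([1.0, -0.5, 0.25, -0.125] * 8)   # 32 coefficients in
```

Only half the bins are kept because the input is real-valued, so the spectrum is conjugate-symmetric and the second half carries no extra information.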

Page 19:

Sound Signatures

- 16 values represent the sound
- Speaker independent
- Unique for each phoneme
- Easily recognized by machine
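"Easily recognized by machine" might look like the sketch below: a nearest-neighbour match against stored reference signatures. The Euclidean metric and the toy 2-value signatures are my assumptions; the slides do not specify the distance measure used.

```python
def match_phoneme(signature, references):
    """Pick the reference phoneme whose stored signature has the smallest
    Euclidean distance to this window's signature."""
    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
    return min(references, key=lambda name: dist(signature, references[name]))

refs = {"aa": [1.0, 0.2], "iy": [0.1, 0.9]}   # toy 2-value signatures
best = match_phoneme([0.9, 0.3], refs)        # closest to "aa"
```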

Page 20:

Viseme Scoring

- Phonemes were chosen judiciously
  - Map one-to-one to visemes
- Visemes are scored independently using history:
  - V_i = 0.9 * V_{i-1} + 0.1 * (1 if matched at i, else 0)
- Ramps up and down with successive matches/mismatches
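The history update on this slide can be sketched directly. Starting the score at zero is my assumption; the recursion itself is the one given above.

```python
def viseme_scores(matches, decay=0.9):
    """Exponential history from the slide:
    V_i = decay * V_{i-1} + (1 - decay) * m_i,
    where m_i is 1 when this viseme's phoneme matched at window i."""
    v, out = 0.0, []
    for m in matches:
        v = decay * v + (1.0 - decay) * (1.0 if m else 0.0)
        out.append(v)
    return out

scores = viseme_scores([1, 1, 1, 0, 0])   # ramps up, then decays
```

The decay of 0.9 means a single spurious match (or miss) moves the mouth shape only slightly, which is what smooths the jerky transitions caused by discrete phoneme decisions.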

Page 21:

Rendering System

- Uses Alias|Wavefront's Maya package
- Built-in support for "blend shapes"
  - Mapped directly to viseme scores
  - Very expressive and flexible
- Script generated and later read in
- Rendered to movie; QuickTime used to add in the original sound and produce the final movie
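The "script generated and later read in" step might look like the sketch below. The file format here (one CSV row of blend-shape weights per frame) is entirely hypothetical; the actual Maya script format the paper used is not described in these slides.

```python
def write_blendshape_script(path, scores_per_frame, viseme_names):
    """Hypothetical exporter: one line per frame listing each viseme's
    blend-shape weight, to be parsed by a renderer-side import script."""
    with open(path, "w") as f:
        f.write("frame," + ",".join(viseme_names) + "\n")
        for frame, scores in enumerate(scores_per_frame):
            f.write(f"{frame}," + ",".join(f"{s:.3f}" for s in scores) + "\n")

# Two frames of weights for two (assumed) viseme targets.
write_blendshape_script("weights.csv", [[0.1, 0.0], [0.3, 0.2]], ["aa", "iy"])
```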

Page 22:

Results (Timing)

- Precise timing can be achieved
- Smoothing introduces "lag"


Page 23:

Results (Other Examples)

- A female speaker using the male phoneme set


- Slower speech, male speaker


Page 24:

Results (Other Examples) (cont.)

- Accented speech with a fast pace


Page 25:

Results (Summary)

- Good with basic speech
- Good speaker independence (for normal speech)
- Poor performance when speech:
  - Is too fast
  - Is accented
  - Contains phonemes not in the reference set (e.g. "w" and "th")

Page 26:

Conclusion

- Linear Prediction provides several benefits:
  - Speaker independence
  - Easy to recognize automatically
- Results are reasonable, but can be improved

Page 27:

Future Work

- Identify the best set of phonemes and visemes
- Phoneme classification could be improved with a better matching algorithm (neural net?)
- Larger phoneme reference set for more robust matching

Page 28:

Results

- Simple cases work very well
  - Timing is good and very responsive
- Robust with respect to speaker
  - Cross-gender, multiple male speakers
- Fails on: accents, speed, unknown phonemes
- Problems with noisy samples
  - Can be smoothed, but smoothing introduces "lag"

Page 29:

End

Page 30:

Automatic Is Possible

- Spoken word is broken into phonemes
- Phonemes are comprehensive
- Visemes are visual correlates
  - Used in lip-reading and traditional animation
- Physical speech (vocal cords, vocal tract) can be modeled
  - Source-filter model

Page 31:

Sound Signatures (Speaker Independence)

Page 32:

Sound Signatures (For Phonemes)


Page 33:

Results (Normal Speech)

- Normal speech, moderate pace


