Automatic Lip-Synchronization Using Linear Prediction of Speech
Christopher Kohnert, S. K. Semwal
University of Colorado, Colorado Springs
Topics of Presentation
Introduction and Background
Linear Prediction Theory
Sound Signatures
Viseme Scoring
Rendering System
Results
Conclusions
Justification
Need: existing methods are labor intensive, expensive, and give poor results
Solution: an automatic method with "decent" results
Applications of an Automatic System
Typical applications benefiting from an automatic method:
Real-time video communication
Synthetic computer agents
Low-budget animation scenarios, e.g. the video game industry
Automatic Is Possible
Spoken words can be broken into phonemes
Phonemes are comprehensive
Visemes are their visual correlates
Visemes are used in lip-reading and traditional animation
Existing Methods of Synchronization
Text based: analyze text to extract phonemes
Speech based: volume tracking, speech-recognition front ends, linear prediction
Hybrids: text & speech, image & speech
Speech Based Is Best
Doesn't need a script
Fully automatic
Can use the original sound sample (best quality)
Can use the source-filter model
Source-Filter Model
Models a sound signal as a source passed through a filter
Source: lungs and vocal cords
Filter: vocal tract
Implemented using linear prediction
Speech-Related Topics
Phoneme recognition: how many phonemes to use?
Mapping phonemes to visemes: use visually distinctive ones (e.g. vowel sounds)
The coarticulation effect
The Coarticulation Effect
The blending of sounds based on adjacent phonemes (common in everyday speech)
Causes problems for discrete phoneme recognition
Results in poor visual synchronization (transitions are jerky and unnatural)
Speech Encoding Methods
Pulse Code Modulation (PCM)
Vocoding
Linear prediction
Pulse Code Modulation
Raw digital sampling
High quality sound
Very high bandwidth requirements
Vocoding
Short for "voice encoding"
Origins in military applications
Models physical entities (tongue, vocal cords, jaw, etc.)
Poor sound quality ("tin can" voices)
Very low bandwidth requirements
Linear Prediction
A hybrid of PCM and vocoding
Models the sound source and the filter separately
Uses the original sound sample to calculate recreation parameters (minimum error)
Low bandwidth requirements
Pitch and intonation independence
Linear Prediction Theory
Source-filter model
P coefficients are calculated
[Diagram: source passed through a filter]
Linear Prediction Theory (cont.)
The a_k coefficients are found by minimizing the error between the original sound (s_t) and the reconstructed sound (ŝ_t)
Can be solved using Levinson-Durbin recursion
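The Levinson-Durbin step can be sketched in a few lines of Python. This is not the authors' implementation; the autocorrelation formulation and the helper names are assumptions for illustration.

```python
import numpy as np

def autocorr(frame, p):
    """Autocorrelation lags r[0..p] of one (already windowed) frame."""
    return np.array([np.dot(frame[: len(frame) - k], frame[k:]) for k in range(p + 1)])

def levinson_durbin(r, p):
    """Solve the normal equations for the predictor s[t] ~ sum_k a[k]*s[t-k]
    by Levinson-Durbin recursion; returns (a[1..p], residual error)."""
    a = np.zeros(p + 1)   # a[0] is a placeholder so indices match a_k
    e = r[0]
    for i in range(1, p + 1):
        # Reflection coefficient for order i
        k = (r[i] - np.dot(a[1:i], r[i - 1:0:-1])) / e
        a_prev = a.copy()
        a[i] = k
        for j in range(1, i):
            a[j] = a_prev[j] - k * a_prev[i - j]
        e *= 1.0 - k * k  # prediction error shrinks each order
    return a[1:], e
```

For an ideal first-order signal with autocorrelation r[k] = 0.9^k, the recursion recovers a_1 = 0.9 and a_2 = 0 exactly, which is a handy sanity check.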
Linear Prediction Theory (cont.)
The coefficients represent the filter part
The filter is assumed constant for small "windows" on the original sample (10-30 ms windows)
Each window has its own coefficients
The sound source is either a pulse train (voiced) or white noise (unvoiced)
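The per-window analysis above can be sketched as follows; the Hamming window and the default 20 ms length are illustrative assumptions (the slides only say 10-30 ms).

```python
import numpy as np

def frames(signal, rate, win_ms=20.0):
    """Split a mono signal into fixed-length analysis windows; the filter
    is assumed constant within each window, so each frame gets its own
    set of a_k coefficients."""
    n = int(rate * win_ms / 1000.0)   # samples per window
    usable = (len(signal) // n) * n   # drop the ragged tail
    return signal[:usable].reshape(-1, n) * np.hamming(n)
```

At a 16 kHz sampling rate a 20 ms window is 320 samples, so a 1600-sample clip yields five frames.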
Linear Prediction for Recognition
Recognition on raw coefficients is poor
Better to FFT the values and take only the first "half" of the FFT'd values
This is the "signature" of the sound
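A minimal sketch of the signature step, assuming a 32-point FFT so the first half yields the 16 values mentioned on the next slide (the FFT size is an assumption, not stated in the slides):

```python
import numpy as np

def signature(coeffs, n_fft=32):
    """Build a sound 'signature' from LPC coefficients: FFT the values
    (zero-padded to n_fft) and keep only the magnitude of the first
    half of the bins."""
    spectrum = np.fft.fft(coeffs, n_fft)
    return np.abs(spectrum[: n_fft // 2])   # 16 values
```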
Sound Signatures
16 values represent the sound
Speaker independent
Unique for each phoneme
Easily recognized by machine
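"Easily recognized by machine" can be as simple as a nearest-neighbor match against stored reference signatures; the Euclidean metric and the phoneme labels here are illustrative assumptions, not details from the slides.

```python
import numpy as np

def classify(sig, reference):
    """Match a signature against a dict of phoneme -> reference signature
    by smallest Euclidean distance."""
    return min(reference, key=lambda ph: np.linalg.norm(sig - reference[ph]))
```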
Viseme Scoring
Phonemes were chosen judiciously to map one-to-one to visemes
Visemes are scored independently using history:
V_i = 0.9 * V_{i-1} + 0.1 * (1 if matched at frame i, else 0)
Ramps up and down with successive matches/mismatches
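The scoring rule above is an exponential moving average; one frame of it might be sketched as:

```python
def update_scores(scores, matched):
    """One frame of viseme scoring: V_i = 0.9*V_{i-1} + 0.1*(1 if this
    viseme matched at frame i, else 0). `scores` maps viseme name to its
    current score; `matched` is the viseme recognized in this window
    (None if nothing matched)."""
    return {v: 0.9 * s + 0.1 * (1.0 if v == matched else 0.0)
            for v, s in scores.items()}
```

A matched viseme ramps toward 1.0 over successive frames while the others decay toward 0, which is what smooths the jerky transitions mentioned earlier.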
Rendering System
Uses Alias|Wavefront's Maya package
Built-in support for "blend shapes", mapped directly to viseme scores
Very expressive and flexible
A script is generated and later read in
Rendered to a movie; QuickTime is used to add the original sound and produce the final movie
Results (Timing)
Precise timing can be achieved
Smoothing introduces "lag"
[Embedded video clips]
Results (Other Examples)
A female speaker using the male phoneme set [embedded video]
Slower speech, male speaker [embedded video]
Results (Other Examples) (cont.)
Accented speech at a fast pace [embedded video]
Results (Summary)
Good with basic speech
Good speaker independence (for normal speech)
Poor performance when speech is too fast, is accented, or contains phonemes not in the reference set (e.g. "w" and "th")
Conclusion
Linear prediction provides several benefits: speaker independence and easy automatic recognition
Results are reasonable, but can be improved
Future Work
Identify the best set of phonemes and visemes
Phoneme classification could be improved with a better matching algorithm (a neural net?)
A larger phoneme reference set for more robust matching
Results
Simple cases work very well
Timing is good and very responsive
Robust with respect to speaker: cross-gender, multiple male speakers
Fails on accents, fast speech, and unknown phonemes
Problems with noisy samples
Smoothing helps but introduces "lag"
End
Automatic Is Possible
Spoken words can be broken into phonemes
Phonemes are comprehensive
Visemes are their visual correlates
Visemes are used in lip-reading and traditional animation
Physical speech (vocal cords, vocal tract) can be modeled
Source-filter model
Sound Signatures (Speaker Independence)
Sound Signatures (For Phonemes)
[Embedded chart]
Results (Normal Speech)
Normal speech at a moderate pace [embedded video]