7/31/2019 Production of Emotional Speech
Gabriel Schubiner
Generation of Affect in Synthesized Speech
Corpus-Based Approach to Expressive Speech Synthesis
Expressive Visual Speech Using a Talking Head
Demos
Affect Editor Quiz/Demo
Synface Demo
Affect in Speech: Goals
Addition of emotion to synthetic speech
Acoustic model: typology of parameters of emotional speech
Quantification
Addresses problem of expressiveness
What benefit is gained from expressive
speech?
Emotion Theory: Assumptions
Emotion -> Nervous System -> Speech Output
Binary distinction: parasympathetic vs. sympathetic
based on physical changes
universal emotions
Approaches to Affect
Generative: Emotion -> Physical -> Acoustic
Descriptive: observed acoustic parameters imposed
Descriptive Framework
4 parameter groups:
Pitch
Timing
Voice Quality
Articulation
Assumption of independence
How could this affect design and
results?
Pitch & Timing
Accent Shape
Average Pitch
Contour Slope
Final Lowering
Pitch Range
Reference Line
Exaggeration (not used)
Voice Quality & Articulation
Breathiness
Brilliance
Loudness
Pause Discontinuity
Pitch Discontinuity
Tremor
Laryngealization
Implementation
Each parameter has its own scale
Scales are independent of one another and range between negative and positive values
Implementation
Settings grouped into preset conditions for each emotion
based on prior studies
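A minimal sketch of how Affect Editor-style presets might be stored and applied, assuming the parameters sit on a numeric scale with 0 as neutral. The parameter names follow the four groups above; the specific values are invented for illustration, not Cahn's published settings.

```python
# Illustrative sketch of Affect Editor-style emotion presets.
# Values on an assumed -10..+10 scale are made up for illustration.
NEUTRAL = 0

PRESETS = {
    "sadness": {"average_pitch": -5, "pitch_range": -8,
                "speech_rate": -6, "breathiness": 6},
    "anger":   {"average_pitch": 3, "pitch_range": 8,
                "speech_rate": 4, "loudness": 8},
}

def parameters_for(emotion, all_params):
    """Return a full parameter vector: preset values where given,
    neutral (0) elsewhere -- each scale is set independently."""
    preset = PRESETS.get(emotion, {})
    return {p: preset.get(p, NEUTRAL) for p in all_params}

params = parameters_for("sadness",
                        ["average_pitch", "pitch_range",
                         "speech_rate", "breathiness", "loudness"])
```

Because the scales are assumed independent, any parameter not named by a preset simply stays at neutral.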
Program Flow: Input
Emotion -> parameter representation
Utterance -> clauses
Agent, Action, Object, Locative
Clause and lexeme annotations
Finds all possible locations of affect and chooses whether or not to use each
Program Flow
Utterance -> tree structure -> linear phonology
Compiled for a specific synthesizer, with software to simulate effects not available in hardware
Perception
30 Utterances
5 sentences * 6 affects
Forced choice of one of six affects
magnitude and comments
Elicitation Sentences
I'm almost finished
I'm going to the city
I saw your name in the paper X
I thought you really meant it
Look at that picture
Pop Quiz!!!
Pop Quiz Solutions
I'm almost finished
Disgust : Surprise : Sadness : Gladness : Anger : Fear
I'm going to the city
Surprise : Gladness : Anger : Disgust : Sadness : Fear
I thought you really meant it
Anger : Disgust : Gladness : Sadness : Fear : Surprise
Look at that picture
Anger : Fear : Disgust : Sadness : Gladness : Surprise
Results
Approx. 50% recognition rate overall
91% for sadness
Conclusions
Effective?
Thoughts?
Corpus-Based Approach to Expressive Speech Synthesis
Corpus
Collect utterances in each emotion
Emotion-dependent semantics
One speaker
Good news, Bad news, Question
Model: Feature Vector
Features:
Lexical stress
Phrase-level stress
Distance from beginning of phrase
Distance from end of phrase
POS
Phrase type
End of syllable pitch
Model: Classification
Predicts F0 over a 5-syllable window
Uses feature vector to predict an observation vector
Observation vector: (log p, p)
p = end-of-syllable pitch
Decision Tree
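A hedged sketch of the decision-tree idea above: a hand-built toy tree routes a feature vector to a leaf whose stored observation vector (log p, p) gives the predicted end-of-syllable pitch. The feature names, split thresholds, and pitch values are invented for illustration, not taken from the paper's trained trees.

```python
import math

# Toy decision tree: internal nodes test one feature against a
# threshold; leaves hold an observation vector (log p, p) for the
# predicted end-of-syllable pitch. All splits/values illustrative.
TREE = {
    "feature": "lexical_stress", "threshold": 0.5,
    "left":  {"leaf": (math.log(120.0), 120.0)},   # unstressed
    "right": {"feature": "dist_from_phrase_end", "threshold": 2,
              "left":  {"leaf": (math.log(100.0), 100.0)},   # phrase-final fall
              "right": {"leaf": (math.log(180.0), 180.0)}},  # stressed, mid-phrase
}

def predict_pitch(tree, features):
    """Walk from the root to a leaf and return the predicted pitch p."""
    node = tree
    while "leaf" not in node:
        branch = "left" if features[node["feature"]] <= node["threshold"] else "right"
        node = node[branch]
    log_p, p = node["leaf"]
    return p

f0 = predict_pitch(TREE, {"lexical_stress": 1, "dist_from_phrase_end": 5})
```

The real model predicts over a syllable window from a much richer feature vector; this only shows the route-to-leaf mechanics.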
Model: Target Duration
Similar to predicting F0
Build tree with the goal of producing a Gaussian at the leaves
Use mean of the leaf class as target duration
Discretization
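The "Gaussian at the leaves" step above can be sketched as: group training durations by the leaf they reach, fit a (mean, variance) pair per leaf, and use the leaf mean as the synthesis target. The leaf ids and durations below are invented for illustration.

```python
from collections import defaultdict
from statistics import mean, pvariance

# Toy training data: (leaf_id, observed syllable duration in ms).
# In the real model the leaf is reached by a decision tree over
# features; here leaf ids and durations are illustrative.
observations = [
    ("stressed_final", 210), ("stressed_final", 190),
    ("stressed_final", 200), ("unstressed_mid", 80),
    ("unstressed_mid", 100),
]

by_leaf = defaultdict(list)
for leaf, dur in observations:
    by_leaf[leaf].append(dur)

# Fit a Gaussian (mean, variance) per leaf; the mean becomes the
# target duration for syllables routed to that leaf.
gaussians = {leaf: (mean(durs), pvariance(durs))
             for leaf, durs in by_leaf.items()}

target = gaussians["stressed_final"][0]   # mean duration of that leaf
```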
Models
Uses acoustic analogue of n-grams
Captures a sense of context, compared to describing the full emotion as a sequence
Compare to Affect Editor: uses only F0 and length
Includes information about which utterance the features are derived from
Intentional bias; justified?
Model: Synthesis
Data tagged with original expression and emotion
Expression-cost matrix
Noted trade-off: emotional intensity vs. smoothness
Paralinguistic events
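The expression-cost matrix can be sketched as a lookup penalizing units whose recorded expression differs from the requested one, with total unit cost trading expression match against join smoothness. The cost values and weights below are invented for illustration, not the paper's numbers.

```python
# Illustrative expression-cost matrix: cost of using a unit recorded
# in expression `src` when synthesizing expression `tgt`.
# Lower = better match; all values made up.
EXPR_COST = {
    ("neutral", "neutral"): 0.0, ("neutral", "good_news"): 0.5,
    ("neutral", "bad_news"): 0.5, ("good_news", "good_news"): 0.0,
    ("good_news", "neutral"): 0.4, ("good_news", "bad_news"): 1.0,
    ("bad_news", "bad_news"): 0.0, ("bad_news", "neutral"): 0.4,
    ("bad_news", "good_news"): 1.0,
}

def unit_cost(unit_expr, target_expr, join_cost, w_expr=1.0, w_join=1.0):
    """Total cost of a candidate unit: expression mismatch plus join
    discontinuity -- the intensity-vs-smoothness trade-off."""
    return w_expr * EXPR_COST[(unit_expr, target_expr)] + w_join * join_cost

# A neutral unit with a smooth join can beat an emotional unit
# with a discontinuous join:
emotional = unit_cost("good_news", "good_news", join_cost=0.8)
neutral = unit_cost("neutral", "good_news", join_cost=0.1)
```

Shifting the weights w_expr and w_join moves the output along the trade-off noted on the slide.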
SSML
Compare to Cahn's typology
Abstraction layers
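For comparison with Cahn's typology, SSML exposes prosody at a more abstract layer. A minimal sketch of an SSML 1.0 fragment built in Python; the standard `<prosody>` attributes pitch/rate/volume roughly parallel the pitch, timing, and voice-quality groups, and the attribute values here are illustrative.

```python
# Build a minimal SSML fragment using the standard <prosody>
# element (SSML 1.0). Attribute values are illustrative: a lowered,
# slowed, softened rendering, e.g. for a sad reading.
text = "I'm almost finished"
ssml = (
    '<speak version="1.0">'
    '<prosody pitch="-15%" rate="slow" volume="soft">'
    f"{text}"
    "</prosody>"
    "</speak>"
)
```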
Perception
Experiment
Distinguish same utterance spoken with neutral and affected prosody
Semantic content problematic?
Results
Binary decision
Reasonable gainover baseline?
Conclusion
Major contributions?
Paths forward?
Synthesis of Expressive Visual Speech on a Talking Head
Not these Talking Heads...
Synthesis Background
Manipulation of video images
Virtual model with deformation parameters
Synchronized with time-aligned transcription
Articulatory Control Model: Cohen & Massaro (1993)
Data
Single actor
Given specific emotion as instruction
6 emotions + neutral
Facial Animation Parameters
Face-independent
FAP matrix * scaling factor + position0
Weighted deformations of distance between vertices and feature points
Modeling
Phonetic segments assigned target parameter vectors
Temporal blending over dominance functions
Principal components
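The temporal blending over dominance functions (Cohen & Massaro, 1993) weights each segment's target by a dominance that decays with distance from the segment center; this sketch uses the commonly cited exponential form D(t) = a * exp(-theta * |t - center|^c) with invented constants and targets.

```python
import math

def dominance(t, center, alpha=1.0, theta=1.0, c=1.0):
    """Exponential dominance, peaking at the segment's center time."""
    return alpha * math.exp(-theta * abs(t - center) ** c)

def blended_target(t, segments):
    """Coarticulation blend: dominance-weighted average of segment
    targets. segments: list of (center_time, target_value)."""
    num = sum(dominance(t, c0) * target for c0, target in segments)
    den = sum(dominance(t, c0) for c0, _ in segments)
    return num / den

# Halfway between two segments with equal dominance parameters,
# the blended parameter is the average of the two targets:
mid = blended_target(0.5, [(0.0, 10.0), (1.0, 20.0)])
```

Nearer a segment's center its dominance, and hence its target, wins out, which is what produces smooth coarticulated parameter trajectories.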
ML
Separate models for each
emotion
6:1 training:testing ratio
Models -> PC trajectories -> FAP trajectories * emotion parameter matrix
Results
More extreme emotions easier to perceive
73% sad, 60% angry, 40% sad
Synface Demo
Discussion
Changes in approach from Cahn to Eide
Production compared to
Detection