Page 1: Zeynep Inanoglu  Machine Intelligence Laboratory CU Engineering Department

Toshiba Update 14/09/2005

Zeynep Inanoglu, Machine Intelligence Laboratory, CU Engineering Department
Supervisor: Prof. Steve Young

A Statistical Approach To Emotional Prosody Generation

Page 2

Agenda

Previous Toshiba Update
A Review of Emotional Speech Synthesis
Motivation for Proposed Approach
Proposed Approach: Intonation Generation from Syllable HMMs
– Intonation Models and Training
– Recognition Performance of Intonation Units
– Intonation Synthesis from HMMs
– MLLR-based Intonation Adaptation
– Perceptual Tests
Summary and Future Direction

Page 3

Previous Toshiba Update: A Brief Review

Emotion Recognition
– Demonstrated work on HMM-based emotion detection in voicemail messages (Emotive Alert).
– Reported the set of acoustic features that maximizes classification accuracy for each emotion type, identified using the sequential forward floating selection algorithm.

Expressive Speech Synthesis
– Demonstrated the importance of prosody in emotional expression through copy-synthesis of emotional prosody onto neutral utterances.
– Suggested linguistically descriptive intonation units (accents, boundary tones) for prosody modelling.

Page 4

A Review of Emotional Synthesis

The importance of prosody in emotional expression has been confirmed (Banse & Scherer, 1996; Mozziconacci, 1998).

The available prosody rules are mainly defined for global parameters. (mean pitch, pitch range, speaking rate, declination)

Interaction of linguistic units and emotion is largely untested. (Banziger, 2005)

Strategies for emotional synthesis vary based on the type of synthesizer.
– Formant synthesis allows control over various segmental and prosodic parameters. Emotional prosody rules extracted from the literature are applied by modifying neutral synthesizer parameters (Cahn, 1990; Burkhardt, 2000; Murray & Arnott, 1995).
– Diphone synthesis allows prosody control by defining target contours and durations based on emotional prosody rules (Schroeder, 2004; Burkhardt, 2005).
– Unit-selection synthesis provides minimal parametric flexibility. Attempts at emotional expression involve recording entire unit databases for each emotion and selecting units from the appropriate database at run time (Iida et al., 2003).
– HMM synthesis allows spectral and prosodic control at the segmental level and provides a statistical framework for modelling emotions (Tsuzuki et al., 2004).

Page 5

A Review of Emotional Synthesis

[Figure: synthesis approaches arranged by METHOD (statistical, rule-based, unit replication) against GRANULARITY (segmental, global, intonational: syllable/phrase):
– Formant / diphone synthesis (rule-based): only as good as the hand-crafted rules; poor-to-medium baseline quality.
– Unit-selection synthesis (unit replication): very good quality, but not scalable: too much effort.
– HMM synthesis (statistical): statistical, but too granular for prosody modelling.
– The statistical, intonational (syllable/phrase) cell is unexplored.]

Page 6

Motivation for Proposed Approach [1]

We propose a generative model of prosody.
– We envision evaluating this prosodic model in a variety of synthesis contexts through signal-manipulation schemes such as TD-PSOLA.

Statistical
– Rule-based systems are only as good as their hand-crafted rules. Why not learn the rules from data?
– HMM methods have been successful in speech synthesis.

Syllable-based
– Pitch movements are most relevant at the syllable or intonational-phrase level. However, the effects of emotion on contour shapes and linguistic units are largely unexplored.

Linguistic units of intonation
– The coupling of emotion and linguistic phenomena has not been investigated.

[1] This work will be published in the Proceedings of ACII, October 2005, Beijing.

Page 7

Overview

[Figure: system overview. Neutral speech data, with syllable boundaries and syllable labels (e.g. "1 1.5 c / 1.5 1.9 a / 1.9 2.3 c / 2.3 2.5 rb / …"), is used to train context-sensitive syllable HMMs. MLLR adaptation with emotion data and a mean-pitch shift yields emotion HMMs. F0 generation then produces a synthesized contour from phonetic labels, and TD-PSOLA transplants it onto an utterance.]

Step 1: Train intonation models on neutral data.
Step 2: Generate intonation contours from the HMMs.
Step 3: Adapt the models given a small amount of emotion data.
Step 4: Transplant the contour onto an utterance.

The current focus is on pitch modelling only, using syllable-based intonation models.

Page 8

Intonation Models and Training

Basic models
– Seven basic models: A (accent), C (unstressed), RB (rising boundary), FB (falling boundary), ARB, AFB, SIL.

Context-sensitive models
– Tri-unit models (preceding and following intonation unit).
– Full-context models (position of the syllable in the intonational phrase, forward counts of accents and boundary tones in the IP, position of the vowel in the syllable, number of phones in the syllable).
– Decision-tree-based parameter tying was performed for the context-sensitive models.

Data: Boston Radio Corpus.
Features: normalized raw F0 and energy values as well as their differentials.
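As a rough sketch of what such label names can look like, the snippet below composes tri-unit and full-context names in the spirit of HTK-style context-dependent labels. The separator characters and field layout here are illustrative assumptions, not the exact scheme used in the experiments.

```python
# Sketch of tri-unit and full-context label construction for syllable
# intonation models. The separators and field layout are illustrative
# assumptions, in the spirit of HTK context-dependent label names.

def tri_unit_label(prev, unit, nxt):
    """e.g. 'c-a+rb': an accent preceded by an unstressed syllable and
    followed by a rising boundary."""
    return f"{prev}-{unit}+{nxt}"

def full_context_label(prev, unit, nxt, pos_in_ip, accents_ahead,
                       boundaries_ahead, n_phones):
    """Tri-unit name extended with positional and count features."""
    return (f"{prev}-{unit}+{nxt}"
            f":pos={pos_in_ip}:acc={accents_ahead}"
            f":bnd={boundaries_ahead}:len={n_phones}")

print(tri_unit_label("c", "a", "rb"))
print(full_context_label("c", "a", "rb", 2, 1, 1, 3))
```

Decision-tree tying then clusters the states of these full-context models by asking questions about the individual fields.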

Page 9

Recognition Results

Evaluation of models was performed in a recognition framework to assess how well the models represent intonation units and to quantify the benefits of incorporating context.

A held-out test set was used for predicting intonation sequences.

Basic models were tested with varying numbers of mixture components, and their accuracy was compared with that of the full-context models.

Basic label set (7 models, 3 emitting states, N-component GMMs):

  N=1:  %Corr = 53.26, %Acc = 44.52
  N=2:  %Corr = 53.36, %Acc = 45.48
  N=4:  %Corr = 54.65, %Acc = 46.31
  N=10: %Corr = 59.58, %Acc = 50.40

Full-context label set with decision-tree-based tying:

  %Corr = 64.02, %Acc = 55.88
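%Corr and %Acc are the usual HTK-style scores: with N reference labels and D deletions, S substitutions, and I insertions in a minimum-edit-distance alignment, %Corr = 100(N−D−S)/N and %Acc = 100(N−D−S−I)/N. A minimal sketch of how they can be computed:

```python
# Sketch: HTK-style %Correct and %Accuracy for a recognized label
# sequence, via a minimum-edit-distance alignment against the reference.

def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) of a minimal alignment."""
    n, m = len(ref), len(hyp)
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        d[i][0] = i
    for j in range(1, m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # Backtrace, counting error types.
    S = D = I = 0
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            S += ref[i - 1] != hyp[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and d[i][j] == d[i - 1][j] + 1:
            D += 1
            i -= 1
        else:
            I += 1
            j -= 1
    return S, D, I

def corr_acc(ref, hyp):
    """HTK-style (%Corr, %Acc): insertions penalize %Acc but not %Corr."""
    S, D, I = align_counts(ref, hyp)
    N = len(ref)
    return 100 * (N - D - S) / N, 100 * (N - D - S - I) / N

# One inserted 'c': %Corr stays 100, %Acc drops to 75.
print(corr_acc(["a", "c", "a", "rb"], ["a", "c", "a", "c", "rb"]))
```

This is why %Acc is consistently below %Corr in the table above: spurious inserted intonation units count against it.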

Page 10

Intonation Synthesis from HMM

The goal is to generate an optimal sequence of observations O directly from the syllable HMMs, given the intonation models λ:

    O_max = argmax_O P(O | λ),  where  P(O | λ) = Σ_{all Q} P(O | Q, λ) P(Q | λ)

The optimal state sequence Q_max is predetermined by basic duration models, so the parameter-generation problem becomes:

    O_max = argmax_O P(O | λ, Q_max)

The solution is the sequence of mean vectors for the state sequence Q_max.

We used the cepstral parameter-generation algorithm of the HTS system for interpolated F0 generation (Tokuda et al., 1995). Differential F0 features (Δf and ΔΔf) are used as constraints in contour generation; maximization is done for the static parameters only.

Page 11

Intonation Synthesis from HMM (continued)

A single observation vector consists of static and dynamic features:

    o_t = [f_t^T, Δf_t^T, ΔΔf_t^T]^T

The dynamic features are computed from the static features with window functions w:

    Δf_t  = Σ_{τ=-L^(1)}^{L^(1)} w^(1)(τ) f_{t+τ}
    ΔΔf_t = Σ_{τ=-L^(2)}^{L^(2)} w^(2)(τ) f_{t+τ}

This relationship can be expressed in matrix form as O = WF, where O is the sequence of full feature vectors, F is the sequence of static features only, and W is the matrix form of the window functions. The maximization problem then becomes:

    F_max = argmax_F P(WF | λ, Q_max)

The solution is a set of equations that can be solved in a time-recursive manner (Tokuda et al., 1995).
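As a concrete illustration, the closed-form solution above can be sketched in a few lines of NumPy for a scalar F0 stream: build W from the delta windows, then solve the normal equations (WᵀΣ⁻¹W)F = WᵀΣ⁻¹μ. The window coefficients, the edge handling, and the toy state means below are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

# Sketch of ML static-parameter generation under delta constraints
# (after Tokuda et al., 1995) for a scalar F0 stream. The windows and
# the toy means below are illustrative assumptions.

def build_W(T, dwin=(-0.5, 0.0, 0.5), awin=(1.0, -2.0, 1.0)):
    """3T x T matrix mapping static contour F to stacked [f, Δf, ΔΔf];
    frame indices outside [0, T) are clamped to the edges."""
    W = np.zeros((3 * T, T))
    for t in range(T):
        W[3 * t, t] = 1.0                        # static row
        for row, win in ((1, dwin), (2, awin)):  # delta / delta-delta rows
            for k, w in enumerate(win):
                tau = min(max(t + k - 1, 0), T - 1)
                W[3 * t + row, tau] += w
    return W

def generate(mu, var):
    """Solve (W' S^-1 W) F = W' S^-1 mu for the static contour F, where
    mu/var are the stacked per-frame [f, Δf, ΔΔf] means and variances
    read off the optimal state sequence."""
    T = len(mu) // 3
    W = build_W(T)
    s_inv = 1.0 / np.asarray(var, float)
    A = W.T @ (s_inv[:, None] * W)   # W' S^-1 W
    b = W.T @ (s_inv * mu)           # W' S^-1 mu
    return np.linalg.solve(A, b)

# Toy example: five frames whose static means rise; delta means are zero,
# so the generated contour is a smoothed version of the static targets.
mu = np.array([[100.0 + 5 * t, 0.0, 0.0] for t in range(5)]).ravel()
F = generate(mu, np.ones(15))
```

The direct solve shown here is O(T³); the time-recursive solution cited above exploits the band structure of WᵀΣ⁻¹W to do the same job in O(T).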

Page 12

Intonation Synthesis from HMM

[Figure: the system overview diagram from Page 7, repeated.]

Page 13

Perceptual Effects of Intonation Units

[Figure: two generated contours for the same utterance with different label sequences, "a a c c c c" vs. "a a c a c fb".]

Page 14

Pitch Contour Samples: Generated Neutral Contours Transplanted onto Unseen Utterances

[Figure: original vs. synthesized F0 contours for the tri-unit and full-context models.]

Page 15

MLLR Adaptation to Emotional Speech

Maximum Likelihood Linear Regression (MLLR) adaptation computes a set of linear transformations for the mean and variance parameters of a continuous HMM.

The number of transforms is determined by a regression tree and a threshold for what counts as "enough" adaptation data.

The adaptation data come from the Emotional Prosody Corpus, which consists of four-syllable phrases in a variety of emotions. Happy and sad speech were chosen for this experiment.

[Figure: a binary regression tree with numbered nodes (1-7); when a node has too little adaptation data, the transformation from its parent node is used.]
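A minimal sketch of the mean-transform idea: with a single global transform, each adapted mean is μ̂ = W[1; μ] for an extended-mean vector. The sketch below estimates W by occupancy-weighted least squares against per-state means of the adaptation data; the real ML estimator also weights by the model covariances, and all numbers here are invented for illustration.

```python
import numpy as np

# Sketch of a single global MLLR mean transform, mu_hat = W @ [1; mu],
# estimated by occupancy-weighted least squares against per-state means
# of the adaptation data. The true ML estimate also weights by the model
# covariances; the data below are invented for illustration.

def estimate_mllr_mean_transform(means, obs_means, gammas):
    """means: (S, d) neutral HMM means; obs_means: (S, d) adaptation-data
    means per state; gammas: (S,) state occupancies. Returns W: (d, d+1)."""
    S, d = means.shape
    X = np.hstack([np.ones((S, 1)), means])          # extended means [1, mu]
    w = np.sqrt(np.asarray(gammas, float))[:, None]  # occupancy weights
    sol, *_ = np.linalg.lstsq(w * X, w * obs_means, rcond=None)
    return sol.T

def adapt(W, mu):
    """Apply the transform to one neutral mean vector."""
    return W @ np.concatenate([[1.0], mu])

# Toy: "sad" speech lowers and compresses pitch: target = 0.8 * mu - 10.
rng = np.random.default_rng(0)
means = rng.uniform(80.0, 200.0, size=(20, 1))
W = estimate_mllr_mean_transform(means, 0.8 * means - 10.0, np.ones(20))
```

With a regression tree, one such W is estimated per tree node that has enough occupancy, and states under data-poor nodes inherit the parent's transform, exactly as in the figure above.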

Page 16

MLLR Adaptation To Happy & Sad Data

[Figure: F0 contours generated by the neutral, sad, and happy models for the label sequence "arb c c c c c".]

Page 17

Perceptual Tests

Test 1: How natural are the neutral contours?

Ten listeners were asked to rate utterances in terms of naturalness of intonation. Some utterances were unmodified and others had synthetic contours.

A t-test (p < 0.05) showed that the rating distributions for the two hidden groups overlap sufficiently, i.e. there is no significant difference in terms of quality.
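For reference, the statistic behind such a comparison can be sketched with Welch's two-sample t-test, which does not assume equal variances across the two rating groups. The ratings below are invented for illustration and are not the experiment's data.

```python
import numpy as np

# Sketch: Welch's two-sample t statistic of the kind used to compare
# naturalness ratings of unmodified vs. synthetic-contour utterances.
# The 1-5 ratings below are invented for illustration only.

def welch_t(x, y):
    """Return (t, degrees_of_freedom) for Welch's unequal-variance t-test."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    vx, vy = x.var(ddof=1) / len(x), y.var(ddof=1) / len(y)
    t = (x.mean() - y.mean()) / np.sqrt(vx + vy)
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (len(x) - 1) + vy ** 2 / (len(y) - 1))
    return t, df

unmodified = [4, 5, 3, 4, 4, 5, 3, 4, 4, 5]   # hypothetical ratings
synthetic = [4, 4, 3, 5, 4, 4, 3, 4, 5, 4]
t, df = welch_t(unmodified, synthetic)
```

A |t| this small against ~18 degrees of freedom is far from any conventional significance threshold, which is the shape of the "no significant difference" result reported above.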

Page 18

Perceptual Tests

Test 2: Does adaptation work?

The goal is to find out whether the adapted models produce contours that listeners perceive as more emotional than the neutral contours. Given pairs of utterances, 14 listeners were asked to identify the happier or sadder one.

Page 19

Perceptual Tests

Utterances with sad contours were identified 80% of the time; this result was significant (p < 0.01).

Listeners formed a bimodal distribution in their ability to detect happy utterances: overall, only 46% of the happy contours were identified as happier than neutral. (The difficulty of conveying a "smiling voice" through prosody alone is well known in the literature.)

The happy models worked better on utterances with more accents and rising boundaries: the organization of the labels matters.
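A significance claim of this kind can be checked with an exact one-sided binomial (sign) test against chance-level (50%) identification. The per-listener trial counts below are an assumption for illustration, since the slide reports only the 80% figure.

```python
from math import comb

# Sketch: an exact one-sided binomial (sign) test of the kind that
# supports a p < 0.01 claim. The trial counts below are assumptions
# for illustration; the slide reports only the 80% figure.

def binom_p_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the chance of k or more correct
    identifications if listeners were guessing at random."""
    return sum(comb(n, i) * p ** i * (1 - p) ** (n - i) for i in range(k, n + 1))

n_pairs, n_correct = 70, 56   # e.g. 14 listeners x 5 pairs, 80% correct
p_value = binom_p_at_least(n_correct, n_pairs)
```

At 80% correct over any reasonable number of pairs, this p-value falls far below 0.01, while 46% correct (the happy condition) is indistinguishable from guessing.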

Page 20

Summary and Future Direction

A statistical approach to prosody generation was proposed, with an initial focus on F0 contours.

The results of the perceptual tests were encouraging and yielded guidelines for future work:
– Bypass the use of perceptual labels; use lexical-stress information as a prior in automatic labelling of corpora.
– Investigate the role of emotion in accent frequency to come up with a "language model" of emotion.
– Duration modelling: evaluate the HSMM framework as well as duration adaptation using vowel-specific conversion functions.
– Voice-source modelling: treat LF parameters as part of prosody.
– Investigate the use of graphical models to allow hierarchical constraints on generated parameters.
– Incorporate the framework into one or more TTS systems.

