Speech synthesis
Text-to-speech synthesis
• The automatic transformation from (electronic) text to speech
• The "speaker" is defined by the system design
– Single speaker
• In contrast to speech recognition, the aim is not to handle all speakers and all normal pronunciation variants, but to render one spoken realization of the text that is perceived as natural and intelligible
• A text contains orthographic words, numbers, abbreviations, mnemonics and punctuation
– Linguistic analysis of the text is necessary to
• interpret symbols
• analyze the grammatical structure
• infer the semantic interpretation of the text
Speech Synthesis Torbjørn Svendsen
The processing steps of TTS
• Text analysis: text normalization; analysis of document structure; linguistic analysis
– Output: tagged text
• Phonemic analysis: homograph disambiguation, morphological analysis, letter-to-sound mapping
– Output: tagged phone sequence
• Prosodic analysis: intonation; duration; volume
– Output: control sequence, tagged phones
• Speech synthesis: voice rendering
– Output: synthetic speech
(Diagram: Text → Text analysis → Phonemic analysis → Prosodic analysis → Speech synthesis → Speech)
Text-to-speech synthesis
• Speech synthesis concerns the waveform generation from the annotated symbol sequence (typically a phone sequence)
• Philosophy: rule-based vs. data-driven synthesis
• Method: articulatory synthesis; formant synthesis; concatenative (waveform) synthesis
Quality
• Different strategies give different quality, but also different consistency of quality
• No strategy can currently provide consistently high quality (but it is getting closer)
• Limited-domain synthesis gives high quality within the application domain
(Figure: example systems – Voder, Klattalk, Infovox, Festival, NextGen)
The synthesis space
(Figure: the synthesis space, spanned by cost, bit rate, naturalness, intelligibility, speech knowledge, flexibility, processing needs, units, vocabulary and complexity; adapted from Granström)
Main methods
• Formant synthesis
• Concatenative, or waveform, synthesis
• Articulatory synthesis
Formant synthesis
• Normally a rule-based (knowledge-driven) system, but can also be data driven
• Each formant is specified by its center frequency, bandwidth and, optionally, amplitude
(Diagram: annotated phones → rule system → formant tracks + pitch contour → formant synthesis → synthetic speech)
H_i(z) = 1 / (1 − 2 e^(−π b_i T) cos(2π f_i T) z^(−1) + e^(−2π b_i T) z^(−2))

2nd-order filter with resonance at f_i and bandwidth b_i (T is the sampling interval)
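The resonator above is easy to realize as a difference equation. A minimal sketch in plain NumPy; the `resonate` helper, the 100 Hz excitation and the three formant frequency/bandwidth pairs are illustrative assumptions, not values from the slides:

```python
import numpy as np

def resonator_coeffs(f, b, fs):
    """Denominator coefficients of H(z) = 1/(1 + a1*z^-1 + a2*z^-2)
    for a resonance at f Hz with bandwidth b Hz at sample rate fs."""
    T = 1.0 / fs
    a1 = -2.0 * np.exp(-np.pi * b * T) * np.cos(2.0 * np.pi * f * T)
    a2 = np.exp(-2.0 * np.pi * b * T)
    return a1, a2

def resonate(x, f, b, fs):
    """Run the resonator as a difference equation:
    y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    a1, a2 = resonator_coeffs(f, b, fs)
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] -= a1 * y[n - 1]
        if n >= 2:
            y[n] -= a2 * y[n - 2]
    return y

# Cascade implementation: three formant resonators in series over a
# 100 Hz pulse train, giving a crude vowel-like output
fs = 8000
x = np.zeros(800)
x[::fs // 100] = 1.0
for f_i, b_i in [(500, 60), (1500, 90), (2500, 120)]:
    x = resonate(x, f_i, b_i, fs)
```

Cascading the sections this way matches the cascade structure discussed on the implementation slide; a parallel structure would instead sum the resonator outputs with per-formant amplitudes.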
Implementation
• Cascade or parallel implementation
• Voiced sounds typically use the cascade implementation, unvoiced sounds the parallel implementation
• LPC filters can also be used
– Normally poorer quality
Klatt formant synthesizer
LPC synthesis
• Example:
– Original
– LPC-coded, all unvoiced / all voiced
– Speed: halved / doubled
– Pitch: halved / doubled
– Melody
(Diagram: pulse generator / noise generator excitation with gain G driving the LPC synthesis filter 1/A(z))
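The source-filter structure in the diagram can be sketched directly: a pulse train or noise excitation driven through an all-pole filter. The first-order A(z) coefficients, gains and signal lengths below are toy placeholders, not values estimated from speech:

```python
import numpy as np

def lpc_synthesize(a, gain, excitation):
    """All-pole LPC synthesis: y[n] = gain*e[n] - sum_k a[k]*y[n-k],
    where a = [1, a1, ..., ap] are the A(z) coefficients."""
    p = len(a) - 1
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        y[n] = gain * excitation[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                y[n] -= a[k] * y[n - k]
    return y

fs = 8000
voiced = np.zeros(400)
voiced[::fs // 100] = 1.0                      # 100 Hz pulse train excitation
unvoiced = np.random.default_rng(0).standard_normal(400)  # noise excitation
a = [1.0, -0.9]                                # toy 1st-order A(z), assumed
out = np.concatenate([lpc_synthesize(a, 1.0, voiced),
                      lpc_synthesize(a, 0.2, unvoiced)])
```

Switching between the pulse and noise generators per frame, with frame-wise a and gain from analysis, is what the "all unvoiced / all voiced" examples on the slide manipulate.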
Rule-based formant generation
• Formants are slowly varying – Update rates of 5-10ms sufficient
• Target values describe stationary conditions, {Fi, Bi} • Rules describe transition between phones
Parameters describe transition shape
Specific rules for all transition types
13 Speech Synthesis Torbjørn Svendsen
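Such rules can be mimicked with a simple interpolation scheme: hold each phone's target and ramp toward the next target over the final frames. The linear transition shape, the 40 ms transition length and the /a/-to-/i/ F1 targets are invented for illustration; real rule systems use transition shapes and durations specific to each phone pair:

```python
import numpy as np

def formant_track(targets, durations_ms, trans_ms=40, frame_ms=5):
    """Frame-rate formant contour: piecewise-constant targets with a
    linear transition toward the next phone over the last trans_ms."""
    track = []
    for i, (f, dur) in enumerate(zip(targets, durations_ms)):
        n = dur // frame_ms
        seg = np.full(n, float(f))
        if i + 1 < len(targets):
            k = min(trans_ms // frame_ms, n)
            # k interior points moving from f toward the next target
            seg[n - k:] = np.linspace(f, targets[i + 1], k + 2)[1:k + 1]
        track.append(seg)
    return np.concatenate(track)

# F1 for an /a/-like then /i/-like phone, 100 ms each (assumed target values)
f1 = formant_track([700, 300], [100, 100])
```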
Rule-based formant generation (cont.)
Formant synthesis
• Flexible
• Produces intelligible speech with few parameters
• Simple implementation
– Rule derivation is complex; development can be costly
• Limited naturalness
• NOTE: given sufficient training data, formant generation can be data driven, e.g. using an HMM in "production mode" to generate the formant tracks (Acero et al.)
– A similar approach for LPC-based synthesis more recently by Tohkura
Articulatory synthesis
• The waveform production is performed by describing the movement of the articulators
– Jaw opening, lip rounding, tongue placement and height, …
– Acoustics and fluid mechanics form the basis
• Limited success
– Complex theory
– Computational difficulties and complexity
Synthesis by concatenation
• Concatenation of stored waveform fragments
– Optional modification of the fragments (duration, pitch, formants)
• Dilemma: use of unmodified fragments will either
– produce audible distortion at concatenation points (phase mismatch, formant and pitch mismatch), or
– lead to an enormous database to cover all phonetic and prosodic events
• How much modification is possible before the degradation is audible?
Some central issues
1. Which unit?
2. How to design the acoustic "library" (inventory)?
– Content, recording conditions, reading style
– Annotation – type, level
– Segmentation and labeling – consistency, effort, automation?
3. How to select the best sequence of units from the acoustic inventory?
4. How to perform prosodic modification of the selected sequence?
1. Which unit?
• Longer units lead to better quality, but
– require more data to be stored
– are more context dependent
Unit requirements
• Low concatenation distortion
– Longer units -> fewer concatenation points
– Units containing attractive concatenation points
• Low prosodic distortion
– A small inventory makes prosodic modification necessary
– Modification introduces distortion
• The unit should be generalizable
– Need to be able to synthesize sequences that were not in the original inventory (except for limited-domain synthesis)
• The unit should be "trainable"
– Finite training data sufficient to estimate or predict all units
Coverage
• Complete coverage of all phonetic and prosodic events is impossible
– Large Number of Rare Events (LNRE)
Some possible unit choices
• Context-independent phonemes
– Bad concatenation properties
• Context-dependent phonemes
– Reduce the discontinuity problems
– Large number (~125k), needs to be reduced, e.g. using generalized triphones or phonetic decision trees
• Diphones (dyads)
– ~2500 possible units
– Reduce the discontinuity problems
– Widespread use
• Sub-phonemic units
– Increasing use (e.g. IBM, AT&T)
– Half-phones (AT&T), phone HMM states (IBM)
Some possible unit choices (cont.)
• Syllables, words and phrases
– Mainly used for limited-domain applications (fixed message repertoire)
– Potentially good quality
– Demand large storage and much data collection
– Computationally demanding: complex search in a large database
– Syllables or demi-syllables are the most interesting
Designing the acoustic inventory
• Recordings from one speaker, appropriately annotated
• The voice talent is very important for the resulting quality
• Design choice: rely on prosodic modification by signal processing, or aim for good coverage of the natural prosodic variation in the database
• Prosodic modification at synthesis – PSOLA-type synthesis
– Typically diphone units
– Normally desirable to have a nearly constant (neutral) F0
– Nonsense words/sentences with (near) full diphone coverage
– Small database (~5 minutes of speech contains the essential units)
Designing the acoustic inventory (2)
• Unit selection synthesis – rely on natural prosodic variation
– Representative speech – the speaking style is defined by the database
– Many representations of each phonetic unit give prosodic variation
– Large database
• Facilitates longer units, variable units
• Requires a search for the best unit sequence
– Rich phonetic and prosodic context
• Typically "real" texts
– Text selection:
• Start with a large number of natural sentences
• Analyze the sentences, predict the phonetic and prosodic realization
• Use a greedy algorithm to obtain the best possible coverage with a small number of sentences (typically 2000-4000)
• Design supplementary sentences to improve coverage
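The greedy selection step can be sketched as a set-cover loop over diphone types: repeatedly pick the sentence that adds the most uncovered units. The toy "sentences" below are bare phone strings standing in for real analyzed text:

```python
def greedy_select(sentences, max_sentences=4000):
    """Greedy coverage: repeatedly pick the sentence covering the most
    diphone types not yet covered; stop when nothing new can be added."""
    def diphones(phones):
        return set(zip(phones, phones[1:]))
    remaining = {i: diphones(s) for i, s in enumerate(sentences)}
    covered, chosen = set(), []
    while remaining and len(chosen) < max_sentences:
        best = max(remaining, key=lambda i: len(remaining[i] - covered))
        gain = remaining[best] - covered
        if not gain:
            break                       # no sentence adds a new unit
        covered |= gain
        chosen.append(best)
        del remaining[best]
    return chosen, covered

# Toy corpus of phone strings (illustrative only)
corpus = [list("sino"), list("nos"), list("sin"), list("ono")]
chosen, covered = greedy_select(corpus)
```

The same loop generalizes to richer unit types (diphones in predicted prosodic contexts) by changing the `diphones` helper.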
Coverage - LNRE
• A large number of units, each with a small probability of occurrence
• If database units are selected randomly, even a small sequence of randomly selected sentences is almost certain to contain a unit that is not in the database
• The unit inventory must be chosen with care
• Fall-back solutions must exist for non-covered units
Annotation
• For small databases, speech can be segmented and annotated manually
– Phonemic and prosodic annotation can be detailed
• For unit selection databases, automation is necessary
– Automatic or semi-automatic methods for segmentation into phonemic and prosodic units
– Annotation can be fairly high-level without loss of quality
– The annotation level and the cost function for unit selection are closely linked
3. Optimal unit string
• The selection problem arises when there are several possible choices for the unit sequence
• Traditional diphone synthesis has only one exemplar of each unit
– Trivial solution
• Selection is based on the desire for naturalness and minimum discontinuity due to
– Different phonetic contexts
– Segmentation errors in the database
– Acoustic variability
– Prosodic differences (pitch discontinuity, formant tracks)
• Search problem
– Must define an objective function to be minimized
Objective function for search
d(Ω, T) = Σ_{j=1..N} d_u(ω_j, t_j) + Σ_{j=1..N-1} d_t(ω_j, ω_{j+1})

Ω = {ω_1, ω_2, ..., ω_N} – candidate segment sequence
T = {t_1, t_2, ..., t_N} – target units

Ω̂ = argmin_Ω d(Ω, T)

Unit cost d_u(·) and transition cost d_t(·); the search matches a lattice of candidate units against the sequence of target units.
Objective function for search (2)
• How to choose the unit and transition cost functions?
– Empirical or data driven
• Empirical strategy:
– Transition cost:
• If two segments were originally spoken in succession, d_t(·) = 0
• Otherwise, the cost is the sum of a prosodic and a coarticulatory cost
• Prosodic cost proportional to the difference in F0 (or log F0) at the boundary
• Coarticulatory cost based on empirical knowledge of perceived distance
– Unit cost:
• Contributions from prosody and context
• Prosodic cost proportional to the difference in F0
• Contextual cost for using a unit from a different phonetic context, based on empirical data
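The empirical transition cost can be transcribed almost literally. The unit representation (dicts with an utterance id, position, boundary F0 and phone identity), the weights and the default coarticulatory cost are all assumptions for illustration; the coarticulation table stands in for the empirical perceptual knowledge:

```python
import math

def transition_cost(u, v, w_pros=1.0, w_coart=1.0, coart=None):
    """Empirical transition cost d_t(u, v) between two candidate units."""
    # Zero cost if the units were originally spoken in succession
    if u["utt"] == v["utt"] and u["pos"] + 1 == v["pos"]:
        return 0.0
    # Prosodic cost: log-F0 mismatch at the join
    prosodic = abs(math.log(u["f0_end"]) - math.log(v["f0_start"]))
    # Coarticulatory cost: looked up per phone pair, default 1.0 (assumed)
    coarticulatory = (coart or {}).get((u["phone"], v["phone"]), 1.0)
    return w_pros * prosodic + w_coart * coarticulatory

u = {"utt": 7, "pos": 3, "phone": "n", "f0_end": 110.0}
v_next = {"utt": 7, "pos": 4, "phone": "o", "f0_start": 108.0}
v_far = {"utt": 2, "pos": 0, "phone": "o", "f0_start": 220.0}
```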
Objective function for search (3)
• Data-driven cost functions
– Transition cost:
• A measure of spectral discontinuity, e.g. the spectral distance in the transition area (distance between the end frame of the preceding unit and the first frame of the succeeding unit)
• (Optional) prosodic cost, e.g. the magnitude of the log(F0) difference
– Unit cost, based on context. Examples:
• The same context means no cost; a different context gives infinite cost
• Generalized triphones (GT): units belonging to the same GT incur no cost, otherwise the cost is infinite
• Phonetic decision trees: e.g. no cost for units at the same leaf node
Optimal unit string selection
• Given
– the objective function to be minimized
– a target sequence from the TTS front end
– a unit inventory
• The minimization can be performed with standard dynamic programming techniques (Viterbi-style)
• Similar to HMM decoding, but with cost values instead of probabilities
• The search can be further simplified by e.g. clustering of units
– Initial search using a representative unit for each cluster
– Search refinement by selecting the best cluster member as the selected unit
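The Viterbi-style minimization over the candidate lattice can be sketched as below. The toy candidate names and cost tables are invented so that the trade-off is visible: the locally cheapest first unit is not on the globally cheapest path.

```python
def select_units(candidates, unit_cost, trans_cost):
    """Minimum-cost path through the lattice: candidates[j] lists the
    database units for target j; total cost = sum of unit costs plus
    transition costs, minimized by dynamic programming."""
    # best[j][k] = (cheapest path cost ending in candidates[j][k], backpointer)
    best = [[(unit_cost(0, u), None) for u in candidates[0]]]
    for j in range(1, len(candidates)):
        col = []
        for u in candidates[j]:
            c, back = min(
                (best[j - 1][k][0] + trans_cost(candidates[j - 1][k], u), k)
                for k in range(len(candidates[j - 1])))
            col.append((c + unit_cost(j, u), back))
        best.append(col)
    # Backtrack from the cheapest end state
    k = min(range(len(best[-1])), key=lambda i: best[-1][i][0])
    path = [k]
    for j in range(len(candidates) - 1, 0, -1):
        k = best[j][k][1]
        path.append(k)
    path.reverse()
    return [candidates[j][k] for j, k in enumerate(path)]

# Toy lattice: two targets, two candidates each (names and costs invented)
cands = [["a1", "a2"], ["b1", "b2"]]
ucost = lambda j, u: {"a1": 0.0, "a2": 1.0, "b1": 0.5, "b2": 0.5}[u]
tcost = lambda u, v: 0.0 if (u, v) == ("a2", "b1") else 2.0
seq = select_units(cands, ucost, tcost)
```

Replacing each candidate list with one representative per cluster, then re-running over the winning cluster's members, gives the two-pass simplification mentioned on the slide.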
4. Prosodic modification
• Techniques for prosodic modification (pitch, duration) are mandatory when the unit inventory is small
• Also desirable for unit selection synthesis, due to LNRE
• Main issue: how to achieve (at least moderate) prosodic modification of a unit (sequence) without introducing annoying distortion
• Audio examples: original – duration modified – pitch modified – duration and pitch modified
(Synchronous) Overlap and Add
• OLA: time-scale modification with a fixed distance between analysis windows; produces irregular pitch periods
• SOLA: the analysis window is placed at the position which gives maximum correlation between the windows
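The correlation search in SOLA can be sketched as follows; the window length, search range and the delayed sinusoid test signal are arbitrary illustrative choices:

```python
import numpy as np

def sola_offset(prev_tail, next_seg, max_shift=20):
    """Shift next_seg by 0..max_shift-1 samples and return the shift whose
    overlap has maximum normalized correlation with prev_tail."""
    L = len(prev_tail)
    best_s, best_c = 0, -np.inf
    for s in range(max_shift):
        seg = next_seg[s:s + L]
        if len(seg) < L:
            break
        c = np.dot(prev_tail, seg) / (
            np.linalg.norm(prev_tail) * np.linalg.norm(seg) + 1e-12)
        if c > best_c:
            best_c, best_s = c, s
    return best_s

# One 40-sample sinusoid period, and the same waveform delayed by 7 samples
t = np.arange(200)
prev_tail = np.sin(2 * np.pi * t[:40] / 40)
next_seg = np.sin(2 * np.pi * (t - 7) / 40)
```

Aligning at the best-correlation shift before overlap-adding is what removes the irregular pitch periods that plain OLA produces.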
Pitch Synchronous OLA (PSOLA)
• The window is pitch synchronous – centered around an excitation pulse
– Duration equal to two pitch periods, 2·T0
• Allows simple modification of the pitch frequency
• Can also modify duration
• Unvoiced sounds:
– Fixed window length (< 10 ms)
– Every other repeated segment can be inverted to avoid periodicities when expanding the duration
• Can provide high quality as long as the degree of modification is relatively low (less than a factor of about 2)
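A stripped-down PSOLA resynthesis for voiced speech: two-period Hann windows centered on each analysis epoch, overlap-added at a new epoch spacing. For clarity it maps analysis epochs one-to-one to synthesis epochs, so the duration changes along with the pitch; a real implementation repeats or skips windows to control duration independently. The impulse-train input and period values are illustrative:

```python
import numpy as np

def psola(x, epochs, new_period):
    """Pitch-modification sketch: windows of 2*T0+1 samples centred on
    each analysis epoch are overlap-added new_period samples apart."""
    T0 = epochs[1] - epochs[0]           # assume near-constant analysis pitch
    win = np.hanning(2 * T0 + 1)
    out = np.zeros(len(epochs) * new_period + 2 * T0 + 1)
    t = T0                               # first synthesis epoch
    for e in epochs:
        seg = x[e - T0: e + T0 + 1]
        if len(seg) != len(win):
            continue                     # skip epochs too close to the edges
        out[t - T0: t + T0 + 1] += win * seg
        t += new_period
    return out

# Impulse train with period 50; raise the pitch by spacing epochs 40 apart
x = np.zeros(500)
x[50::50] = 1.0
epochs = list(range(50, 450, 50))
y = psola(x, epochs, 40)
```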
PSOLA – principle, F0 change
(Figure: original signal, analysis epochs, window shift and the re-harmonized signal for T = 1.25·T0)
PSOLA – duration and F0 modification
PSOLA principle
e(n) = Σ_k δ(n − k·T0)

x(n) = e(n) * s(n) = Σ_k s(n − k·T0)

• s(n) determines the spectral envelope
• Using an appropriate, pitch-synchronous window, T0 can be changed without changing the spectral envelope (exact match at f = k·F0)
• The window type and the degree of change determine the distortion outside the pitch harmonics (interpolated values, with correctness determined by the window sidelobes)
How to determine the synthesis epochs
• t_s(j) – time instant of pitch pulse (epoch) j in the synthesis
• P_s(t) – desired pitch period at time t
• If P_s(t) is slowly varying: t_s(j+1) − t_s(j) = P_s(t_s(j))
• Exact: the next pulse is offset by the mean pitch period within the synthesis interval,

t_s(j+1) − t_s(j) = (1 / (t_s(j+1) − t_s(j))) · ∫[t_s(j), t_s(j+1)] P_s(t) dt

– Since t_s(j+1) appears on both sides, this requires an iterative calculation
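The iterative epoch calculation can be sketched as a fixed-point iteration that starts from the slowly-varying approximation. Evaluating the interval mean with the midpoint rule is an assumption of this sketch; it is exact when P_s(t) is constant or linear, which covers the cases the slides consider:

```python
def next_epoch(t_j, P_s, iters=12):
    """Solve t_{j+1} - t_j = mean of P_s over [t_j, t_{j+1}] by fixed-point
    iteration. The interval mean is taken as P_s at the interval midpoint,
    which is exact for constant or linear P_s."""
    t_next = t_j + P_s(t_j)                   # slowly-varying initial guess
    for _ in range(iters):
        t_next = t_j + P_s((t_j + t_next) / 2.0)
    return t_next
```

With a constant desired period the first guess is already exact; with a linearly rising period the iteration converges to the interval whose length equals the mid-interval period.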
Synthesis epoch calculation
Pitch modification, no time scaling
Original pitch: P_a(t) = t_a(i+1) − t_a(i) (piecewise constant)
Desired pitch: P_s(t) = β(t) · P_a(t)

t_s(j+1) − t_s(j) = (1 / (t_s(j+1) − t_s(j))) · ∫[t_s(j), t_s(j+1)] β(t) P_a(t) dt

P_a(t) is piecewise constant; β(t) is normally constant or linear
Changing the time scale

t_s = D(t_a) = ∫[0, t_a] α(τ) dτ

Presume α(τ) = α (reduced speed when α > 1)

A derivation similar to the pitch-epoch determination gives

t_s(j+1) − t_s(j) = (α / (t_s(j+1) − t_s(j))) · ∫[t_s(j)/α, t_s(j+1)/α] P_a(t) dt

If P_a(t) ≈ P_a (constant in the interval): t_s(j+1) − t_s(j) = P_a
Changing the time scale
All the modifications
Changing both time and pitch:

t_s(j+1) − t_s(j) = (α / (t_s(j+1) − t_s(j))) · ∫[t_s(j)/α, t_s(j+1)/α] β(t) P_a(t) dt
Epoch positioning in database
• The database must be annotated with the pitch pulse locations
• Accurate positioning is necessary for good performance
• Automatic methods using pitch estimation techniques give reasonably good results
• Use of a laryngograph (electroglottograph – EGG) during recording is recommended
– Measures the electrical resistance across the vocal cords, which depends on the glottal opening
– Epochs are found by peak picking on the derivative of the EGG signal
Epochs from EGG signal
• Peak picking on the EGG signal or its time derivative
• Accurate epoch and F0 estimation
• Voiced/unvoiced determination
(Figure: speech signal, EGG signal and the detected pulse locations)
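Peak picking on the EGG derivative can be sketched as below. The synthetic "EGG" signal (an exponential decay restarting at each glottal closure), the relative threshold and the refractory period are illustrative assumptions:

```python
import numpy as np

def epochs_from_egg(egg, fs, max_f0=400.0, thresh=0.5):
    """Glottal closure instants: local maxima of the EGG time derivative
    above a relative threshold, at least one minimal period apart."""
    d = np.diff(egg)
    level = thresh * d.max()
    min_dist = int(fs / max_f0)        # refractory period at the highest F0
    epochs, last = [], -min_dist
    for n in range(1, len(d) - 1):
        if (d[n] >= level and d[n] >= d[n - 1] and d[n] >= d[n + 1]
                and n - last >= min_dist):
            epochs.append(n)
            last = n
    return epochs

# Synthetic EGG: sharp rise at each closure, slow fall (period 80 = 100 Hz at 8 kHz)
fs = 8000
n = np.arange(400)
egg = np.exp(-(n % 80) / 20.0)
gci = epochs_from_egg(egg, fs)
```

The detected instants double as F0 estimates (the reciprocal of the epoch spacing) and as a voiced/unvoiced cue (no strong derivative peaks during unvoiced stretches).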
PSOLA limitations
• Amplitude mismatch
• Voiced fricatives
– Increased buzziness
• All modification introduces distortion and unnaturalness
– The degree depends on the amount of modification
• Limits on the maximum modification
Phase mismatches
• Wrong positioning of pitch pulses in database
• Causes glitches in output
Pitch mismatches
• Correct F0 and pulse positions
• Different F0 in the segments causes spectral and waveform discontinuities
HMMs for synthesis
• In speech recognition, hidden Markov models are used to model speech production
– The task of the recognizer is to find the model that best explains the observed utterance
• If the HMM is used for generating observations, the produced feature-vector sequence can be used to produce speech from a given unit sequence (phone sequence)
– The feature vectors must be suitable for speech production
– Combination of continuous and discrete elements
– Modifications to HMM theory are necessary to facilitate the generative mode
– Potential for efficient and flexible synthesis
• This is the basis for HMM-based synthesis
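A deliberately simplified picture of this generative mode: each state emits its mean vector for a predicted number of frames. Real HMM synthesis additionally imposes dynamic (delta) feature constraints to smooth the trajectory, which is one of the theory modifications mentioned above; the state means and durations here are invented:

```python
import numpy as np

def hmm_generate(state_means, state_durations):
    """Feature-vector sequence from an HMM run in 'production mode'
    (simplified): state j outputs its mean vector for
    state_durations[j] frames, with no trajectory smoothing."""
    frames = [np.tile(m, (d, 1)) for m, d in zip(state_means, state_durations)]
    return np.vstack(frames)

# Toy 3-state phone model with 2-dimensional "spectral" features
means = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
features = hmm_generate(means, [3, 5, 2])
```

The resulting feature matrix would then drive a vocoder-style filter (e.g. the LPC generation on the next slide) to produce the waveform.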
HMM-based speech synthesis
• Training from a database
• Produces excitation and filter parameters for e.g. LPC-type speech generation
Text-to-speech synthesis, a hybrid solution
• Speech training database
• HMM-based system for prediction and unit selection
• Experimental system
• Very good evaluation in an international competition (Blizzard Challenge 2010)
(Diagram: HTS training and analysis of the training data build a voice database; at synthesis time the TTS frontend, target model construction, candidate list construction, state alignment, selection & boundary decision and waveform concatenation turn the input text into speech)
A few examples
• Diphone synthesis: Festival, Arne, Infovox
• Unit selection synthesis: Festival, AT&T NextGen
• HMM synthesis
• Hybrid HMM/Unit selection
• Limited domain unit selection synthesis
Summary
• Data-driven vs. synthesis by rule.
• The current synthesis generation is concatenative – waveform synthesis.
• Single-unit synthesis, "diphone synthesis", requires units to be prosodically modified.
• Unit selection synthesis aims to use natural prosody with minimal prosodic modification.
• Issues in waveform synthesis:
– Unit definition.
– Definition, realization and annotation of the waveform library.
– Unit selection – search.
– Prosodic modification.