Page 1: Speech Synthesis


Speech synthesis

Page 2: Speech Synthesis

Text-to-speech synthesis

•  The automatic transformation from (electronic) text to speech
•  The ”speaker” is defined by the system design
   –  Single speaker
•  In contrast to speech recognition, the aim is not to handle all speakers and all normal pronunciation variants, but to render one spoken realization of the text that is perceived as natural and intelligible
•  A text contains orthographic words, numbers, abbreviations, mnemonics and punctuation
   –  Linguistic analysis of the text is necessary to:
      •  Interpret symbols
      •  Analyze the grammatical structure
      •  Infer the semantic interpretation of the text

Page 3: Speech Synthesis

The processing steps of TTS

•  Text analysis: text normalization; analysis of document structure; linguistic analysis
   –  Output: tagged text
•  Phonemic analysis: homograph disambiguation, morphological analysis, letter-to-sound mapping
   –  Output: tagged phone sequence
•  Prosodic analysis: intonation; duration; volume
   –  Output: control sequence, tagged phones
•  Speech synthesis: voice rendering
   –  Output: synthetic speech


(Block diagram: Text → Text analysis → Phonemic analysis → Prosodic analysis → Speech synthesis → Speech)

Page 4: Speech Synthesis

Text-to-speech synthesis

•  Speech synthesis concerns the waveform generation from the annotated symbol sequence (typically a phone sequence)
•  Philosophy: rule-based vs. data-driven synthesis
•  Method: articulatory synthesis; formant synthesis; concatenative (waveform) synthesis

(Block diagram: Text → Text analysis → Phonemic analysis → Prosodic analysis → Speech synthesis → Speech)

Page 5: Speech Synthesis

Quality

•  Different strategies give different quality but also different consistency of quality

•  No strategy can currently provide consistent high quality (but it is getting closer)

•  Limited domain gives high quality within the application domain

(Figure: quality of example systems over time – Voder, Klattalk, Infovox, Festival, NextGen)

Page 6: Speech Synthesis

The synthesis space

(Figure adapted from Granström: the synthesis space, spanned by dimensions such as cost, bit rate, naturalness, intelligibility, speech knowledge, flexibility, processing needs, units, vocabulary and complexity)

Page 7: Speech Synthesis

Main methods

•  Formant synthesis
•  Concatenative, or waveform, synthesis
•  Articulatory synthesis

Page 8: Speech Synthesis

Formant synthesis

•  Normally a rule-based (knowledge-driven) system, but can also be data driven
•  Each formant can be specified by its center frequency, bandwidth and, optionally, amplitude
•  E.g.:

(Block diagram: annotated phones → rule system → formant tracks and pitch contour → formant synthesis → synthetic speech)

$$H(z) = \frac{1}{1 - 2e^{-\pi b_i T}\cos(2\pi f_i T)\, z^{-1} + e^{-2\pi b_i T}\, z^{-2}}$$

A 2nd order filter with resonance at fi and bandwidth bi (T is the sample period).
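To make the resonator concrete, here is a minimal sketch in Python (numpy assumed; the /a/ formant targets are textbook approximations, not values from the slides). It implements the difference-equation form of H(z), normalized to unity gain at DC as in Klatt-style synthesizers, and cascades three resonators over a pulse-train excitation:

```python
import numpy as np

def formant_resonator(x, fi, bi, fs):
    """Second-order resonator with resonance at fi (Hz) and bandwidth bi (Hz).

    Difference-equation form y[n] = C*x[n] + A*y[n-1] + B*y[n-2] of
    H(z) = C / (1 - A z^-1 - B z^-2), with C chosen for unity gain at DC.
    """
    T = 1.0 / fs                          # sample period
    B = -np.exp(-2 * np.pi * bi * T)      # negated squared pole radius
    A = 2 * np.exp(-np.pi * bi * T) * np.cos(2 * np.pi * fi * T)
    C = 1.0 - A - B                       # unity gain at f = 0
    y = np.zeros_like(x, dtype=float)
    for n in range(len(x)):
        y[n] = C * x[n]
        if n >= 1:
            y[n] += A * y[n - 1]
        if n >= 2:
            y[n] += B * y[n - 2]
    return y

# Cascade three formants over a 100 Hz pulse train (a crude vowel).
fs = 16000
excitation = np.zeros(fs // 2)
excitation[::fs // 100] = 1.0             # one pulse every 10 ms
speech = excitation
for fi, bi in [(730, 60), (1090, 90), (2440, 120)]:  # rough /a/ targets
    speech = formant_resonator(speech, fi, bi, fs)
```

Chaining the resonators like this is the cascade structure discussed on the next slide; a parallel structure would instead sum the resonator outputs with individual amplitudes.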

Page 9: Speech Synthesis

Implementation

•  Cascade or parallel implementation
•  Voiced sounds typically use the cascade structure, unvoiced sounds the parallel structure
•  LPC filters can also be used
   –  Normally poorer quality

Page 10: Speech Synthesis

Klatt formant synthesizer

Page 11: Speech Synthesis

LPC synthesis

•  Example:
   –  Original
   –  LPC-coded, all unvoiced/voiced
   –  Speed: halved/doubled
   –  Pitch: halved/doubled
   –  Melody

(Block diagram: pulse generator (voiced) or noise generator (unvoiced) → gain G → all-pole LPC filter 1/A(z) → synthetic speech)
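A minimal sketch of this source-filter structure (Python with numpy/scipy assumed; the filter coefficients and gain are illustrative stand-ins for the results of frame-wise LPC analysis):

```python
import numpy as np
from scipy.signal import lfilter

fs = 8000
n = fs // 4                               # one 250 ms frame

# Excitation: pulse train for voiced frames, white noise for unvoiced.
voiced = True
if voiced:
    excitation = np.zeros(n)
    excitation[::fs // 120] = 1.0         # F0 = 120 Hz
else:
    excitation = np.random.randn(n)

# Illustrative (stable) LPC coefficients: A(z) = 1 + a1 z^-1 + ... + ap z^-p.
# In a real coder these come from LPC analysis of each speech frame.
a = np.array([1.0, -1.3, 0.8, -0.2])
G = 0.5                                   # frame gain

speech = lfilter([G], a, excitation)      # synthesis filter G / A(z)
```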

Page 12: Speech Synthesis

Rule-based formant generation

•  Formants are slowly varying
   –  Update rates of 5-10 ms are sufficient
•  Target values describe stationary conditions, {Fi, Bi}
•  Rules describe the transitions between phones
   –  Parameters describe the transition shape
   –  Specific rules for all transition types
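A toy sketch of target-plus-transition track generation (Python; the phone targets, durations and the single linear 40 ms rule are invented for the example, whereas a real rule system has transition-type-specific rules):

```python
import numpy as np

frame_s = 0.005                    # 5 ms parameter update rate

# Hypothetical per-phone targets: (phone, duration in s, F1 target in Hz).
segments = [("s", 0.10, 300.0), ("a", 0.20, 730.0), ("m", 0.12, 280.0)]
transition_s = 0.04                # toy rule: 40 ms linear transition

track = []
for i, (phone, dur, target) in enumerate(segments):
    n = int(round(dur / frame_s))
    seg = np.full(n, target)       # stationary target value
    if i > 0:                      # smooth the onset from the previous target
        prev = segments[i - 1][2]
        k = min(int(transition_s / frame_s), n)
        seg[:k] = np.linspace((prev + target) / 2, target, k)
    track.append(seg)
f1_track = np.concatenate(track)   # one F1 value per 5 ms frame
```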

Page 13: Speech Synthesis

Rule-based formant generation

Page 14: Speech Synthesis

Formant synthesis

•  Flexible
•  Produces intelligible speech with few parameters
•  Simple implementation
   –  Rule derivation is complex, and development can be costly
•  Limited naturalness
•  NOTE: Given sufficient training data, formant generation can be data driven, e.g. using an HMM in ”production mode” for generating the formant tracks (Acero et al.)
   –  A similar approach for LPC-based synthesis more recently by Tohkura

Page 15: Speech Synthesis

Articulatory synthesis

•  The waveform production is performed by describing the movement of the articulators
   –  Jaw opening, lip rounding, tongue placement and height, …
   –  Acoustics and fluid mechanics form the basis
•  Limited success
   –  Complex theory
   –  Computational difficulties and complexity

Page 16: Speech Synthesis

Synthesis by concatenation

•  Concatenation of stored waveform fragments
   –  Optional modification of the fragments (duration, pitch, formants)
•  Dilemma: use of unmodified fragments will either
   –  Produce audible distortion at concatenation points (phase mismatch, formant and pitch mismatch), or
   –  Lead to an enormous database to cover all phonetic and prosodic events
•  How much modification is possible before the degradation is audible?

Page 17: Speech Synthesis

Some central issues

1.  Which unit?
2.  How to design the acoustic ”library” (inventory)?
    –  Content, recording conditions, reading style
    –  Annotation – type, level
    –  Segmentation and labeling – consistency, effort, automation?
3.  How to select the best sequence of units from the acoustic inventory?
4.  How to perform prosodic modification of the selected sequence?

Page 18: Speech Synthesis

Some central issues

1.  Which unit?
2.  How to design the acoustic ”library” (inventory)?
    –  Content, recording conditions, reading style
    –  Annotation – type, level
    –  Segmentation and labeling – consistency, effort, automation?
3.  How to select the best sequence of units from the acoustic inventory?
4.  How to perform prosodic modification of the selected sequence?

Page 19: Speech Synthesis

1. Which unit?

•  A longer unit leads to better quality, but
   –  requires more data to be stored
   –  is more context dependent

Page 20: Speech Synthesis

Unit requirements

•  Low concatenation distortion
   –  Longer units → fewer concatenation points
   –  Units containing attractive concatenation points
•  Low prosodic distortion
   –  A small inventory means prosodic modification is necessary
   –  Modification introduces distortion
•  The unit should be generalizable
   –  Need to be able to synthesize sequences that were not in the original inventory (except for limited domain synthesis)
•  The unit should be ”trainable”
   –  Finite training data should be sufficient to estimate or predict all units

Page 21: Speech Synthesis

Coverage

•  Complete coverage of all phonetic and prosodic events is impossible
   –  Large Number of Rare Events (LNRE)

Page 22: Speech Synthesis

Some possible unit choices

•  Context independent phonemes
   –  Bad concatenation properties
•  Context dependent phonemes
   –  Reduces the discontinuity problems
   –  Large number (~125k), which needs to be reduced
      •  E.g. generalized triphones or phonetic decision trees
•  Diphones (dyads)
   –  ~2500 possible units
   –  Reduces the discontinuity problems
   –  Widespread use
•  Sub-phonemic units
   –  Increased use (e.g. IBM, AT&T)
   –  Half-phones (AT&T), phone HMM states (IBM)

Page 23: Speech Synthesis

Some possible unit choices (cont.)

•  Syllables, words and phrases
   –  Mainly used for limited domain applications
      •  Fixed message repertoire
   –  Potentially good quality
   –  Demands large storage
      •  And much data collection
   –  Computationally demanding
      •  Complex search in a large database
   –  Syllables or demi-syllables are the most interesting

Page 24: Speech Synthesis

Some central issues

1.  Which unit?
2.  How to design the acoustic ”library” (inventory)?
    –  Content, recording conditions, reading style
    –  Annotation – type, level
    –  Segmentation and labeling – consistency, effort, automation?
3.  How to select the best sequence of units from the acoustic inventory?
4.  How to perform prosodic modification of the selected sequence?

Page 25: Speech Synthesis

Designing the acoustic inventory

•  Recordings from one speaker, appropriately annotated
•  The voice talent is very important for the resulting quality
•  Design choice: rely on prosodic modification by signal processing, or aim for good coverage of the natural prosodic variation in the database
•  Prosodic modification at synthesis – PSOLA-type synthesis
   –  Typically diphone units
   –  Normally desirable to have nearly constant (neutral) F0
   –  Nonsense words/sentences with (near) full diphone coverage
   –  Small database (~5 minutes of speech contains the essential units)

Page 26: Speech Synthesis

Designing the acoustic inventory (2)

•  Unit selection synthesis – rely on natural prosodic variation
   –  Representative speech – the speaking style is defined by the database
   –  Many representations of each phonetic unit
      •  Gives prosodic variation
   –  Large database
   –  Facilitates longer units, variable units
   –  Requires a search for the best unit sequence
   –  Rich phonetic and prosodic context
      •  Typically ”real” texts
   –  Text selection:
      •  Start with a large number of natural sentences
      •  Analyze the sentences, predict the phonetic and prosodic realization
      •  Use a greedy algorithm to obtain the best coverage possible with a small number of sentences (typically 2000-4000), as sketched below
      •  Design supplementary sentences to improve coverage
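The greedy selection step can be read as a set-cover loop. A minimal sketch (Python; the data structures and unit names are hypothetical, and real systems also weight units by predicted frequency and prosodic context):

```python
def greedy_select(sentences, max_sentences):
    """sentences: list of (text, set_of_predicted_units). Returns a small
    subset whose union of units covers as much as possible (greedy set cover)."""
    covered, chosen = set(), []
    pool = list(sentences)
    while pool and len(chosen) < max_sentences:
        # Pick the sentence contributing the most not-yet-covered units.
        best = max(pool, key=lambda s: len(s[1] - covered))
        if not best[1] - covered:      # nothing new left to gain
            break
        chosen.append(best[0])
        covered |= best[1]
        pool.remove(best)
    return chosen, covered

# Toy usage with diphone-like unit sets:
sents = [("sentence A", {"s-a", "a-m", "m-#"}),
         ("sentence B", {"s-a", "a-t"}),
         ("sentence C", {"t-o", "o-m", "a-t"})]
chosen, covered = greedy_select(sents, max_sentences=2)
```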

Page 27: Speech Synthesis

Coverage - LNRE

•  A large number of units each have a small probability of occurrence
•  If the database units are selected randomly, the probability of encountering a unit that is not in the database approaches certainty even for a small sequence of randomly selected sentences
•  The unit inventory must therefore be chosen with care
•  Fall-back solutions must exist for units that are not covered

(Figure: rank-ordered unit probability distribution, P(unit))

Page 28: Speech Synthesis

Annotation

•  For small databases, the speech can be segmented and annotated manually
   –  The phonemic and prosodic annotation can be detailed
•  For unit selection databases, automation is necessary
   –  Automatic or semi-automatic methods for segmentation into phonemic and prosodic units
   –  The annotation can be fairly high-level without loss of quality
   –  The annotation level and the cost function for unit selection are closely linked

Page 29: Speech Synthesis

Some central issues

1.  Which unit?
2.  How to design the acoustic ”library” (inventory)?
    –  Content, recording conditions, reading style
    –  Annotation – type, level
    –  Segmentation and labeling – consistency, effort, automation?
3.  How to select the best sequence of units from the acoustic inventory?
4.  How to perform prosodic modification of the selected sequence?

Page 30: Speech Synthesis

3. Optimal unit string

•  A selection problem arises when there are several possible choices for the unit sequence
•  Traditional diphone synthesis has only one exemplar of each unit
   –  Trivial solution
•  The selection is made based on the desire for naturalness and minimum discontinuity due to
   –  Different phonetic contexts
   –  Segmentation errors in the database
   –  Acoustic variability
   –  Prosodic differences (pitch discontinuity, formant tracks)
•  Search problem
   –  Must define an objective function to be minimized

Page 31: Speech Synthesis

Objective function for search

$$d(\Theta,T) = \sum_{j=1}^{N} d_u(\theta_j, t_j) + \sum_{j=1}^{N-1} d_t(\theta_j, \theta_{j+1})$$

$$\Theta = \{\theta_1, \theta_2, \ldots, \theta_N\} \text{ – candidate segment sequence}$$

$$T = \{t_1, t_2, \ldots, t_N\} \text{ – target units}$$

$$\hat{\Theta} = \arg\min_{\Theta}\, d(\Theta,T)$$

Unit cost du(·), transition cost dt(·)

(Figure: lattice of candidate units aligned with the sequence of target units)

Page 32: Speech Synthesis

Objective function for search (2)

•  How to choose the unit and transition cost functions?
   –  Empirical or data driven
•  Empirical strategy:
   –  Transition cost:
      •  If two segments were originally spoken in succession, dt(·) = 0
      •  Otherwise, the cost is the sum of a prosodic and a coarticulatory cost
      •  The prosodic cost is proportional to the difference in F0 (or log F0) at the boundary
      •  The coarticulatory cost is based on empirical knowledge of perceived distance
   –  Unit cost:
      •  Contributions from prosody and context
      •  The prosodic cost is proportional to the difference in F0
      •  The contextual cost of using a unit from a different phonetic context is based on empirical data

Page 33: Speech Synthesis

Objective function for search (3)

•  Data-driven cost functions
   –  Transition cost:
      •  A measure of spectral discontinuity, e.g. the spectral distance in the transition area (the distance between the end frame of the preceding unit and the first frame of the succeeding unit)
      •  (Optional) prosodic cost, e.g. the magnitude of the log(F0) difference
   –  Unit cost:
      •  Based on context
      •  Examples:
         –  The same context means no cost, a different context gives infinite cost
         –  Generalized triphones (GT): no cost if the units belong to the same GT, otherwise infinite cost
         –  Phonetic decision trees, e.g. no cost for units at the same leaf node

Page 34: Speech Synthesis

Optimal unit string selection

•  Given
   –  The objective function to be minimized
   –  A target sequence from the TTS front end
   –  A unit inventory
•  The minimization can be performed using standard dynamic programming techniques (Viterbi-style), as sketched below
•  Similar to HMM decoding, but with no probabilities, only cost values
•  The search can be further simplified by e.g. clustering of units
   –  Initial search using units representing each cluster
   –  Search refinement by selecting the best cluster member as the selected unit
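A sketch of the Viterbi-style minimization over the candidate lattice (Python; `candidates[j]` is assumed to hold the inventory units proposed for target j, and the two cost functions are as sketched earlier):

```python
import numpy as np

def select_units(candidates, targets, unit_cost, transition_cost):
    """Dynamic programming over the candidate lattice: like HMM Viterbi
    decoding, but accumulating costs instead of log-probabilities."""
    N = len(targets)
    cost = [[unit_cost(c, targets[0]) for c in candidates[0]]]
    back = [[0] * len(candidates[0])]
    for j in range(1, N):
        cost_j, back_j = [], []
        for c in candidates[j]:
            # Cheapest way to reach candidate c from the previous column.
            trans = [cost[j - 1][k] + transition_cost(p, c)
                     for k, p in enumerate(candidates[j - 1])]
            k_best = int(np.argmin(trans))
            cost_j.append(trans[k_best] + unit_cost(c, targets[j]))
            back_j.append(k_best)
        cost.append(cost_j)
        back.append(back_j)
    k = int(np.argmin(cost[-1]))          # cheapest end point
    path = [k]
    for j in range(N - 1, 0, -1):         # follow the back-pointers
        k = back[j][k]
        path.append(k)
    path.reverse()
    return [candidates[j][path[j]] for j in range(N)]
```

The cluster-based simplification mentioned above would run this search once over cluster representatives, then re-run the column-wise minimization only inside the winning clusters.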

Page 35: Speech Synthesis

Some central issues

1.  Which unit?
2.  How to design the acoustic ”library” (inventory)?
    –  Content, recording conditions, reading style
    –  Annotation – type, level
    –  Segmentation and labeling – consistency, effort, automation?
3.  How to select the best sequence of units from the acoustic inventory?
4.  How to perform prosodic modification of the selected sequence?

Page 36: Speech Synthesis

4. Prosodic modification

•  Techniques for prosodic modification (pitch, duration) are mandatory when the unit inventory is small
•  Also desirable for unit selection synthesis, due to LNRE
•  Main issue: how to achieve (at least moderate) prosodic modification of a unit (sequence) without introducing annoying distortion
•  Example 1: original – duration – pitch – duration and pitch
•  Example 2: original – duration – pitch – duration and pitch

Page 37: Speech Synthesis

(Synchronous) Overlap and Add

•  OLA: time-scale modification with a fixed distance between analysis windows; produces irregular pitch periods
•  SOLA: the analysis window is placed at the position which gives maximum correlation between the windows, as sketched below
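The SOLA alignment step can be sketched as a bounded cross-correlation search (Python with numpy; the overlap length and search range are tuning parameters, not values from the slides):

```python
import numpy as np

def sola_offset(output, pos, window, overlap, search):
    """Find the shift k in [-search, search] that maximizes the normalized
    cross-correlation between output[pos+k : pos+k+overlap] and the first
    `overlap` samples of the new analysis window (the SOLA alignment step)."""
    best_k, best_r = 0, -np.inf
    w = window[:overlap]
    for k in range(-search, search + 1):
        if pos + k < 0:
            continue
        seg = output[pos + k: pos + k + overlap]
        if len(seg) < overlap:
            continue
        denom = np.linalg.norm(seg) * np.linalg.norm(w) + 1e-12
        r = float(np.dot(seg, w)) / denom
        if r > best_r:
            best_k, best_r = k, r
    return best_k

# The window is then cross-faded (overlap-added) into the output at
# pos + best_k, which keeps the pitch periods continuous.
```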

Page 38: Speech Synthesis

Pitch Synchronous OLA (PSOLA)

•  The window is pitch synchronous
   –  Centered around an excitation pulse
   –  Duration equal to two pitch periods, 2·T0
•  Allows simple modification of the pitch frequency
•  Can also modify duration
•  Unvoiced sounds:
   –  Fixed window length (< 10 ms)
   –  Every other repeated segment can be inverted in order to avoid periodicities when expanding the duration
•  Can provide high quality as long as the degree of modification is relatively low (< 2)

Page 39: Speech Synthesis

PSOLA – principle, F0 change

(Figure: T = 1.25·T0 – original signal, epochs, shift, re-harmonized signal)

Page 40: Speech Synthesis

PSOLA – duration and F0 modification

Page 41: Speech Synthesis

PSOLA principle

!"

#"=

#=k

kTnne )()( 0$

s(n) !"

#"=

#==k

kTnsnsnenx )()(*)()( 0

•  s(n) determines spectral envelope

•  Using an appropriate, pitch synchronous window, T0 can be changed without changing spectral envelope (exact match at f=k*F0)

•  Window type and degree of change will determine distortion outside pitch harmonics (interpolated values, correctness determined by window sidelobes)
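A bare-bones sketch of the resynthesis loop (Python with numpy; `ae` and `se` are analysis and synthesis epoch positions in samples, e.g. from the epoch calculations on the following pages):

```python
import numpy as np

def psola(x, ae, se, to_analysis_time=lambda t: t):
    """Pitch-synchronous overlap-add: for each synthesis epoch in `se`, take
    the two-pitch-period Hann window around the nearest analysis epoch in
    `ae` (both integer sample indices) and add it at its new position.
    `to_analysis_time` maps synthesis time back to the analysis axis
    (identity for pure pitch modification, t/alpha under time scaling)."""
    ae = np.asarray(ae)
    se = np.asarray(se)
    y = np.zeros(int(se[-1]) + int(ae[-1] - ae[-2]) + 1)
    for ts in se:
        i = int(np.argmin(np.abs(ae - to_analysis_time(ts))))
        if i == 0 or i == len(ae) - 1:
            continue                      # skip edge epochs for simplicity
        left, right = ae[i] - ae[i - 1], ae[i + 1] - ae[i]
        seg = x[ae[i] - left: ae[i] + right]      # ~two pitch periods
        win = np.hanning(len(seg))
        start = int(ts) - left
        if start >= 0 and start + len(seg) <= len(y):
            y[start: start + len(seg)] += win * seg
    return y
```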

Page 42: Speech Synthesis

How to determine the synthesis epochs

•  ts(j) – time instant of pitch pulse (epoch) j in the synthesis
•  Ps(t) – desired pitch period at time t
•  If Ps(t) is slowly varying, ts(j+1) − ts(j) = Ps(ts(j))
•  Exact:

$$t_s(j+1) - t_s(j) = \frac{1}{t_s(j+1) - t_s(j)} \int_{t_s(j)}^{t_s(j+1)} P_s(t)\, dt$$

   –  The next pulse is offset by the mean pitch period within the synthesis interval
   –  Iterative calculation (ts(j+1) appears on both sides)
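Since ts(j+1) appears on both sides, the exact relation invites a fixed-point iteration. A sketch (Python; the trapezoidal rule stands in for the interval mean, and the F0 glide in the usage example is invented):

```python
import numpy as np

def next_epoch(t_prev, Ps, iterations=5):
    """Solve t - t_prev = mean of Ps over [t_prev, t] by fixed-point
    iteration, starting from the slowly-varying approximation."""
    t = t_prev + Ps(t_prev)                   # initial guess: local period
    for _ in range(iterations):
        mean_p = 0.5 * (Ps(t_prev) + Ps(t))   # trapezoidal interval mean
        t = t_prev + mean_p
    return t

def synthesis_epochs(t0, t_end, Ps):
    """Generate epochs from t0 to t_end following the desired pitch contour."""
    epochs = [t0]
    while epochs[-1] < t_end:
        epochs.append(next_epoch(epochs[-1], Ps))
    return np.array(epochs)

# Example: a linear F0 glide from 100 Hz to 150 Hz over one second at 16 kHz,
# with Ps returning the desired pitch period in samples.
fs = 16000
Ps = lambda t: fs / (100.0 + 50.0 * min(t / fs, 1.0))
epochs = synthesis_epochs(0.0, fs, Ps)
```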

Page 43: Speech Synthesis

Synthesis epoch calculation

Page 44: Speech Synthesis

Pitch modification, no time scaling

Original pitch: $P_a(t) = t_a(i+1) - t_a(i)$ for $t_a(i) \le t < t_a(i+1)$

Desired pitch: $P_s(t) = \beta(t)\, P_a(t)$

$$t_s(j+1) - t_s(j) = \frac{1}{t_s(j+1) - t_s(j)} \int_{t_s(j)}^{t_s(j+1)} \beta(t)\, P_a(t)\, dt$$

Pa(t) is piecewise constant; β(t) is normally constant or linear.

Page 45: Speech Synthesis

Changing the time scale

$$t_s = D(t_a) = \int_0^{t_a} \alpha(\tau)\, d\tau$$

Presume $\alpha(\tau) = \alpha$ (reduces the speed when $\alpha > 1$).

A derivation similar to the pitch epoch determination gives

$$t_s(j+1) - t_s(j) = \frac{\alpha}{t_s(j+1) - t_s(j)} \int_{t_s(j)/\alpha}^{t_s(j+1)/\alpha} P_a(t)\, dt$$

If $P_a(t) \approx P_a$ (constant in the interval): $t_s(j+1) - t_s(j) = P_a$, i.e. pure time scaling leaves the pitch unchanged.

Page 46: Speech Synthesis

Changing the time scale

Page 47: Speech Synthesis

All the modifications

Changing both time and pitch:

$$t_s(j+1) - t_s(j) = \frac{\alpha}{t_s(j+1) - t_s(j)} \int_{t_s(j)/\alpha}^{t_s(j+1)/\alpha} \beta(t)\, P_a(t)\, dt$$
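One sketch of the combined recursion (Python; `Pa` returns the analysis pitch period in samples, `beta` the pitch-scale factor, `alpha` the time-scale factor; the interval mean of β·Pa is approximated by its value at the mapped interval start):

```python
def modified_epochs(t0, t_end, Pa, beta, alpha):
    """Synthesis epochs for combined time scaling (alpha) and pitch
    modification (beta(t)): the epoch spacing follows beta * Pa evaluated
    on the analysis time axis at t/alpha, a one-point approximation of
    the interval mean in the exact integral relation."""
    epochs = [t0]
    while epochs[-1] < t_end:
        ta = epochs[-1] / alpha           # map back to the analysis axis
        epochs.append(epochs[-1] + beta(ta) * Pa(ta))
    return epochs

# Example: slow down by 20% (alpha = 1.2) and raise the pitch by 10%
# (beta = 1/1.1 shortens the period), with a constant 8 ms analysis period.
epochs = modified_epochs(0.0, 16000, Pa=lambda t: 128.0,
                         beta=lambda t: 1 / 1.1, alpha=1.2)
```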

Page 48: Speech Synthesis

Epoch positioning in database

•  The database must be annotated with the pitch pulse locations
•  Accurate positioning is necessary for good performance
•  Automatic methods using pitch estimation techniques give reasonably good results
•  Use of a laryngograph (electroglottograph – EGG) during recording is recommended
   –  Measures the resistance across the vocal cords – dependent on the glottal opening
   –  Peak picking of the derivative of the EGG signal

Page 49: Speech Synthesis

Epochs from EGG signal

•  Peak picking on the EGG signal or its time derivative
•  Accurate epoch and F0 estimation
•  Voiced/unvoiced determination

(Figure: speech signal, EGG signal, and detected pulse locations)
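A sketch of epoch detection from the EGG signal (Python with scipy; the threshold and the maximum-F0 bound are placeholders to be tuned per recording):

```python
import numpy as np
from scipy.signal import find_peaks

def egg_epochs(egg, fs, f0_max=400.0):
    """Detect pitch epochs as peaks in the time derivative of the EGG
    signal; consecutive epochs must be at least one minimal pitch period
    (1/f0_max) apart."""
    degg = np.diff(egg)                    # time derivative of EGG
    min_dist = int(fs / f0_max)            # shortest allowed pitch period
    height = 0.3 * np.max(np.abs(degg))    # crude adaptive threshold
    peaks, _ = find_peaks(degg, height=height, distance=min_dist)
    return peaks                           # sample indices of the epochs
```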

Page 50: Speech Synthesis

PSOLA limitations

•  Amplitude mismatch
•  Voiced fricatives
   –  Increased buzziness
•  All modification will introduce distortion and unnaturalness
   –  The degree depends on the amount of modification
•  Limits on the maximum modification

Page 51: Speech Synthesis

Phase mismatches

•  Wrong positioning of pitch pulses in database

•  Causes glitches in output

Page 52: Speech Synthesis

Pitch mismatches

•  Correct F0 and pulse position

•  Different F0 in the segments causes spectral and waveform discontinuities

Page 53: Speech Synthesis

HMMs for synthesis

•  In speech recognition, hidden Markov models are used to model speech production
   –  The task of the recognizer is to find the model that best explains the observed utterance
•  If the HMM is instead used for generating observations, the produced feature vector sequence can be used to produce speech from a given unit sequence (phone sequence)
   –  The feature vectors must be suitable for speech production
   –  Combination of continuous and discrete elements
   –  Modifications to HMM theory are necessary to facilitate the generative mode
   –  Potential for efficient and flexible synthesis
•  This is the basis for HMM-based synthesis, sketched below
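A heavily simplified sketch of the generative idea (Python with numpy; real HMM-based synthesizers generate smooth trajectories using dynamic features and explicit duration models rather than repeating raw state means, so this only illustrates the "production mode"):

```python
import numpy as np

def generate_parameters(phone_models):
    """For each HMM state, emit its mean observation vector for the
    expected state duration. phone_models: list of left-to-right models,
    each a list of states with 'mean' (feature vector) and 'dur' (frames)."""
    frames = []
    for model in phone_models:            # one HMM per phone in the sequence
        for state in model:
            frames.extend([state["mean"]] * int(state["dur"]))
    return np.vstack(frames)              # (n_frames, n_features) trajectory

# Toy usage: two phones, two states each, 3-dim "spectral" features.
models = [[{"mean": np.array([1.0, 0.5, 0.2]), "dur": 4},
           {"mean": np.array([0.8, 0.6, 0.3]), "dur": 6}],
          [{"mean": np.array([0.2, 0.9, 0.4]), "dur": 5},
           {"mean": np.array([0.1, 0.7, 0.5]), "dur": 5}]]
track = generate_parameters(models)       # feed to an LPC-type synthesizer
```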

Page 54: Speech Synthesis

HMM-based speech synthesis

•  Training from a database
•  Produces excitation and filter parameters for e.g. LPC-type speech generation

Page 55: Speech Synthesis

Text-to-speech synthesis, a hybrid solution

•  Speech training database
•  HMM based system for prediction and unit selection
•  Experimental system
•  Very good evaluation in an international competition (the Blizzard Challenge 2010)

(Block diagram: TRAINING DATA → Analysis → HTS training → Voice database; INPUT TEXT → TTS frontend → Target model construction → Candidate list construction (with state alignment against the voice database) → Selection & boundary decision → Waveform concatenation)

Page 56: Speech Synthesis

A few examples

•  Diphone synthesis: Festival, Arne, Infovox

•  Unit selection synthesis: Festival, AT&T NextGen

•  HMM synthesis

•  Hybrid HMM/Unit selection

•  Limited domain unit selection synthesis

Page 57: Speech Synthesis

Summary

•  Data-driven vs. synthesis by rule.
•  The current synthesis generation is concatenative waveform synthesis.
•  Single unit synthesis, “diphone synthesis”, requires the units to be prosodically modified.
•  Unit selection synthesis aims to use natural prosody and minimal prosodic modification.
•  Issues in waveform synthesis:
   –  Unit definition.
   –  Definition, realization and annotation of the waveform library.
   –  Unit selection – search.
   –  Prosodic modification.

