Speech synthesis
Text-to-speech synthesis
• The automatic transformation from (electronic) text to speech
• The "speaker" is defined by the system design
– Single speaker
• In contrast to speech recognition, the aim is not to handle all speakers and all normal pronunciation variants, but to render one spoken realization of the text that is perceived as natural and intelligible
• A text contains orthographic words, numbers, abbreviations, mnemonics and punctuation
– Linguistic analysis of the text is necessary to
• interpret symbols
• analyze the grammatical structure
• infer the semantic interpretation of the text
Speech Synthesis Torbjørn Svendsen
The processing steps of TTS
• Text analysis: text normalization; analysis of document structure; linguistic analysis
– Output: tagged text
• Phonemic analysis: homograph disambiguation, morphological analysis, letter-to-sound mapping
– Output: tagged phone sequence
• Prosodic analysis: intonation; duration; volume
– Output: control sequence, tagged phones
• Speech synthesis: voice rendering
– Output: synthetic speech
(Diagram: Text → Text analysis → Phonemic analysis → Prosodic analysis → Speech synthesis → Speech)
Text-to-speech synthesis
• Speech synthesis concerns the waveform generation from the annotated symbol sequence (typically a phone sequence)
• Philosophy: rule-based vs. data-driven synthesis
• Method: articulatory synthesis; formant synthesis; concatenative (waveform) synthesis
Quality
• Different strategies give different quality, but also different consistency of quality
• No strategy can currently provide consistently high quality (but it is getting closer)
• Limited-domain synthesis gives high quality within the application domain
(Figure: example systems – Voder, Klattalk, Infovox, Festival, NextGen)
The synthesis space
(Figure: the synthesis space, spanned by cost, bit rate, naturalness, intelligibility, speech knowledge, flexibility, processing needs, units, vocabulary and complexity; adapted from Granström)
Main methods
• Formant synthesis
• Concatenative, or waveform, synthesis
• Articulatory synthesis
Formant synthesis
• Normally a rule-based (knowledge-driven) system, but can also be data driven
• Each formant is specified by its center frequency, bandwidth and, optionally, amplitude
(Diagram: annotated phones → rule system → formant tracks + pitch contour → formant synthesis → synthetic speech)
H_i(z) = 1 / (1 − 2 e^(−π b_i T) cos(2π f_i T) z^(−1) + e^(−2π b_i T) z^(−2))

2nd-order filter with resonance at f_i and bandwidth b_i (T is the sampling interval)
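The resonator above is easy to realize as a difference equation. A minimal sketch in plain NumPy; the `resonate` helper, the 100 Hz excitation and the three formant frequency/bandwidth pairs are illustrative assumptions, not values from the slides:

```python
import numpy as np

def resonator_coeffs(f, b, fs):
    """Denominator coefficients of H(z) = 1/(1 + a1*z^-1 + a2*z^-2)
    for a resonance at f Hz with bandwidth b Hz at sample rate fs."""
    T = 1.0 / fs
    a1 = -2.0 * np.exp(-np.pi * b * T) * np.cos(2.0 * np.pi * f * T)
    a2 = np.exp(-2.0 * np.pi * b * T)
    return a1, a2

def resonate(x, f, b, fs):
    """Run the resonator as a difference equation:
    y[n] = x[n] - a1*y[n-1] - a2*y[n-2]."""
    a1, a2 = resonator_coeffs(f, b, fs)
    y = np.zeros(len(x))
    for n in range(len(x)):
        y[n] = x[n]
        if n >= 1:
            y[n] -= a1 * y[n - 1]
        if n >= 2:
            y[n] -= a2 * y[n - 2]
    return y

# Cascade implementation: three formant resonators in series over a
# 100 Hz pulse train, giving a crude vowel-like output
fs = 8000
x = np.zeros(800)
x[::fs // 100] = 1.0
for f_i, b_i in [(500, 60), (1500, 90), (2500, 120)]:
    x = resonate(x, f_i, b_i, fs)
```

Cascading the sections this way matches the cascade structure discussed on the implementation slide; a parallel structure would instead sum the resonator outputs with per-formant amplitudes.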
Implementation
• Cascade or parallel implementation
• Voiced sounds typically use the cascade implementation, unvoiced sounds the parallel implementation
• LPC filters can also be used
– Normally poorer quality
Klatt formant synthesizer
LPC synthesis
• Example:
– Original
– LPC-coded, all unvoiced / all voiced
– Speed: halved / doubled
– Pitch: halved / doubled
– Melody
(Diagram: pulse generator / noise generator excitation with gain G driving the LPC synthesis filter 1/A(z))
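The source-filter structure in the diagram can be sketched directly: a pulse train or noise excitation driven through an all-pole filter. The first-order A(z) coefficients, gains and signal lengths below are toy placeholders, not values estimated from speech:

```python
import numpy as np

def lpc_synthesize(a, gain, excitation):
    """All-pole LPC synthesis: y[n] = gain*e[n] - sum_k a[k]*y[n-k],
    where a = [1, a1, ..., ap] are the A(z) coefficients."""
    p = len(a) - 1
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        y[n] = gain * excitation[n]
        for k in range(1, p + 1):
            if n - k >= 0:
                y[n] -= a[k] * y[n - k]
    return y

fs = 8000
voiced = np.zeros(400)
voiced[::fs // 100] = 1.0                      # 100 Hz pulse train excitation
unvoiced = np.random.default_rng(0).standard_normal(400)  # noise excitation
a = [1.0, -0.9]                                # toy 1st-order A(z), assumed
out = np.concatenate([lpc_synthesize(a, 1.0, voiced),
                      lpc_synthesize(a, 0.2, unvoiced)])
```

Switching between the pulse and noise generators per frame, with frame-wise a and gain from analysis, is what the "all unvoiced / all voiced" examples on the slide manipulate.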
Rule-based formant generation
• Formants are slowly varying – Update rates of 5-10ms sufficient
• Target values describe stationary conditions, {Fi, Bi} • Rules describe transition between phones
Parameters describe transition shape
Specific rules for all transition types
13 Speech Synthesis Torbjørn Svendsen
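Such rules can be mimicked with a simple interpolation scheme: hold each phone's target and ramp toward the next target over the final frames. The linear transition shape, the 40 ms transition length and the /a/-to-/i/ F1 targets are invented for illustration; real rule systems use transition shapes and durations specific to each phone pair:

```python
import numpy as np

def formant_track(targets, durations_ms, trans_ms=40, frame_ms=5):
    """Frame-rate formant contour: piecewise-constant targets with a
    linear transition toward the next phone over the last trans_ms."""
    track = []
    for i, (f, dur) in enumerate(zip(targets, durations_ms)):
        n = dur // frame_ms
        seg = np.full(n, float(f))
        if i + 1 < len(targets):
            k = min(trans_ms // frame_ms, n)
            # k interior points moving from f toward the next target
            seg[n - k:] = np.linspace(f, targets[i + 1], k + 2)[1:k + 1]
        track.append(seg)
    return np.concatenate(track)

# F1 for an /a/-like then /i/-like phone, 100 ms each (assumed target values)
f1 = formant_track([700, 300], [100, 100])
```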
Rule-based formant generation (cont.)
Formant synthesis
• Flexible
• Produces intelligible speech with few parameters
• Simple implementation
– Rule derivation is complex; development can be costly
• Limited naturalness
• NOTE: given sufficient training data, formant generation can be data driven, e.g. using an HMM in "production mode" to generate the formant tracks (Acero et al.)
– A similar approach for LPC-based synthesis more recently by Tohkura
Articulatory synthesis
• The waveform production is performed by describing the movement of the articulators
– Jaw opening, lip rounding, tongue placement and height, …
– Acoustics and fluid mechanics form the basis
• Limited success
– Complex theory
– Computational difficulties and complexity
Synthesis by concatenation
• Concatenation of stored waveform fragments
– Optional modification of the fragments (duration, pitch, formants)
• Dilemma: use of unmodified fragments will either
– produce audible distortion at concatenation points (phase mismatch, formant and pitch mismatch), or
– lead to an enormous database to cover all phonetic and prosodic events
• How much modification is possible before the degradation is audible?
Some central issues
1. Which unit?
2. How to design the acoustic "library" (inventory)?
– Content, recording conditions, reading style
– Annotation – type, level
– Segmentation and labeling – consistency, effort, automation?
3. How to select the best sequence of units from the acoustic inventory?
4. How to perform prosodic modification of the selected sequence?
1. Which unit?
• Longer units lead to better quality, but
– require more data to be stored
– are more context dependent
Unit requirements
• Low concatenation distortion
– Longer units -> fewer concatenation points
– Units containing attractive concatenation points
• Low prosodic distortion
– A small inventory makes prosodic modification necessary
– Modification introduces distortion
• The unit should be generalizable
– Need to be able to synthesize sequences that were not in the original inventory (except for limited-domain synthesis)
• The unit should be "trainable"
– Finite training data sufficient to estimate or predict all units
Coverage
• Complete coverage of all phonetic and prosodic events is impossible
– Large Number of Rare Events (LNRE)
Some possible unit choices
• Context-independent phonemes
– Bad concatenation properties
• Context-dependent phonemes
– Reduce the discontinuity problems
– Large number (~125k), needs to be reduced, e.g. using generalized triphones or phonetic decision trees
• Diphones (dyads)
– ~2500 possible units
– Reduce the discontinuity problems
– Widespread use
• Sub-phonemic units
– Increasing use (e.g. IBM, AT&T)
– Half-phones (AT&T), phone HMM states (IBM)
Some possible unit choices (cont.)
• Syllables, words and phrases
– Mainly used for limited-domain applications (fixed message repertoire)
– Potentially good quality
– Demand large storage and much data collection
– Computationally demanding: complex search in a large database
– Syllables or demi-syllables are the most interesting
Designing the acoustic inventory
• Recordings from one speaker, appropriately annotated
• The voice talent is very important for the resulting quality
• Design choice: rely on prosodic modification by signal processing, or aim for good coverage of the natural prosodic variation in the database
• Prosodic modification at synthesis – PSOLA-type synthesis
– Typically diphone units
– Normally desirable to have a nearly constant (neutral) F0
– Nonsense words/sentences with (near) full diphone coverage
– Small database (~5 minutes of speech contains the essential units)
Designing the acoustic inventory (2)
• Unit selection synthesis – rely on natural prosodic variation
– Representative speech – the speaking style is defined by the database
– Many representations of each phonetic unit give prosodic variation
– Large database
• Facilitates longer units, variable units
• Requires a search for the best unit sequence
– Rich phonetic and prosodic context
• Typically "real" texts
– Text selection:
• Start with a large number of natural sentences
• Analyze the sentences, predict the phonetic and prosodic realization
• Use a greedy algorithm to obtain the best possible coverage with a small number of sentences (typically 2000-4000)
• Design supplementary sentences to improve coverage
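The greedy selection step can be sketched as a set-cover loop over diphone types: repeatedly pick the sentence that adds the most uncovered units. The toy "sentences" below are bare phone strings standing in for real analyzed text:

```python
def greedy_select(sentences, max_sentences=4000):
    """Greedy coverage: repeatedly pick the sentence covering the most
    diphone types not yet covered; stop when nothing new can be added."""
    def diphones(phones):
        return set(zip(phones, phones[1:]))
    remaining = {i: diphones(s) for i, s in enumerate(sentences)}
    covered, chosen = set(), []
    while remaining and len(chosen) < max_sentences:
        best = max(remaining, key=lambda i: len(remaining[i] - covered))
        gain = remaining[best] - covered
        if not gain:
            break                       # no sentence adds a new unit
        covered |= gain
        chosen.append(best)
        del remaining[best]
    return chosen, covered

# Toy corpus of phone strings (illustrative only)
corpus = [list("sino"), list("nos"), list("sin"), list("ono")]
chosen, covered = greedy_select(corpus)
```

The same loop generalizes to richer unit types (diphones in predicted prosodic contexts) by changing the `diphones` helper.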
Coverage - LNRE
• A large number of units, each with a small probability of occurrence
• If database units are selected randomly, even a small sequence of randomly selected sentences is almost certain to contain a unit that is not in the database
• The unit inventory must be chosen with care
• Fall-back solutions must exist for non-covered units
Annotation
• For small databases, speech can be segmented and annotated manually
– Phonemic and prosodic annotation can be detailed
• For unit selection databases, automation is necessary
– Automatic or semi-automatic methods for segmentation into phonemic and prosodic units
– Annotation can be fairly high-level without loss of quality
– The annotation level and the cost function for unit selection are closely linked
3. Optimal unit string
• The selection problem arises when there are several possible choices for the unit sequence
• Traditional diphone synthesis has only one exemplar of each unit
– Trivial solution
• Selection is based on the desire for naturalness and minimum discontinuity due to
– Different phonetic contexts
– Segmentation errors in the database
– Acoustic variability
– Prosodic differences (pitch discontinuity, formant tracks)
• Search problem
– Must define an objective function to be minimized
Objective function for search
d(Ω, T) = Σ_{j=1..N} d_u(ω_j, t_j) + Σ_{j=1..N-1} d_t(ω_j, ω_{j+1})

Ω = {ω_1, ω_2, ..., ω_N} – candidate segment sequence
T = {t_1, t_2, ..., t_N} – target units

Ω̂ = argmin_Ω d(Ω, T)

Unit cost d_u(·) and transition cost d_t(·); the search matches a lattice of candidate units against the sequence of target units.
Objective function for search (2)
• How to choose the unit and transition cost functions?
– Empirical or data driven
• Empirical strategy:
– Transition cost:
• If two segments were originally spoken in succession, d_t(·) = 0
• Otherwise, the cost is the sum of a prosodic and a coarticulatory cost
• Prosodic cost proportional to the difference in F0 (or log F0) at the boundary
• Coarticulatory cost based on empirical knowledge of perceived distance
– Unit cost:
• Contributions from prosody and context
• Prosodic cost proportional to the difference in F0
• Contextual cost for using a unit from a different phonetic context, based on empirical data
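The empirical transition cost can be transcribed almost literally. The unit representation (dicts with an utterance id, position, boundary F0 and phone identity), the weights and the default coarticulatory cost are all assumptions for illustration; the coarticulation table stands in for the empirical perceptual knowledge:

```python
import math

def transition_cost(u, v, w_pros=1.0, w_coart=1.0, coart=None):
    """Empirical transition cost d_t(u, v) between two candidate units."""
    # Zero cost if the units were originally spoken in succession
    if u["utt"] == v["utt"] and u["pos"] + 1 == v["pos"]:
        return 0.0
    # Prosodic cost: log-F0 mismatch at the join
    prosodic = abs(math.log(u["f0_end"]) - math.log(v["f0_start"]))
    # Coarticulatory cost: looked up per phone pair, default 1.0 (assumed)
    coarticulatory = (coart or {}).get((u["phone"], v["phone"]), 1.0)
    return w_pros * prosodic + w_coart * coarticulatory

u = {"utt": 7, "pos": 3, "phone": "n", "f0_end": 110.0}
v_next = {"utt": 7, "pos": 4, "phone": "o", "f0_start": 108.0}
v_far = {"utt": 2, "pos": 0, "phone": "o", "f0_start": 220.0}
```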
Objective function for search (3)
• Data-driven cost functions
– Transition cost:
• A measure of spectral discontinuity, e.g. the spectral distance in the transition area (distance between the end frame of the preceding unit and the first frame of the succeeding unit)
• (Optional) prosodic cost, e.g. the magnitude of the log(F0) difference
– Unit cost, based on context. Examples:
• The same context means no cost; a different context gives infinite cost
• Generalized triphones (GT): units belonging to the same GT incur no cost, otherwise the cost is infinite
• Phonetic decision trees: e.g. no cost for units at the same leaf node
Optimal unit string selection
• Given
– the objective function to be minimized
– a target sequence from the TTS front end
– a unit inventory
• The minimization can be performed with standard dynamic programming techniques (Viterbi-style)
• Similar to HMM decoding, but with cost values instead of probabilities
• The search can be further simplified by e.g. clustering of units
– Initial search using a representative unit for each cluster
– Search refinement by selecting the best cluster member as the selected unit
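The Viterbi-style minimization over the candidate lattice can be sketched as below. The toy candidate names and cost tables are invented so that the trade-off is visible: the locally cheapest first unit is not on the globally cheapest path.

```python
def select_units(candidates, unit_cost, trans_cost):
    """Minimum-cost path through the lattice: candidates[j] lists the
    database units for target j; total cost = sum of unit costs plus
    transition costs, minimized by dynamic programming."""
    # best[j][k] = (cheapest path cost ending in candidates[j][k], backpointer)
    best = [[(unit_cost(0, u), None) for u in candidates[0]]]
    for j in range(1, len(candidates)):
        col = []
        for u in candidates[j]:
            c, back = min(
                (best[j - 1][k][0] + trans_cost(candidates[j - 1][k], u), k)
                for k in range(len(candidates[j - 1])))
            col.append((c + unit_cost(j, u), back))
        best.append(col)
    # Backtrack from the cheapest end state
    k = min(range(len(best[-1])), key=lambda i: best[-1][i][0])
    path = [k]
    for j in range(len(candidates) - 1, 0, -1):
        k = best[j][k][1]
        path.append(k)
    path.reverse()
    return [candidates[j][k] for j, k in enumerate(path)]

# Toy lattice: two targets, two candidates each (names and costs invented)
cands = [["a1", "a2"], ["b1", "b2"]]
ucost = lambda j, u: {"a1": 0.0, "a2": 1.0, "b1": 0.5, "b2": 0.5}[u]
tcost = lambda u, v: 0.0 if (u, v) == ("a2", "b1") else 2.0
seq = select_units(cands, ucost, tcost)
```

Replacing each candidate list with one representative per cluster, then re-running over the winning cluster's members, gives the two-pass simplification mentioned on the slide.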
4. Prosodic modification
• Techniques for prosodic modification (pitch, duration) are mandatory when the unit inventory is small
• Also desirable for unit selection synthesis, due to LNRE
• Main issue: how to achieve (at least moderate) prosodic modification of a unit (sequence) without introducing annoying distortion
• Audio examples: original – duration modified – pitch modified – duration and pitch modified
(Synchronous) Overlap and Add
• OLA: time-scale modification with a fixed distance between analysis windows; produces irregular pitch periods
• SOLA: the analysis window is placed at the position which gives maximum correlation between the windows
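The correlation search in SOLA can be sketched as follows; the window length, search range and the delayed sinusoid test signal are arbitrary illustrative choices:

```python
import numpy as np

def sola_offset(prev_tail, next_seg, max_shift=20):
    """Shift next_seg by 0..max_shift-1 samples and return the shift whose
    overlap has maximum normalized correlation with prev_tail."""
    L = len(prev_tail)
    best_s, best_c = 0, -np.inf
    for s in range(max_shift):
        seg = next_seg[s:s + L]
        if len(seg) < L:
            break
        c = np.dot(prev_tail, seg) / (
            np.linalg.norm(prev_tail) * np.linalg.norm(seg) + 1e-12)
        if c > best_c:
            best_c, best_s = c, s
    return best_s

# One 40-sample sinusoid period, and the same waveform delayed by 7 samples
t = np.arange(200)
prev_tail = np.sin(2 * np.pi * t[:40] / 40)
next_seg = np.sin(2 * np.pi * (t - 7) / 40)
```

Aligning at the best-correlation shift before overlap-adding is what removes the irregular pitch periods that plain OLA produces.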
Pitch Synchronous OLA (PSOLA)
• The window is pitch synchronous – centered around an excitation pulse
– Duration equal to two pitch periods, 2·T0
• Allows simple modification of the pitch frequency
• Can also modify duration
• Unvoiced sounds:
– Fixed window length (< 10 ms)
– Every other repeated segment can be inverted to avoid periodicities when expanding the duration
• Can provide high quality as long as the degree of modification is relatively low (less than a factor of about 2)
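A stripped-down PSOLA resynthesis for voiced speech: two-period Hann windows centered on each analysis epoch, overlap-added at a new epoch spacing. For clarity it maps analysis epochs one-to-one to synthesis epochs, so the duration changes along with the pitch; a real implementation repeats or skips windows to control duration independently. The impulse-train input and period values are illustrative:

```python
import numpy as np

def psola(x, epochs, new_period):
    """Pitch-modification sketch: windows of 2*T0+1 samples centred on
    each analysis epoch are overlap-added new_period samples apart."""
    T0 = epochs[1] - epochs[0]           # assume near-constant analysis pitch
    win = np.hanning(2 * T0 + 1)
    out = np.zeros(len(epochs) * new_period + 2 * T0 + 1)
    t = T0                               # first synthesis epoch
    for e in epochs:
        seg = x[e - T0: e + T0 + 1]
        if len(seg) != len(win):
            continue                     # skip epochs too close to the edges
        out[t - T0: t + T0 + 1] += win * seg
        t += new_period
    return out

# Impulse train with period 50; raise the pitch by spacing epochs 40 apart
x = np.zeros(500)
x[50::50] = 1.0
epochs = list(range(50, 450, 50))
y = psola(x, epochs, 40)
```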
PSOLA – principle, F0 change
(Figure: original signal, analysis epochs, window shift and the re-harmonized signal for T = 1.25·T0)
PSOLA – duration and F0 modification
PSOLA principle
e(n) = Σ_k δ(n − k·T0)

x(n) = e(n) * s(n) = Σ_k s(n − k·T0)

• s(n) determines the spectral envelope
• Using an appropriate, pitch-synchronous window, T0 can be changed without changing the spectral envelope (exact match at f = k·F0)
• The window type and the degree of change determine the distortion outside the pitch harmonics (interpolated values, with correctness determined by the window sidelobes)
How to determine the synthesis epochs
• t_s(j) – time instant of pitch pulse (epoch) j in the synthesis
• P_s(t) – desired pitch period at time t
• If P_s(t) is slowly varying: t_s(j+1) − t_s(j) = P_s(t_s(j))
• Exact: the next pulse is offset by the mean pitch period within the synthesis interval,

t_s(j+1) − t_s(j) = (1 / (t_s(j+1) − t_s(j))) · ∫[t_s(j), t_s(j+1)] P_s(t) dt

– Since t_s(j+1) appears on both sides, this requires an iterative calculation
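The iterative epoch calculation can be sketched as a fixed-point iteration that starts from the slowly-varying approximation. Evaluating the interval mean with the midpoint rule is an assumption of this sketch; it is exact when P_s(t) is constant or linear, which covers the cases the slides consider:

```python
def next_epoch(t_j, P_s, iters=12):
    """Solve t_{j+1} - t_j = mean of P_s over [t_j, t_{j+1}] by fixed-point
    iteration. The interval mean is taken as P_s at the interval midpoint,
    which is exact for constant or linear P_s."""
    t_next = t_j + P_s(t_j)                   # slowly-varying initial guess
    for _ in range(iters):
        t_next = t_j + P_s((t_j + t_next) / 2.0)
    return t_next
```

With a constant desired period the first guess is already exact; with a linearly rising period the iteration converges to the interval whose length equals the mid-interval period.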
Synthesis epoch calculation
Pitch modification, no time scaling
Original pitch: P_a(t) = t_a(i+1) − t_a(i) (piecewise constant)
Desired pitch: P_s(t) = β(t) · P_a(t)

t_s(j+1) − t_s(j) = (1 / (t_s(j+1) − t_s(j))) · ∫[t_s(j), t_s(j+1)] β(t) P_a(t) dt

P_a(t) is piecewise constant; β(t) is normally constant or linear
Changing the time scale

t_s = D(t_a) = ∫[0, t_a] α(τ) dτ

Presume α(τ) = α (reduced speed when α > 1)

A derivation similar to the pitch-epoch determination gives

t_s(j+1) − t_s(j) = (α / (t_s(j+1) − t_s(j))) · ∫[t_s(j)/α, t_s(j+1)/α] P_a(t) dt

If P_a(t) ≈ P_a (constant in the interval): t_s(j+1) − t_s(j) = P_a
Changing the time scale
All the modifications
Changing both time and pitch:

t_s(j+1) − t_s(j) = (α / (t_s(j+1) − t_s(j))) · ∫[t_s(j)/α, t_s(j+1)/α] β(t) P_a(t) dt
Epoch positioning in database
• The database must be annotated with the pitch pulse locations
• Accurate positioning is necessary for good performance
• Automatic methods using pitch estimation techniques give reasonably good results
• Use of a laryngograph (electroglottograph – EGG) during recording is recommended
– Measures the electrical resistance across the vocal cords, which depends on the glottal opening
– Epochs are found by peak picking on the derivative of the EGG signal
Epochs from EGG signal
• Peak picking on the EGG signal or its time derivative
• Accurate epoch and F0 estimation
• Voiced/unvoiced determination
(Figure: speech signal, EGG signal and the detected pulse locations)
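Peak picking on the EGG derivative can be sketched as below. The synthetic "EGG" signal (an exponential decay restarting at each glottal closure), the relative threshold and the refractory period are illustrative assumptions:

```python
import numpy as np

def epochs_from_egg(egg, fs, max_f0=400.0, thresh=0.5):
    """Glottal closure instants: local maxima of the EGG time derivative
    above a relative threshold, at least one minimal period apart."""
    d = np.diff(egg)
    level = thresh * d.max()
    min_dist = int(fs / max_f0)        # refractory period at the highest F0
    epochs, last = [], -min_dist
    for n in range(1, len(d) - 1):
        if (d[n] >= level and d[n] >= d[n - 1] and d[n] >= d[n + 1]
                and n - last >= min_dist):
            epochs.append(n)
            last = n
    return epochs

# Synthetic EGG: sharp rise at each closure, slow fall (period 80 = 100 Hz at 8 kHz)
fs = 8000
n = np.arange(400)
egg = np.exp(-(n % 80) / 20.0)
gci = epochs_from_egg(egg, fs)
```

The detected instants double as F0 estimates (the reciprocal of the epoch spacing) and as a voiced/unvoiced cue (no strong derivative peaks during unvoiced stretches).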
PSOLA limitations
• Amplitude mismatch
• Voiced fricatives
– Increased buzziness
• All modification introduces distortion and unnaturalness
– The degree depends on the amount of modification
• Limits on the maximum modification
Phase mismatches
• Wrong positioning of pitch pulses in database
• Causes glitches in output
Pitch mismatches
• Correct F0 and pulse positions
• Different F0 in the segments causes spectral and waveform discontinuities
HMMs for synthesis
• In speech recognition, hidden Markov models are used to model speech production
– The task of the recognizer is to find the model that best explains the observed utterance
• If the HMM is used for generating observations, the produced feature-vector sequence can be used to produce speech from a given unit sequence (phone sequence)
– The feature vectors must be suitable for speech production
– Combination of continuous and discrete elements
– Modifications to HMM theory are necessary to facilitate the generative mode
– Potential for efficient and flexible synthesis
• This is the basis for HMM-based synthesis
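A deliberately simplified picture of this generative mode: each state emits its mean vector for a predicted number of frames. Real HMM synthesis additionally imposes dynamic (delta) feature constraints to smooth the trajectory, which is one of the theory modifications mentioned above; the state means and durations here are invented:

```python
import numpy as np

def hmm_generate(state_means, state_durations):
    """Feature-vector sequence from an HMM run in 'production mode'
    (simplified): state j outputs its mean vector for
    state_durations[j] frames, with no trajectory smoothing."""
    frames = [np.tile(m, (d, 1)) for m, d in zip(state_means, state_durations)]
    return np.vstack(frames)

# Toy 3-state phone model with 2-dimensional "spectral" features
means = np.array([[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]])
features = hmm_generate(means, [3, 5, 2])
```

The resulting feature matrix would then drive a vocoder-style filter (e.g. the LPC generation on the next slide) to produce the waveform.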
HMM-based speech synthesis
• Training from a database
• Produces excitation and filter parameters for e.g. LPC-type speech generation
Text-to-speech synthesis, a hybrid solution
• Speech training database
• HMM-based system for prediction and unit selection
• Experimental system
• Very good evaluation in an international competition (Blizzard Challenge 2010)
(Diagram: HTS training and analysis of the training data build a voice database; at synthesis time the TTS frontend, target model construction, candidate list construction, state alignment, selection & boundary decision and waveform concatenation turn the input text into speech)
A few examples
• Diphone synthesis: Festival, Arne, Infovox
• Unit selection synthesis: Festival, AT&T NextGen
• HMM synthesis
• Hybrid HMM/Unit selection
• Limited domain unit selection synthesis
Summary
• Data-driven vs. synthesis by rule.
• The current synthesis generation is concatenative – waveform synthesis.
• Single-unit synthesis, "diphone synthesis", requires units to be prosodically modified.
• Unit selection synthesis aims to use natural prosody with minimal prosodic modification.
• Issues in waveform synthesis:
– Unit definition.
– Definition, realization and annotation of the waveform library.
– Unit selection – search.
– Prosodic modification.