2015©Shinnosuke TAKAMICHI
09/19/2015
Prosody-Controllable HMM-Based
Speech Synthesis Using Speech Input
Yuri Nishigaki, Shinnosuke Takamichi, Tomoki Toda,
Graham Neubig, Sakriani Sakti, Satoshi Nakamura (NAIST)
MLSLP2015 in Aizu Univ.
Speech-based creative activities
and HMM-based speech synthesis
Singing voice Speech
Advertisement Live concert Narration Next?
Video avatar
Voice actor
…
Useful method: HMM-based speech synthesis [Tokuda et al., 2013.]
“Synthesize!”
Synthetic speech parameters
text speech
Manual control of synthetic speech
Laugh
Sad
Regression
Multi-Regression HMM [Nose et al., 2007.]
Manually manipulating HMM parameters
User
User
These methods are useful, but it is difficult to control the synthetic speech exactly as the user intends.
Motivation of this study
Functions we want
– Original capability of HMM-based TTS
– Speech-based control
• Intuitive to control
• Make synthetic speech mimic input speech prosody
Our work
– Speech synthesis having both functions
Synthesize System
Synthesize “Synthesize.”
MR-HMM etc.
Similar to VOCALISTENER for singing voice control
Overview of the proposed system
(Only text is input.)
Input text → Text analysis → Parameter generation (synthesis HMM) → Waveform generation → Synthetic speech
(Original HMM-based speech synthesis)
Overview of the proposed system
(Text & speech are input.)
Input text → Text analysis
Input speech → Speech analysis
Duration extraction (alignment HMM, synthesis HMM) → Parameter generation (synthesis HMM) → F0 modification → Waveform generation → Synthetic speech
Duration extraction module
Inputs: feature of input speech, context of input text
HMM alignment (alignment HMM) → Duration of input speech → Duration generation (synthesis HMM) → State duration of synthetic speech
Output goes to: parameter generation
Alignment accuracy & duration unit
How to build alignment HMMs suitable for input speech?
– → Use pre-recorded speech uttered by the users
– Large amounts → user-dependent HMMs
– Small amounts → HMMs adapted from original alignment HMMs
How to map the input speech duration to synthetic speech?
– Alignment/synthesis HMM-states represent different speech segments.
– Which is better, HMM-state, phone, or mora-level duration unit?
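One way to picture the duration-unit question is a phone-level mapping, sketched below. This is a minimal illustration under stated assumptions, not the method confirmed by the slides: the function name `map_phone_durations`, the frame-count inputs, and the proportional rescaling rule are all hypothetical. It takes per-phone durations from aligning the input speech and rescales the state durations predicted by the synthesis HMMs so that each phone keeps the input speech's length.

```python
def map_phone_durations(input_phone_frames, synth_state_frames):
    """Rescale synthesis-HMM state durations to match input phone durations.

    input_phone_frames: frames per phone, from aligning the input speech
                        (hypothetical input format).
    synth_state_frames: frames per state for each phone, as predicted by
                        the synthesis HMMs (hypothetical input format).
    Returns a list of per-phone state durations whose sums equal the
    input phone durations.
    """
    result = []
    for target, states in zip(input_phone_frames, synth_state_frames):
        total = sum(states)
        if total == 0:
            # Nothing to scale: fall back to spreading frames evenly.
            weights = [1.0] * len(states)
            total = float(len(states))
        else:
            weights = [float(s) for s in states]
        # Real-valued durations proportional to the predicted ones.
        ideal = [target * w / total for w in weights]
        frames = [int(x) for x in ideal]  # floor to whole frames
        # Give the leftover frames to the states with the largest
        # fractional parts so the phone total is preserved exactly.
        remainder = target - sum(frames)
        order = sorted(range(len(states)),
                       key=lambda i: ideal[i] - frames[i],
                       reverse=True)
        for i in order[:remainder]:
            frames[i] += 1
        result.append(frames)
    return result


# Two phones, five states each; the input speech is shorter on phone 2.
mapped = map_phone_durations([10, 6], [[1, 2, 3, 2, 2], [2, 2, 2, 2, 2]])
```

A state-level unit would copy aligned state durations directly, and a mora-level unit would apply the same rescaling over all states in a mora; only the grouping changes.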
Speech parameter generation module
Inputs: context of input text, state duration (from duration extraction)
Parameter generation (synthesis HMM)
Outputs: spectrum of synthetic speech (to waveform generation), F0 generated from HMMs (to F0 modification)
F0 modification module
Inputs: feature of input speech, F0 generated from HMMs (from parameter generation)
F0 conversion → U/V region modification → F0 of synthetic speech
Output goes to: waveform generation
F0 conversion &
unvoiced/voiced modification
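[Figure: F0 contours over time for the reference (generated from HMMs), the input speech, the F0-converted contour, and the U/V-modified contour. Step labels in the figure: linear conversion, spline interpolation.]
F0 conversion adjusts the F0 range of the input speech to match the reference.
U/V modification adjusts the unvoiced/voiced (U/V) regions of the input speech to match the reference.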
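The two steps can be sketched as follows. This is a hedged illustration, not the authors' implementation. It assumes that "linear conversion" means matching the mean and standard deviation of voiced log F0 to the reference, that both contours have the same number of frames, that each has at least one voiced frame, and that unvoiced frames are stored as 0.0. For brevity, newly voiced frames are filled by linear interpolation, which stands in for the spline interpolation named on the slide. All function names are hypothetical.

```python
import math


def _log_stats(f0):
    """Mean and standard deviation of log F0 over voiced frames (F0 > 0)."""
    logs = [math.log(v) for v in f0 if v > 0]
    mean = sum(logs) / len(logs)
    var = sum((x - mean) ** 2 for x in logs) / len(logs)
    return mean, math.sqrt(var)


def convert_f0(input_f0, reference_f0):
    """Linearly map the input's voiced log F0 onto the reference's range."""
    mu_in, sd_in = _log_stats(input_f0)
    mu_ref, sd_ref = _log_stats(reference_f0)
    scale = sd_ref / sd_in if sd_in > 0 else 1.0
    return [
        math.exp((math.log(v) - mu_in) * scale + mu_ref) if v > 0 else 0.0
        for v in input_f0
    ]


def _interp(i, voiced_idx, f0):
    """Linearly interpolate F0 at frame i from the nearest voiced frames."""
    left = [j for j in voiced_idx if j < i]
    right = [j for j in voiced_idx if j > i]
    if left and right:
        a, b = left[-1], right[0]
        w = (i - a) / (b - a)
        return f0[a] * (1 - w) + f0[b] * w
    if left:
        return f0[left[-1]]   # hold the last voiced value
    if right:
        return f0[right[0]]   # hold the first voiced value
    raise ValueError("no voiced frames to interpolate from")


def modify_uv(converted_f0, reference_f0):
    """Make the voiced/unvoiced pattern follow the reference."""
    voiced_idx = [i for i, v in enumerate(converted_f0) if v > 0]
    out = []
    for i, ref in enumerate(reference_f0):
        if ref <= 0:
            out.append(0.0)                      # V -> U (or stays U)
        elif converted_f0[i] > 0:
            out.append(converted_f0[i])          # stays V
        else:
            out.append(_interp(i, voiced_idx, converted_f0))  # U -> V
    return out


input_f0 = [0.0, 100.0, 0.0, 200.0, 150.0]       # made-up values in Hz
reference_f0 = [0.0, 200.0, 250.0, 300.0, 0.0]   # made-up values in Hz
converted = convert_f0(input_f0, reference_f0)
modified = modify_uv(converted, reference_f0)
```

In this toy example, frame 2 is switched from unvoiced to voiced and frame 4 from voiced to unvoiced, so the result keeps the shape of the input contour while following the reference's U/V pattern.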
EXPERIMENTAL EVALUATION
Experimental Setup
Content                          Value/Setting
User                             4 Japanese speakers (2 male & 2 female)
Target speaker                   1 Japanese female speaker
Training data of synthesis HMMs  450 phoneme-balanced sentences, 16 kHz-sampled, 5 ms shift, reading style
Evaluation data                  53 phoneme-balanced sentences
Speech features                  25-dim. mel-cepstrum, log F0, 5-band aperiodicity
Speech analyzer                  STRAIGHT [Kawahara et al., 1999.]
Text analyzer                    Open JTalk
Acoustic model                   5-state HSMM [Zen et al., 2007.]
1. duration unit & alignment HMM adaptation
2. synthesis HMM adaptation
3. effect of U/V modification
Evaluation 1: duration unit &
alignment HMM adaptation
3 duration units
– State / phoneme / mora-level duration
4 HMMs using different amounts of pre-recorded speech
– 0 … target-speaker-dependent HMMs (= synthesis HMM)
– 1 … HMMs adapted using 1 utterance uttered by the user
– 56 … HMMs adapted using 56 utterances
– 450 … user-dependent HMMs
Evaluation
– MOS test on naturalness of synthetic speech
– DMOS test on prosody mimicking ability of synthetic speech
• Input speech is presented as reference.
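As a small illustration of how MOS or DMOS ratings are usually summarized, the sketch below computes a mean score with an approximate 95% confidence interval. The ratings are made up for the example, the function name `mos_with_ci` is hypothetical, and the normal approximation (factor 1.96) is an assumption; the slides do not state which statistical test was used.

```python
import math


def mos_with_ci(scores):
    """Return (mean, half-width of an approximate 95% confidence interval).

    scores: listener ratings on the 1-to-5 scale (at least two ratings).
    Uses the sample standard deviation and a normal approximation.
    """
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    half_width = 1.96 * math.sqrt(var / n)
    return mean, half_width


# Made-up ratings for one system, for illustration only.
mean, half_width = mos_with_ci([3, 4, 5, 4, 4])
```

Two systems whose intervals overlap heavily would typically be reported as showing no significant difference, though a proper test on the paired ratings is the safer check.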
Result 1: duration unit &
alignment HMM adaptation
[Figure: MOS on naturalness and DMOS on prosody mimicking ability (scale 1 to 5) for state-, phone-, and mora-level duration units, with alignment HMMs built from 0, 1, 56, and 450 utterances. Both panels mark comparisons with "No significant diff."]
We can confirm that (1) adaptation is effective, and
(2) phoneme-level duration is relatively robust.
Experiment 2: Effectiveness of U/V
modification in naturalness
[Figure: Preference score on naturalness [%] (0 to 100) for Spkr1 to Spkr4, without (w/o) vs. with (w/) modification; U/V modification ratio [%] (0 to 20) for Spkr1 to Spkr4, split into U->V and V->U modification.]
U/V modification can improve the naturalness!
(especially when many U frames of input speech are fixed.)
Conclusion
2 functions to control synthetic speech
– An original function of HMM-based TTS
• MR-HMM or manual control
– Speech-based control
• Intuitive for users
2 main modules of our system
– Mimic duration.
• Copy duration of input speech to synthetic speech.
– Mimic F0 patterns.
• Copy dynamic F0 pattern of input speech to synthetic speech.
Future work
– HMM selection using text & speech