Post on 04-Jan-2016
ENEE408G Capstone -- Multimedia Signal Processing (F'05)
Digital Speech Processing and Coding
Fall’05 Instructor: Carol Espy-Wilson
Electrical & Computer Engineering
University of Maryland, College Park
http://www.ece.umd.edu/class/enee408g/
http://umd.blackbloard.com/
minwu@umd.edu
ENEE408G Spring 2004 Lecture-2
ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [2]
Last Lecture
Course overview and logistics
Bring multimedia to digital world: sampling & quantization
Introduction to speech processing
– Different aspects of speech
Friday Lab Session
– Speech Processing, Coding, Recognition, & HCI
Today: speech processing, coding, synthesis
UMCP ENEE408G Slides (created by M.Wu & R.Liu © 2002)
Speech Production
Source-Filter View of Speech Production (Stevens 1999)
Source Spectrum
Vocal tract transfer function
Radiation Characteristics
Power spectrum of speech signal
[Figure: spectrogram of “Sprouted grains and seeds are used in salads and dishes such as chop suey”; axes: time (sec) and frequency (ticks at 2000 to 8000 Hz). A second panel covers roughly 0.1 to 0.5 sec, marks F2, and labels the segments fricative, stop consonant, glide, vowel, stop consonant, vowel.]
UMCP ENEE408G Slides (created by Carol Espy-Wilson © 2004)
Phonetic Features (Chomsky & Halle, 1968)
There are three kinds of phonetic features
– Source features determine the kind of excitation signal
– Manner of articulation features determine how open or closed the vocal tract is
– Place of articulation features determine the location of the primary constriction
Source feature “voiced”
+voiced: /z/
-voiced: /s/
Source Feature voiced
[Figure: spectrogram of “Sprouted”, time (sec) vs. frequency (Hz). Vertical striations mark +voiced regions; turbulence marks -voiced regions.]
Glottal Source (Klatt & Klatt 1990)
Voice Quality-APP Detector
Modal Voice
Creaky Voice
Breathy Voice
Manner feature “sonorant”
+sonorant: vowel, primary source at glottis
-sonorant: /z/, primary source above the glottis at alveolar ridge
Manner Feature sonorant
[Figure: spectrogram of “Sprouted”, time (sec) vs. frequency (Hz). Low-frequency energy marks +sonorant regions; high-frequency energy marks -sonorant regions.]
Place feature for stop consonants
/p/: +labial
/t/: +alveolar
Place Feature Labial vs. Alveolar
[Figure: spectrum of labial /b/, dB vs. frequency (Hz), showing a falling spectral prominence.]
Place Feature Labial vs. Alveolar
[Figure: spectra of labial /p/ and alveolar /t/, dB vs. frequency (Hz). The labial shows a falling spectral prominence; the alveolar shows a rising spectral prominence.]
Source-Filter Theory
First “speaking machine” in 1930s NY World’s Fair
– 14 keys, 1 wristband, 1 pedal
Modeling speech production as a linear system
– Sound sources: either voiced or unvoiced
– Voiced sound: modeled by a generator of pulses
– Unvoiced sound: modeled by a white noise generator
– Articulation: modeled by a cascade of single-resonance (pole) digital filters
Figure 1 of SPM May’98 Speech Survey
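The source-filter model above can be sketched in a few lines: a pulse train (voiced) or white noise (unvoiced) drives a cascade of two-pole resonators. This is a minimal sketch, not the synthesizer from the cited figure; the function names, the 8 kHz sampling rate, the 100 Hz pitch, and the formant frequencies and bandwidths are all illustrative assumptions.

```python
import math
import random

def resonator(x, freq_hz, bw_hz, fs):
    """Single-resonance (two-pole) digital filter with unity gain at DC."""
    r = math.exp(-math.pi * bw_hz / fs)     # pole radius set by the bandwidth
    theta = 2.0 * math.pi * freq_hz / fs    # pole angle set by the center frequency
    b1 = 2.0 * r * math.cos(theta)
    b2 = -r * r
    gain = 1.0 - b1 - b2
    y1 = y2 = 0.0
    out = []
    for v in x:
        y = gain * v + b1 * y1 + b2 * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def pulse_train(n, pitch_hz, fs):
    """Voiced source: one unit pulse per pitch period."""
    period = int(round(fs / pitch_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def white_noise(n, seed=0):
    """Unvoiced source: Gaussian white noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def synthesize(source, formants, fs):
    """Articulation: pass the source through a cascade of resonators."""
    y = source
    for freq_hz, bw_hz in formants:
        y = resonator(y, freq_hz, bw_hz, fs)
    return y

FS = 8000                                            # assumed sampling rate (Hz)
FORMANTS = [(700, 80), (1200, 100), (2600, 150)]     # assumed (frequency, bandwidth) pairs
voiced = synthesize(pulse_train(800, 100, FS), FORMANTS, FS)
unvoiced = synthesize(white_noise(800), FORMANTS, FS)
```

Once the start-up transient has decayed, the voiced output repeats every pitch period, while the unvoiced output keeps the same spectral envelope without periodicity.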
UMCP ENEE408G Slides (created by M.Wu © 2003)
Linear Separable Model for Speech Production
Vocal tract is modeled as a linear time-varying system
– Parameters of the linear system are slowly varying
– Excited by time-varying source (voiced or unvoiced)
Practical models
– Model each speech frame as Linear Time-Invariant
– Excited by either voiced or unvoiced source
– Allow overlaps in neighbouring frames
Figure 3.2 of Furui’s book
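Frame-based processing with overlap might look like the sketch below. The 20 ms frame length, the 50% overlap, and the name `split_into_frames` are assumptions chosen for illustration, not values given in the slides.

```python
import math

def split_into_frames(signal, frame_len, hop):
    """Cut a signal into frames; neighbours share frame_len - hop samples."""
    frames = []
    start = 0
    while start + frame_len <= len(signal):
        frames.append(signal[start:start + frame_len])
        start += hop
    return frames

FS = 8000                        # assumed sampling rate (Hz)
FRAME_LEN = FS * 20 // 1000      # 20 ms frame -> 160 samples
HOP = FRAME_LEN // 2             # 50% overlap -> 80 samples
signal = [math.sin(2.0 * math.pi * 200.0 * n / FS) for n in range(FS)]
frames = split_into_frames(signal, FRAME_LEN, HOP)
```

Each frame can then be treated as the output of a linear time-invariant system with its own parameters.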
UMCP ENEE408G Slides (created by R. Liu & M.Wu © 2002)
Speech Coding
Statistical Properties of Speech (Digital Speech Processing by Rabiner and Schafer)
Statistical Properties of Speech (Digital Speech Processing by Rabiner and Schafer)
Lowpass filtered (0-3400 Hz)
Bandpass filtered (200-3400 Hz)
Statistical Properties of Speech (Digital Speech Processing by Rabiner and Schafer)
Digital Coding of Speech
[Figure: bit-rate axis in kbps with marks at 200, 64, 16, 9.6, 7.2, 4.8, 0.05. Waveform coding covers the high-rate end and source coding the low-rate end. Quality labels from high to low rate: broadcast quality, toll quality, communications quality, synthetic quality.]
Waveform coders: quantize speech samples directly at high bit rates.
Source coders (vocoders): use knowledge of speech production to parameterize the signal (model based)
Hybrid coders: partly waveform based and partly model based (2.4-16 kbps)
Information Capacity $I = B f_s$
PCM coding
How to encode a signal into bits?
– Sample and perform uniform quantization (2 parameters: $\Delta$, the equal quantization step size, and B, the # of bits)
– “Pulse Code Modulation” (PCM)
– 8 bits per sample ~ good for speech
– 16 bits ~ needed for high-quality music
Tradeoff between fidelity and file size
How to “squeeze” out redundancy?
[Diagram: digitize/capture device: input signal I(x,y) → Sampler → Quantizer → Encoder → transmit]
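The quantizer and encoder stages can be sketched as a uniform B-bit quantizer over the range [-Xmax, Xmax]. This is a minimal sketch assuming a mid-rise quantizer with clipping at the range limits; `pcm_encode` and `pcm_decode` are illustrative names.

```python
import math

def pcm_encode(x, x_max, bits):
    """Map a sample to an integer code in [0, 2**bits - 1] (uniform steps)."""
    levels = 2 ** bits
    step = 2.0 * x_max / levels
    index = math.floor((x + x_max) / step)
    return min(max(index, 0), levels - 1)     # clip samples outside the range

def pcm_decode(index, x_max, bits):
    """Map a code back to the centre of its quantization cell."""
    step = 2.0 * x_max / (2 ** bits)
    return -x_max + (index + 0.5) * step

X_MAX = 1.0
BITS = 8
STEP = 2.0 * X_MAX / 2 ** BITS
test_values = [i / 100.0 - 1.0 for i in range(201)]
decoded = [pcm_decode(pcm_encode(v, X_MAX, BITS), X_MAX, BITS) for v in test_values]
```

For in-range samples the reconstruction error never exceeds half a step.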
Discussion on Improving PCM (1)
2 parameters: step size $\Delta$, # of bits B
Peak-to-peak range is $2X_{\max}$, so $\Delta = \frac{2X_{\max}}{2^B}$
Assume $e[n] = \hat{x}[n] - x[n]$
– where e[n] is uncorrelated with x[n], and it is uniformly distributed: $p_e(e) = \frac{1}{\Delta}$ for $-\frac{\Delta}{2} \le e \le \frac{\Delta}{2}$
$\sigma_e^2 = \frac{\Delta^2}{12}$
$SNR = \frac{\sigma_x^2}{\sigma_e^2} = \frac{(3)\,2^{2B}}{[X_{\max}/\sigma_x]^2}$
$SNR(dB) = 6B + 4.77 - 20\log_{10}\left[\frac{X_{\max}}{\sigma_x}\right]$
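The SNR rule can be checked numerically. The sketch below quantizes a uniformly distributed test signal (an assumption that makes the error exactly uniform) and compares the measured SNR with the formula. It uses 6.02 dB per bit, the unrounded value of $20\log_{10}2$, where the slide rounds to 6.

```python
import math
import random

def quantize(x, x_max, bits):
    """Uniform mid-rise quantizer over [-x_max, x_max]."""
    step = 2.0 * x_max / 2 ** bits
    index = min(max(math.floor((x + x_max) / step), 0), 2 ** bits - 1)
    return -x_max + (index + 0.5) * step

def measured_snr_db(samples, x_max, bits):
    signal_power = sum(v * v for v in samples)
    noise_power = sum((quantize(v, x_max, bits) - v) ** 2 for v in samples)
    return 10.0 * math.log10(signal_power / noise_power)

def predicted_snr_db(sigma_x, x_max, bits):
    return 6.02 * bits + 4.77 - 20.0 * math.log10(x_max / sigma_x)

rng = random.Random(1)
X_MAX = 1.0
samples = [rng.uniform(-X_MAX, X_MAX) for _ in range(20000)]
sigma_x = math.sqrt(sum(v * v for v in samples) / len(samples))
results = {b: (measured_snr_db(samples, X_MAX, b), predicted_snr_db(sigma_x, X_MAX, b))
           for b in (6, 8, 10)}
for bits, (measured, predicted) in results.items():
    print(bits, round(measured, 2), round(predicted, 2))
```

Each extra bit adds about 6 dB, and the last term shows the penalty when the signal occupies only a small part of the quantizer range.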
Uniform quantization (Digital Speech Processing by Rabiner and Schafer)
Discussion on Improving PCM (1)
Uniform quantization may give an inconsistent range of relative errors
– E.g., an error of +/- 2 is 20% at amplitude 10 vs. 2% at amplitude 100
Non-uniform quantization
– Assign smaller quantization step size at small amplitude to maintain a consistent range of relative quantization errors over the entire dynamic range
– Can apply non-linear transform before uniform quantization via “companding” (compression-expansion)
– $\mu$-law companding: international standard for 64kbps speech
Discussion on Improving PCM (1)
$y[n] = \ln|x[n]|$
$x[n] = e^{y[n]} \, \mathrm{sign}(x[n])$, where $\mathrm{sign}(x[n]) = +1$ for $x[n] \ge 0$ and $-1$ for $x[n] < 0$
$\hat{y}[n] = \ln|x[n]| + \varepsilon[n]$
$\hat{x}[n] = e^{\hat{y}[n]} \, \mathrm{sign}(x[n]) = |x[n]| \, \mathrm{sign}(x[n]) \, e^{\varepsilon[n]} = x[n] \, e^{\varepsilon[n]}$
$\hat{x}[n] \approx x[n](1 + \varepsilon[n]) = x[n] + x[n]\varepsilon[n]$
$\hat{x}[n] = x[n] + e[n]$, so $e[n] = x[n]\varepsilon[n]$
$SNR = \frac{\sigma_x^2}{\sigma_e^2} = \frac{\sigma_x^2}{\sigma_x^2 \sigma_\varepsilon^2} = \frac{1}{\sigma_\varepsilon^2}$
Discussion on Improving PCM (1) (Digital Speech Processing by Rabiner and Schafer)
But $\ln[0] = -\infty$: not practical
$y[n] = X_{\max} \frac{\log\left[1 + \mu \frac{|x[n]|}{X_{\max}}\right]}{\log[1 + \mu]} \, \mathrm{sign}(x[n])$
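A minimal sketch of the $\mu$-law compressor above and its inverse (the expander). The value $\mu = 255$ is an assumption, chosen because it is the value commonly used in telephony; the function names are illustrative.

```python
import math

def mu_law_compress(x, x_max=1.0, mu=255.0):
    """y = X_max * log(1 + mu*|x|/X_max) / log(1 + mu) * sign(x)."""
    sign = 1.0 if x >= 0 else -1.0
    return x_max * math.log(1.0 + mu * abs(x) / x_max) / math.log(1.0 + mu) * sign

def mu_law_expand(y, x_max=1.0, mu=255.0):
    """Inverse of mu_law_compress."""
    sign = 1.0 if y >= 0 else -1.0
    return x_max / mu * ((1.0 + mu) ** (abs(y) / x_max) - 1.0) * sign

inputs = [-1.0, -0.3, -0.01, 0.0, 0.01, 0.3, 1.0]
compressed = [mu_law_compress(v) for v in inputs]
restored = [mu_law_expand(v) for v in compressed]
```

Unlike the pure logarithm, the compressor is well defined at zero, and it stretches small amplitudes so that a uniform quantizer applied to y[n] gives them finer effective steps.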
Discussion on Improving PCM (1): Log Companding (Digital Speech Processing by Rabiner and Schafer)
Discussion on Improving PCM (2)
Quantized PCM values may not be equally likely
– Can we do better than encoding each value using the same # of bits?
Example
– P(“0”) = 0.5, P(“1”) = 0.25, P(“2”) = 0.125, P(“3”) = 0.125
– If we use the same # of bits for all values: need 2 bits to represent the four possibilities if treated equally
– If we use fewer bits for likely values such as “0” ~ Variable Length Codes (VLC): “0” => [0], “1” => [10], “2” => [110], “3” => [111]; uses 1.75 bits on average ~ saves 0.25 bit per sample!
Bring probability into the picture
– Use probability distribution to reduce average # bits per quantized sample
How to Encode Correlated Sequence?
Consider: high correlation between successive samples
Predictive coding
– Basic principle: remove redundancy between successive samples and only encode the residual between actual and predicted values
– Residue usually has much smaller dynamic range: allows fewer quantization levels for the same MSE => get compression
– Compression efficiency depends on intersample redundancy
First try
[Diagram: Encoder: e(n) = u(n) − u’P(n) with predictor u’P(n) = u(n-1); e(n) → Quantizer → eQ(n). Decoder: uQ(n) = eQ(n) + uP(n) with predictor uP(n) = uQ(n-1).]
Predictive Coding (cont’d)
Problem with 1st try
– Inputs to the predictor are different at encoder and decoder: decoder doesn’t know u(n)!
– Mismatch error could propagate to future reconstructed samples
Solution: Differential PCM (DPCM)
– Use quantized sequence uQ(n) for prediction at both encoder and decoder
– Prediction error e(n)
– Quantized prediction error eQ(n)
– Distortion d(n) = e(n) – eQ(n)
Think: what predictor to use?
[Diagram: Encoder: e(n) = u(n) − uP(n) → Quantizer → eQ(n); uQ(n) = eQ(n) + uP(n) feeds the predictor uP(n) = uQ(n-1). Decoder: uQ(n) = eQ(n) + uP(n) with uP(n) = uQ(n-1).]
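The difference between the first try and DPCM shows up clearly on a slow ramp. This is a minimal sketch assuming the previous-sample predictor from the diagrams, a rounding quantizer with step 1, and an illustrative ramp input; the function names are hypothetical.

```python
def quantize(e, step):
    """Uniform quantizer: nearest multiple of the step size."""
    return step * round(e / step)

def encode_first_try(u, step):
    """Open loop: predict from the unquantized previous sample u(n-1)."""
    prev = 0.0
    residuals = []
    for x in u:
        residuals.append(quantize(x - prev, step))
        prev = x
    return residuals

def encode_dpcm(u, step):
    """Closed loop: predict from the reconstructed previous sample uQ(n-1)."""
    prev_q = 0.0
    residuals = []
    for x in u:
        e_q = quantize(x - prev_q, step)
        residuals.append(e_q)
        prev_q += e_q                      # same reconstruction the decoder makes
    return residuals

def decode(residuals):
    """Decoder shared by both schemes: uQ(n) = eQ(n) + uQ(n-1)."""
    prev_q = 0.0
    out = []
    for e_q in residuals:
        prev_q += e_q
        out.append(prev_q)
    return out

STEP = 1.0
u = [0.3 * n for n in range(50)]
err_first_try = max(abs(a - b) for a, b in zip(u, decode(encode_first_try(u, STEP))))
err_dpcm = max(abs(a - b) for a, b in zip(u, decode(encode_dpcm(u, STEP))))
```

In the open-loop scheme every residual of 0.3 is quantized to zero, so the decoder never moves and the error grows with n. In DPCM the reconstruction error equals the quantization error of the current residual, so it stays within half a step.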
Linear Prediction Analysis of Speech
$\{a_i\}$ are called Linear Prediction Coefficients (LPC)
[Diagram: Analysis: s[n] enters a tapped delay line of $z^{-1}$ elements with weights $a_1, a_2, \ldots, a_P$; the weighted sum is subtracted from s[n] to give e[n]. Synthesis: e[n] drives the same delay line in a feedback loop to give s[n].]
Error Minimization
$\min_{\{a_k\}} E = \sum_n (e[n])^2 = \sum_n (s[n] - \hat{s}[n])^2$
Normal equations
$S a = \hat{s}$
Can be solved using the famous Levinson Recursion, which leads to lattice formulation of the linear prediction solution
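A minimal sketch of the Levinson recursion, assuming the autocorrelation form of the normal equations, $\sum_{k=1}^{P} a_k R(|i-k|) = R(i)$ for $i = 1, \ldots, P$. The test signal is a synthetic second-order autoregressive process with assumed coefficients 1.3 and -0.6. The intermediate values k are the reflection coefficients used by the lattice formulation.

```python
import random

def autocorrelation(s, max_lag):
    """R(k) = sum_n s[n] s[n-k] for k = 0..max_lag."""
    return [sum(s[n] * s[n - k] for n in range(k, len(s))) for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Return (a_1..a_P, reflection coefficients, final prediction error)."""
    a = [0.0] * (order + 1)               # a[0] is unused
    error = r[0]
    reflections = []
    for i in range(1, order + 1):
        k = (r[i] - sum(a[j] * r[i - j] for j in range(1, i))) / error
        reflections.append(k)
        updated = a[:]
        updated[i] = k
        for j in range(1, i):
            updated[j] = a[j] - k * a[i - j]
        a = updated
        error *= 1.0 - k * k
    return a[1:], reflections, error

rng = random.Random(0)
s = [0.0, 0.0]
for _ in range(5000):
    s.append(1.3 * s[-1] - 0.6 * s[-2] + rng.gauss(0.0, 1.0))

ORDER = 2
r = autocorrelation(s, ORDER)
a, reflections, error = levinson_durbin(r, ORDER)
```

The estimated coefficients should land close to the values used to generate the signal, and every reflection coefficient has magnitude below 1, which corresponds to a stable synthesis filter.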
Source-Filter View of Speech Production
[Diagram: e(t) → v(t) → r(t) → s(t), with spectra $E(\omega)$, $V(\omega)$, $R(\omega)$, $S(\omega)$]
s(t) = e(t)*v(t)*r(t)
$S(\omega) = E(\omega)V(\omega)R(\omega)$
All-Pole Modeling of Speech
Auto-regressive (AR) model: all-pole filter
$H(z) = G(z)V(z)R(z) = \frac{\sigma}{A(z)} = \frac{\sigma}{1 - \sum_{k=1}^{P} a_k z^{-k}}$
– H(z) is the overall transfer function
– Glottal Flow G(z), Vocal Tract V(z), Radiation R(z), Gain $\sigma$
Synthesis process: u[n] → $H(z) = \frac{\sigma}{A(z)}$ → s[n]
u[n]: the vocal tract input, s[n]: speech output
All-Pole Model and Linear Prediction
$\frac{S(z)}{U(z)} = \frac{\sigma}{A(z)} = \frac{\sigma}{1 - \sum_{k=1}^{P} a_k z^{-k}}$
$S(z) = \sum_{k=1}^{P} a_k S(z) z^{-k} + \sigma U(z)$
$s[n] = \sum_{k=1}^{P} a_k s[n-k] + \sigma u[n] = \hat{s}[n] + e[n]$
Here $\hat{s}[n] = \sum_{k=1}^{P} a_k s[n-k]$ is a linear prediction of order P for s[n]
where $e[n] = s[n] - \hat{s}[n]$ is the prediction error sequence
[Diagram: s[n] → P(z) → $\hat{s}[n]$; e[n] = s[n] − $\hat{s}[n]$]
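The relation between prediction and the all-pole model can be sketched as an analysis filter followed by a synthesis filter. This is a minimal sketch with an assumed second-order predictor and zero initial conditions; `lp_analysis` and `lp_synthesis` are illustrative names.

```python
import math

def lp_analysis(s, a):
    """Prediction error: e[n] = s[n] - sum_k a_k s[n-k], with a = [a_1, ..., a_P]."""
    order = len(a)
    return [s[n] - sum(a[k] * s[n - 1 - k] for k in range(order) if n - 1 - k >= 0)
            for n in range(len(s))]

def lp_synthesis(e, a):
    """All-pole filter: s[n] = sum_k a_k s[n-k] + e[n]."""
    order = len(a)
    s = []
    for n in range(len(e)):
        prediction = sum(a[k] * s[n - 1 - k] for k in range(order) if n - 1 - k >= 0)
        s.append(prediction + e[n])
    return s

A = [1.3, -0.6]                                   # assumed predictor coefficients
speech_like = [math.sin(0.2 * n) + 0.5 * math.sin(0.9 * n) for n in range(200)]
residual = lp_analysis(speech_like, A)
rebuilt = lp_synthesis(residual, A)
impulse_response = lp_synthesis([1.0] + [0.0] * 4, A)
```

Feeding the unquantized residual back through the synthesis filter recovers the input, which is why a coder only needs to transmit the coefficients and a description of e[n].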
Model-based Coding
Linear Prediction Coder (LPC)
– LPC Vocoder (voice coder): divide speech into frames (several tens of milliseconds) and encode the LPC coefficients of each frame
– Additional parameters to facilitate synthesis: voiced/unvoiced flag, gain, pitch (for voiced)
– Line Spectrum Pair (LSP) Coding
Hybrid Coding: LPC Residual Coding
– Between LPC and waveform coding
Line Spectrum Pair (LSP) Coding
Pros and Cons of LPC method
– Good performance at coding rate down to 2.4kbps
– Synthesized voice becomes unnatural when below 2.4kbps
– When the poles are near the unit circle, quantization of the LPC coefficients may result in instability.
LSP parameters
– LSP are frequencies extracted from polynomials constructed from LPC coefficients
– Frequency domain features (similar to formant) => produce less distortion due to quantization
[See details in Design Project on Speech]
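One common construction of the LSP polynomials, sketched here under stated assumptions, is $P(z) = A(z) + z^{-(P+1)}A(z^{-1})$ and $Q(z) = A(z) - z^{-(P+1)}A(z^{-1})$ with $A(z) = 1 - \sum_k a_k z^{-k}$. The sketch locates their unit-circle roots by scanning a frequency grid for sign changes and refining with bisection. It is not taken from the design project; the grid size and the second-order test coefficients are assumptions.

```python
import math

def lsp_frequencies(a, grid=2048):
    """Return the line spectral frequencies (radians in (0, pi)) for a = [a_1..a_P]."""
    poly = [1.0] + [-ak for ak in a] + [0.0]          # A(z), padded to degree P+1
    m = len(poly) - 1
    p_sym = [poly[k] + poly[m - k] for k in range(m + 1)]    # symmetric P(z)
    q_anti = [poly[k] - poly[m - k] for k in range(m + 1)]   # antisymmetric Q(z)

    # On the unit circle each polynomial reduces to a real function of frequency.
    def f_p(w):
        return sum(p_sym[k] * math.cos((m / 2.0 - k) * w) for k in range(m + 1))

    def f_q(w):
        return sum(q_anti[k] * math.sin((m / 2.0 - k) * w) for k in range(m + 1))

    roots = []
    points = [math.pi * (i + 0.5) / grid for i in range(grid)]   # skips 0 and pi
    for f in (f_p, f_q):
        for lo, hi in zip(points, points[1:]):
            if f(lo) * f(hi) < 0.0:
                for _ in range(60):                   # bisection refinement
                    mid = 0.5 * (lo + hi)
                    if f(lo) * f(mid) <= 0.0:
                        hi = mid
                    else:
                        lo = mid
                roots.append(0.5 * (lo + hi))
    return sorted(roots)

lsf = lsp_frequencies([1.3, -0.6])
```

For this predictor, P(z) factors as $(1 + z^{-1})(1 - 1.7z^{-1} + z^{-2})$ and Q(z) as $(1 - z^{-1})(1 - 0.9z^{-1} + z^{-2})$, so the two frequencies are $\arccos(0.85)$ and $\arccos(0.45)$. The roots at $z = \pm 1$ are excluded because they carry no information.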
Hybrid Coding
“Hybrid” – between LPC and waveform coding
– LPC Residual Coding: encode and slowly update LPC coefficients, and send the LPC residual (e.g. encoded using Vector Quantization)
Advantages:
– Free from quality degradation due to source modeling
– Low-frequency waveform is exactly reproduced
– Spectral information of the entire frequency range is preserved
– No need of pitch period estimation and voiced/unvoiced decision
Code-Excited Linear Predictive Coding (CELP)
Multipulse-Excited Linear Predictive Coding (MPC)
– Do not distinguish voiced/unvoiced sound explicitly
Code-Excited Linear Predictive Coding (CELP)
– Replace the multi-pulses of MPC with vector-quantized sequences based on long-term prediction of periodicity and short-term prediction
Figure 6.32 of Furui’s book
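The codebook idea can be illustrated with a heavily simplified search: each candidate excitation vector is passed through the short-term synthesis filter, and the vector and gain that best match the target frame are kept. This sketch omits the long-term (pitch) predictor and any perceptual weighting, and the random codebook, frame length, and filter coefficients are assumptions.

```python
import random

def synthesize(excitation, a):
    """Short-term all-pole synthesis: s[n] = sum_k a_k s[n-k] + excitation[n]."""
    out = []
    for n, e in enumerate(excitation):
        prediction = sum(a[k] * out[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        out.append(prediction + e)
    return out

def search_codebook(target, codebook, a):
    """Return (index, gain, squared error) of the best-matching codeword."""
    best = None
    for index, codeword in enumerate(codebook):
        y = synthesize(codeword, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, y)) / energy   # least-squares gain
        error = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best

rng = random.Random(7)
A = [1.3, -0.6]                                                  # assumed LPC coefficients
CODEBOOK = [[rng.gauss(0.0, 1.0) for _ in range(40)] for _ in range(16)]
target = [2.5 * v for v in synthesize(CODEBOOK[3], A)]
index, gain, error = search_codebook(target, CODEBOOK, A)
```

Only the codebook index and gain would need to be transmitted for the frame, alongside the filter coefficients.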
Speech Coding Methods
– Waveform coding; Hybrid coding; Analysis-synthesis coding
Table 6.1 of Furui’s book
Speech Quality vs. Transmission Rate
Figure 6.2 of Furui’s book
Comparison of Different Speech Coding Tech.
Table 6.2 of Furui’s book
Put Together: A Digital Telephone System
– 8kHz and 8-bit per sample for telephone speech => 64kbps
– Anti-aliasing filter before sampling
– Non-uniform quantization (e.g., through $\mu$-law or A-law companding ~ signal compression-expansion)
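The 64 kbps figure is the product of bits per sample and sampling rate. As a small worked check, with an assumed 16-bit, 44.1 kHz single-channel audio stream added only for comparison:

```python
def bit_rate_bps(bits_per_sample, sampling_rate_hz):
    """Bit rate of an uncompressed single-channel PCM stream."""
    return bits_per_sample * sampling_rate_hz

telephone_bps = bit_rate_bps(8, 8000)       # telephone speech
audio_bps = bit_rate_bps(16, 44100)         # assumed comparison point, one channel
print(telephone_bps / 1000, "kbps")
```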
Speech Synthesis
Speech Synthesis
Speech synthesis: a process that artificially produces speech
– Articulatory synthesis, Formant synthesis, and LPC synthesis
– Issues other than synthesizer structure: text analysis, etc.
Figure 7.2 of Furui’s book
Comparison of Synthesis Methods
UMCP ENEE408G Slides (created by R.Liu © 2002)
Table 7.1 of Furui’s book
Text-to-Speech Conversion System
=> See more in Design Project and try it out
Figure 7.8 of Furui’s book
Analysis/Synthesis
Naturally spoken utterance
Synthesized utterance
Human Computer Interface/Interaction (HCI)
Multi-modal multimedia communications and interactions
– Info. & interface through speech/audio, image/video, graphics, etc.
Building blocks for speech based HCI
– Speech recognition and speaker identification
– Natural language understanding
– (Speech synthesis)
– Examples: voice command, dictation; Question-and-Answer for intelligent customer service, voice-based info. retrieval, call routing, ...
Enhance speech-based HCI with graphics: “talking head”
=> See more in Design Project and try it out
Summary
Speech production and analysis
– Spectrogram; Pitch, Formant
– Linear prediction model
Speech coding
– Basic compression tools
Speech Synthesis
This week’s Lab session:
– Design project#1 on Speech
Next lecture: speech recognition