ENEE408G Capstone -- Multimedia Signal Processing (F'05) Digital Speech Processing and Coding...

ENEE408G Capstone -- Multimedia Signal Processing (F'05)

Digital Speech Processing and Coding

Fall’05 Instructor: Carol Espy-Wilson

Electrical & Computer Engineering

University of Maryland, College Park

http://www.ece.umd.edu/class/enee408g/

http://umd.blackbloard.com/

minwu@umd.edu

ENEE408G Spring 2004 Lecture-2

ENEE408G Capstone -- Multimedia Signal Processing (F'05) Lec2 – Introduction 2/4/04 [2]

Last Lecture

Course overview and logistics

Bring multimedia to digital world: sampling & quantization

Introduction to speech processing
– Different aspects of speech

Friday Lab Session
– Speech Processing, Coding, Recognition, & HCI

Today: speech processing, coding, synthesis

Speech Production

Source-Filter View of Speech Production (Stevens 1999)

Source Spectrum

Vocal tract transfer function

Radiation Characteristics

Power spectrum of speech signal

[Figure: the utterance “Sprouted grains and seeds are used in salads and dishes such as chop suey” plotted against Time (sec), 0.0 to 4.0, with a close-up (ticks at 0.1, 0.3, 0.5 sec) segmented as fricative, stop consonant, glide, vowel, stop consonant, vowel]

Phonetic Features (Chomsky & Halle, 1968)

There are three kinds of phonetic features
– Source features determine the kind of excitation signal
– Manner of articulation features determine how open or closed the vocal tract is
– Place of articulation features determine the location of the primary constriction

Source feature “voiced”

/s/: -voiced

/z/: +voiced

Source Feature voiced

[Figure: “Sprouted” against Time (sec), ticks at 0.1, 0.3, 0.5; vertical striations mark the +voiced regions, turbulence marks the -voiced regions]

Glottal Source (Klatt & Klatt 1990)

Modal Voice

Creaky Voice

Breathy Voice

Voice Quality-APP Detector

Manner feature “sonorant”

+sonorant (vowel): primary source at glottis

-sonorant (/z/): primary source above the glottis at alveolar ridge

Manner Feature sonorant

[Figure: “Sprouted” against Time (sec), ticks at 0.1, 0.3, 0.5; low frequency energy marks the +sonorant regions, high frequency energy marks the -sonorant regions]

Place feature for stop consonants

/p/: +labial

/t/: +alveolar

Place Feature Labial vs. Alveolar

[Figure: spectra in dB against Frequency (Hz); labial /b/ and labial /p/ show a falling spectral prominence, alveolar /t/ shows a rising spectral prominence]

Source-Filter Theory

First “speaking machine” in 1930s NY World’s Fair
– 14 keys, 1 wristband, 1 pedal

Modeling speech production as a linear system
– Sound sources: either voiced or unvoiced
   Voiced sound: modeled by a generator of pulses
   Unvoiced sound: modeled by a white noise generator
– Articulation: modeled by a cascade of single-resonance (pole) digital filters

Figure 1 of SPM May ’98 Speech Survey
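The source-filter idea above can be sketched in a few lines of Python: a pulse train (voiced) or white noise (unvoiced) excites a cascade of two-pole resonators. This is only an illustrative sketch; the sampling rate, pitch, and the formant frequencies and bandwidths are hypothetical values, not taken from the slides.

```python
import math
import random

def resonator(x, freq_hz, bandwidth_hz, fs):
    """Single-resonance (two-pole) digital filter, one formant."""
    r = math.exp(-math.pi * bandwidth_hz / fs)   # pole radius (< 1, so stable)
    theta = 2.0 * math.pi * freq_hz / fs         # pole angle
    a1, a2 = 2.0 * r * math.cos(theta), -r * r
    y, y1, y2 = [], 0.0, 0.0
    for sample in x:
        out = sample + a1 * y1 + a2 * y2
        y.append(out)
        y1, y2 = out, y1                         # shift the filter state
    return y

def pulse_train(n, pitch_hz, fs):
    """Voiced source: one unit pulse per pitch period."""
    period = int(round(fs / pitch_hz))
    return [1.0 if i % period == 0 else 0.0 for i in range(n)]

def white_noise(n, seed=0):
    """Unvoiced source: Gaussian white noise."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(n)]

def synthesize(source, formants, fs):
    """Pass the source through a cascade of resonators."""
    y = source
    for freq_hz, bandwidth_hz in formants:
        y = resonator(y, freq_hz, bandwidth_hz, fs)
    return y

fs = 8000                                          # assumed sampling rate (Hz)
formants = [(700, 80), (1200, 90), (2600, 120)]    # hypothetical (frequency, bandwidth) pairs
voiced = synthesize(pulse_train(800, 100, fs), formants, fs)
unvoiced = synthesize(white_noise(800), formants, fs)
```

Switching between `pulse_train` and `white_noise` while keeping the same filter cascade is the voiced/unvoiced decision in this model.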

Linear Separable Model for Speech Production

Vocal tract is modeled as a linear time-varying system
– Parameters of the linear system are slowly varying
– Excited by time-varying source (voiced or unvoiced)

Practical models
– Model each speech frame as Linear Time-Invariant
– Excited by either voiced or unvoiced source
– Allow overlaps in neighbouring frames

Figure 3.2 of Furui’s book
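As a rough sketch of the frame-based practical model, the snippet below cuts a signal into overlapping frames and applies a window to each. The 30 ms frame length, 10 ms hop, Hamming window, and the 200 Hz test tone are assumptions chosen for illustration.

```python
import math

def split_frames(signal, frame_len, hop):
    """Overlapping frames; neighbours share frame_len - hop samples."""
    frames = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frames.append(signal[start:start + frame_len])
    return frames

def hamming(frame_len):
    """Hamming window, used here to taper each frame."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (frame_len - 1))
            for i in range(frame_len)]

fs = 8000                                   # assumed sampling rate (Hz)
signal = [math.sin(2.0 * math.pi * 200 * n / fs) for n in range(fs)]  # 1 s test tone
frame_len = 240                             # 30 ms at 8 kHz (assumed)
hop = 80                                    # 10 ms frame shift (assumed)

frames = split_frames(signal, frame_len, hop)
window = hamming(frame_len)
windowed = [[w * s for w, s in zip(window, frame)] for frame in frames]
```

Each windowed frame would then be treated as the output of one linear time-invariant system with its own parameters.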

Speech Coding

Statistical Properties of Speech (Digital Speech Processing by Rabiner and Schafer)

[Figure panels: Lowpass filtered (0-3400 Hz); Bandpass filtered (200-3400 Hz)]

Digital Coding of Speech

[Figure: bit-rate axis in kbps with ticks at 200, 16, 9.6, 7.2, 4.8, 0.05, running from waveform coding to source coding; quality regions labeled broadcast quality, toll quality, commun. quality, synthetic quality]

Waveform coders: quantize speech samples directly at high bit rates.

Source coders (vocoders): use knowledge of speech production to parameterize the signal (model based)

Hybrid coders: partly waveform based and partly model based (2.4-16 kbps)

Information Capacity $I = B f_s$ ($B$ bits per sample at sampling rate $f_s$)

PCM coding

How to encode a signal into bits?
– Sample, then perform uniform quantization (2 parameters: $\Delta$, the equal quantization step size, and $B$, the # of bits)
   “Pulse Code Modulation” (PCM)
   8 bits per sample ~ good for speech
   16 bits ~ needed for high-quality music

Tradeoff between fidelity and file size

How to “squeeze” out redundancy?

[Figure: Input signal I(x,y) → Sampler → Quantizer → Encoder → transmit; labeled “digitize/capture device”]

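Discussion on Improving PCM (1)

2 parameters: step size $\Delta$, # of bits $B$

Peak-to-peak range is $2X_{\max}$, so $\Delta = \dfrac{2X_{\max}}{2^{B}}$

Assume $e[n] = \hat{x}[n] - x[n]$
– where $e[n]$ is uncorrelated with $x[n]$, and it is uniformly distributed:

$$p_e(e) = \frac{1}{\Delta}, \qquad -\frac{\Delta}{2} \le e < \frac{\Delta}{2}$$

$$\mathrm{SNR}\,(\mathrm{dB}) = 6B + 4.77 - 20\log_{10}\!\left[\frac{X_{\max}}{\sigma_x}\right]$$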
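The SNR rule of thumb for uniform quantization can be checked numerically. The sketch below quantizes a test signal with a B-bit mid-rise uniform quantizer and compares the measured SNR with the formula; the uniformly distributed test signal, the seed, and B = 8 are assumptions made for the illustration, and the code uses 20 log10(2) ≈ 6.02 dB per bit where the slide rounds to 6.

```python
import math
import random

def uniform_quantize(x, x_max, bits):
    """Mid-rise uniform quantizer over [-x_max, x_max] with 2**bits levels."""
    levels = 2 ** bits
    step = 2.0 * x_max / levels                     # step size Delta
    index = int(math.floor((x + x_max) / step))
    index = min(max(index, 0), levels - 1)          # clip to the valid range
    return -x_max + (index + 0.5) * step            # reconstruct at cell centre

def snr_db(signal, quantized):
    """Measured signal-to-quantization-noise ratio in dB."""
    sig = sum(s * s for s in signal)
    err = sum((q - s) ** 2 for s, q in zip(signal, quantized))
    return 10.0 * math.log10(sig / err)

def predicted_snr_db(bits, x_max, sigma_x):
    """SNR formula for uniform quantization (about 6 dB per bit)."""
    return 20.0 * math.log10(2.0) * bits + 4.77 - 20.0 * math.log10(x_max / sigma_x)

rng = random.Random(0)
x_max, bits = 1.0, 8
signal = [rng.uniform(-x_max, x_max) for _ in range(20000)]
sigma_x = math.sqrt(sum(s * s for s in signal) / len(signal))

quantized = [uniform_quantize(s, x_max, bits) for s in signal]
measured = snr_db(signal, quantized)
predicted = predicted_snr_db(bits, x_max, sigma_x)
print(f"measured {measured:.2f} dB, predicted {predicted:.2f} dB")
```

For a signal that uses only a small part of the range (large $X_{\max}/\sigma_x$), the last term lowers the SNR, which is the motivation for non-uniform quantization.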

Uniform quantization (Digital Speech Processing by Rabiner and Schafer)

Uniform quantization gives an inconsistent range of relative errors
– E.g., an error of +/- 2 is 20% at amplitude 10 but only 2% at amplitude 100

Non-uniform quantization
– Assign smaller quantization step size at small amplitude to maintain a consistent range of relative quantization errors over the entire dynamic range
– Can apply non-linear transform before uniform quantization via “companding” (compression-expansion)
   μ-law companding: international standard for 64kbps speech

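$$y[n] = \ln|x[n]|$$

$$x[n] = e^{y[n]}\,\mathrm{sign}(x[n]), \qquad \mathrm{sign}(x[n]) = \begin{cases} +1, & x[n] \ge 0 \\ -1, & x[n] < 0 \end{cases}$$

$$\hat{y}[n] = \ln|x[n]| + \varepsilon[n]$$

$$\hat{x}[n] = e^{\hat{y}[n]}\,\mathrm{sign}(x[n]) = |x[n]|\,\mathrm{sign}(x[n])\,e^{\varepsilon[n]} = x[n]\,e^{\varepsilon[n]}$$

$$\hat{x}[n] \approx x[n]\,(1 + \varepsilon[n]) = x[n] + x[n]\,\varepsilon[n] \qquad (\text{small } \varepsilon[n])$$

$$\hat{x}[n] = x[n] + e[n], \qquad e[n] \approx x[n]\,\varepsilon[n]$$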

Discussion on Improving PCM (1) (Digital Speech Processing by Rabiner and Schafer)

But $\ln[0] \to -\infty$: not practical

μ-law companding:

$$y[n] = X_{\max}\,\frac{\log\!\left[1 + \mu\,\dfrac{|x[n]|}{X_{\max}}\right]}{\log[1 + \mu]}\,\mathrm{sign}(x[n])$$
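A minimal sketch of μ-law companding follows: compress, quantize uniformly with 8 bits, then expand. The value μ = 255 and the test amplitude 0.01 are assumptions for the example (the slides do not state μ), and the uniform quantizer is a simple mid-rise design chosen for illustration.

```python
import math

MU = 255.0   # assumed value of mu; not given in the slides

def mu_law_compress(x, x_max=1.0, mu=MU):
    """y = X_max * log(1 + mu|x|/X_max) / log(1 + mu) * sign(x)."""
    sign = 1.0 if x >= 0 else -1.0
    return x_max * math.log(1.0 + mu * abs(x) / x_max) / math.log(1.0 + mu) * sign

def mu_law_expand(y, x_max=1.0, mu=MU):
    """Inverse of mu_law_compress."""
    sign = 1.0 if y >= 0 else -1.0
    return x_max / mu * ((1.0 + mu) ** (abs(y) / x_max) - 1.0) * sign

def uniform_quantize(x, x_max=1.0, bits=8):
    """Mid-rise uniform quantizer over [-x_max, x_max]."""
    levels = 2 ** bits
    step = 2.0 * x_max / levels
    index = min(max(int(math.floor((x + x_max) / step)), 0), levels - 1)
    return -x_max + (index + 0.5) * step

def relative_error(x, x_hat):
    return abs(x_hat - x) / abs(x)

x = 0.01   # a small-amplitude sample, where uniform quantization does poorly
plain = uniform_quantize(x)
companded = mu_law_expand(uniform_quantize(mu_law_compress(x)))

rel_uniform = relative_error(x, plain)
rel_companded = relative_error(x, companded)
print(f"uniform: {rel_uniform:.1%}, companded: {rel_companded:.1%}")
```

With companding the relative error at small amplitudes drops considerably, at the cost of coarser steps near full scale.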

Discussion on Improving PCM (1): Log Companding (Digital Speech Processing by Rabiner and Schafer)

Quantized PCM values may not be equally likely
– Can we do better than encoding each value using the same # bits?

Example
– P(“0”) = 0.5, P(“1”) = 0.25, P(“2”) = 0.125, P(“3”) = 0.125
– If use same # bits for all values
   Need 2 bits to represent the four possibilities if treated equally
– If use fewer bits for likely values such as “0” ~ Variable Length Codes (VLC)
   “0” => [0], “1” => [10], “2” => [110], “3” => [111]
   Use 1.75 bits on average ~ saves 0.25 bit per sample!

Bring probability into the picture
– Use probability distribution to reduce average # bits per quantized sample
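The 1.75-bit figure can be reproduced directly from the probabilities and code lengths in the example. The sketch below also computes the entropy of the distribution (an addition for comparison, not on the slide) and shows that the code is uniquely decodable by encoding and decoding a short, arbitrarily chosen message.

```python
import math

probabilities = {"0": 0.5, "1": 0.25, "2": 0.125, "3": 0.125}
codebook = {"0": "0", "1": "10", "2": "110", "3": "111"}

def average_length(probabilities, codebook):
    """Expected number of bits per symbol for the given code."""
    return sum(p * len(codebook[s]) for s, p in probabilities.items())

def entropy_bits(probabilities):
    """Entropy of the source in bits per symbol."""
    return -sum(p * math.log2(p) for p in probabilities.values() if p > 0)

def encode(symbols, codebook):
    return "".join(codebook[s] for s in symbols)

def decode(bits, codebook):
    """Prefix-free decoding: emit a symbol as soon as a codeword matches."""
    inverse = {code: symbol for symbol, code in codebook.items()}
    symbols, current = [], ""
    for bit in bits:
        current += bit
        if current in inverse:
            symbols.append(inverse[current])
            current = ""
    return symbols

avg_bits = average_length(probabilities, codebook)   # 1.75
source_entropy = entropy_bits(probabilities)         # 1.75

message = ["0", "1", "0", "3", "2", "0"]              # arbitrary test message
bitstream = encode(message, codebook)
decoded = decode(bitstream, codebook)
```

For this particular distribution the average code length equals the entropy, so no symbol-by-symbol code can do better.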

How to Encode Correlated Sequence?

Consider: high correlation between successive samples

Predictive coding
– Basic principle: Remove redundancy between successive samples and only encode the residual between actual and predicted
– Residue usually has much smaller dynamic range
   Allow fewer quantization levels for the same MSE => get compression
– Compression efficiency depends on intersample redundancy

First try

[Figure: Encoder: the prediction u’P(n) = u(n-1) is subtracted from u(n) to give e(n), which the Quantizer turns into eQ(n). Decoder: eQ(n) is added to the prediction uP(n) = uQ(n-1) to give uQ(n)]

Predictive Coding (cont’d)

Problem with 1st try
– Inputs to the predictor are different at encoder and decoder
   decoder doesn’t know u(n)!
– Mismatch error could propagate to future reconstructed samples

Solution: Differential PCM (DPCM)
– Use quantized sequence uQ(n) for prediction at both encoder and decoder
– Prediction error e(n)
– Quantized prediction error eQ(n)
– Distortion d(n) = e(n) – eQ(n)

[Figure: DPCM. Encoder: uP(n) = uQ(n-1) is subtracted from u(n) to give e(n); the Quantizer outputs eQ(n), and eQ(n) + uP(n) feeds the Predictor. Decoder: uQ(n) = eQ(n) + uP(n), with uP(n) = uQ(n-1)]

Think: what predictor to use?
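A minimal DPCM sketch with the previous-sample predictor uP(n) = uQ(n-1) is shown below. The quantizer step size, the zero initial prediction, and the 200 Hz test tone are assumptions made for the illustration. Because the encoder predicts from the same quantized sequence the decoder sees, the reconstruction error stays within half a step and does not accumulate.

```python
import math

def quantize(value, step):
    """Uniform quantizer for the prediction error."""
    return step * round(value / step)

def dpcm_encode(samples, step):
    """Return the quantized prediction errors eQ(n)."""
    residuals = []
    prediction = 0.0                         # uP(0), assumed initial state
    for u in samples:
        e = u - prediction                   # e(n) = u(n) - uP(n)
        e_q = quantize(e, step)              # eQ(n)
        residuals.append(e_q)
        prediction = prediction + e_q        # uQ(n) becomes uP(n+1)
    return residuals

def dpcm_decode(residuals):
    """Rebuild uQ(n) = uP(n) + eQ(n) with the same predictor."""
    reconstructed = []
    prediction = 0.0
    for e_q in residuals:
        u_q = prediction + e_q
        reconstructed.append(u_q)
        prediction = u_q                     # uP(n+1) = uQ(n)
    return reconstructed

fs = 8000                                    # assumed sampling rate (Hz)
u = [math.sin(2.0 * math.pi * 200 * n / fs) for n in range(400)]
step = 0.05                                  # assumed quantizer step size

residuals = dpcm_encode(u, step)
u_q = dpcm_decode(residuals)
max_error = max(abs(a - b) for a, b in zip(u, u_q))
```

The residuals here span a much smaller range than the signal itself, which is where the bit savings come from.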

Linear Prediction Analysis of Speech

$\{a_k\}$ are called Linear Prediction Coefficients (LPC)

[Figure: Analysis: s[n] → e[n]; Synthesis: e[n] → s[n]]

Error Minimization

$$\min_{\{a_k\}} E\{e^2[n]\} = E\{(s[n] - \hat{s}[n])^2\}$$

Normal equations (matrix form): $S a = \hat{s}$
– Can be solved using the famous Levinson Recursion, which leads to lattice formulation of the linear prediction solution
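One common way to set up and solve the normal equations is the autocorrelation method with the Levinson-Durbin recursion; the sketch below assumes that formulation, which the slides do not spell out. It solves $\sum_k a_k r[|i-k|] = r[i]$ for $i = 1..P$, where $r$ is the autocorrelation of the frame. The test signal is a synthetic second-order autoregressive sequence with hypothetical coefficients.

```python
import random

def autocorrelation(signal, max_lag):
    """r[lag] = sum_n s[n] s[n+lag], for lag = 0..max_lag."""
    n = len(signal)
    return [sum(signal[i] * signal[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the normal equations recursively.

    Returns (a, reflection, error) where a[k-1] is a_k, reflection holds the
    lattice (reflection) coefficients, and error is the final prediction error.
    """
    a = [0.0] * (order + 1)                  # a[0] is unused
    error = r[0]
    reflection = []
    for m in range(1, order + 1):
        acc = r[m] - sum(a[k] * r[m - k] for k in range(1, m))
        k_m = acc / error                    # m-th reflection coefficient
        reflection.append(k_m)
        new_a = a[:]
        new_a[m] = k_m
        for k in range(1, m):
            new_a[k] = a[k] - k_m * a[m - k]
        a = new_a
        error *= (1.0 - k_m * k_m)           # error shrinks at every order
    return a[1:], reflection, error

# Synthetic test signal: s[n] = 1.3 s[n-1] - 0.4 s[n-2] + noise (hypothetical values)
rng = random.Random(0)
s = [0.0, 0.0]
for _ in range(2000):
    s.append(1.3 * s[-1] - 0.4 * s[-2] + rng.gauss(0.0, 1.0))

order = 2
r = autocorrelation(s, order)
coeffs, reflection, error = levinson_durbin(r, order)
```

The reflection coefficients produced along the way are the parameters of the lattice formulation mentioned above.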

Source-Filter View of Speech Production

[Figure: e(t) → v(t) → r(t) → s(t), with spectra E(ω), V(ω), R(ω), S(ω)]

$s(t) = e(t) * v(t) * r(t)$, where $*$ denotes convolution

$S(\omega) = E(\omega)\,V(\omega)\,R(\omega)$

All-Pole Modeling of Speech

Auto-regressive (AR) model: all-pole filter

$$H(z) = G(z)\,V(z)\,R(z) = \frac{1}{A(z)}, \qquad A(z) = 1 - \sum_{k=1}^{P} a_k z^{-k}$$

– H(z) is the overall transfer function
– Glottal Flow G(z), Vocal Tract V(z), Radiation R(z), Gain

Synthesis process: u[n] → H(z) = 1/A(z) → s[n]

u[n]: the vocal tract input, s[n]: speech output

All-Pole Model and Linear Prediction

$$H(z) = \frac{S(z)}{U(z)} = \frac{1}{A(z)} = \frac{1}{1 - \sum_{k=1}^{P} a_k z^{-k}}$$

$$S(z) = \sum_{k=1}^{P} a_k\, S(z)\, z^{-k} + U(z)$$

$$\hat{s}[n] = \sum_{k=1}^{P} a_k\, s[n-k]$$

Here $\hat{s}[n]$ is a linear prediction of order P for s[n]

[Figure: s[n] → P(z) → $\hat{s}[n]$, which is subtracted from s[n] to give e[n]]

where $e[n] = s[n] - \hat{s}[n]$ is the prediction error sequence

$$s[n] = \sum_{k=1}^{P} a_k\, s[n-k] + u[n] = \hat{s}[n] + e[n]$$
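The analysis/synthesis pair can be sketched as two short filters: the inverse filter A(z) turns speech into the prediction error, and the all-pole filter 1/A(z) turns the error back into speech. The coefficients and the test signal below are hypothetical, chosen only so that the all-pole filter is stable.

```python
import math

def lpc_analysis(s, a):
    """Inverse filter A(z): e[n] = s[n] - sum_k a_k s[n-k]."""
    order = len(a)
    e = []
    for n in range(len(s)):
        prediction = sum(a[k - 1] * s[n - k]
                         for k in range(1, order + 1) if n - k >= 0)
        e.append(s[n] - prediction)
    return e

def lpc_synthesis(e, a):
    """All-pole filter 1/A(z): s[n] = sum_k a_k s[n-k] + e[n]."""
    order = len(a)
    s = []
    for n in range(len(e)):
        prediction = sum(a[k - 1] * s[n - k]
                         for k in range(1, order + 1) if n - k >= 0)
        s.append(prediction + e[n])
    return s

a = [1.3, -0.4]        # hypothetical coefficients a_1, a_2 (poles at 0.8 and 0.5)
fs = 8000              # assumed sampling rate (Hz)
speech = [math.sin(2.0 * math.pi * 150 * n / fs)
          + 0.3 * math.sin(2.0 * math.pi * 900 * n / fs) for n in range(200)]

residual = lpc_analysis(speech, a)        # prediction error e[n]
rebuilt = lpc_synthesis(residual, a)      # s[n] recovered from e[n]
max_diff = max(abs(x - y) for x, y in zip(speech, rebuilt))
```

With an unquantized residual the round trip is exact up to floating-point error; a coder gains compression by sending a coarser description of the residual or replacing it with a model excitation.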

Model-based Coding

Linear Prediction Coder (LPC)
– LPC Vocoder (voice coder)
   Divide speech into frames (several tens of milliseconds) and encode the LPC coefficients of each frame
   Additional parameters to facilitate synthesis: voiced/unvoiced flag, gain, pitch (for voiced)
– Line Spectrum Pair (LSP) Coding

Hybrid Coding: LPC Residual Coding
– Between LPC and waveform coding

Line Spectrum Pair (LSP) Coding

Pros and Cons of LPC method
– Good performance at coding rate down to 2.4kbps
– Synthesized voice becomes unnatural when below 2.4kbps
– When the poles are near the unit circle, quantization in LPC coefficients may result in instability.

LSP parameters
– LSP are frequencies extracted from polynomials constructed from LPC coefficients
– Frequency domain features (similar to formant) => produce less distortion due to quantization

[See details in Design Project on Speech]

Hybrid Coding

“Hybrid” – between LPC and waveform coding
– LPC Residual Coding: encode and slowly update LPC coefficients, and send the LPC residual (e.g. encoded using Vector Quantization)

Advantages:
– Free from quality degradation due to source modeling
– Low-frequency waveform is exactly reproduced
– Spectral information of the entire frequency range is preserved
– No need of pitch period estimation and voiced/unvoiced decision

Code-Excited Linear Predictive Coding (CELP)

Multipulse-Excited Linear Predictive Coding (MPC)
– Do not distinguish voiced/unvoiced sound explicitly

Code-Excited Linear Predictive Coding (CELP)
– Replace the multi-pulses of MPC with vector-quantized sequences based on long-term prediction of periodicity and short-term prediction

Speech Coding Methods

– Waveform coding; Hybrid coding; Analysis-synthesis coding

Table 6.1 of Furui’s book

Speech Quality vs. Transmission Rate

Comparison of Different Speech Coding Tech.

Table 6.2 of Furui’s book

Put Together: A Digital Telephone System

– 8kHz and 8-bit per sample for telephone speech => 64kbps

– Anti-aliasing filter before sampling

– Non-uniform quantization (e.g., through μ-law or A-law companding ~ signal compression-expansion)

Speech Synthesis

Speech synthesis: a process that artificially produces speech

– Articulatory synthesis, Formant synthesis, and LPC synthesis

– Issues other than synthesizer structure: text analysis, etc.

Figure 7.2 of Furui’s book

Comparison of Synthesis Methods

Table 7.1 of Furui’s book

Text-to-Speech Conversion System

=> See more in Design Project and try it out

Figure 7.8 of Furui’s book

Analysis/Synthesis

Naturally spoken utterance

Synthesized utterance

Human Computer Interface/Interaction (HCI)

Multi-modal multimedia communications and interactions

– Info. & interface through speech/audio, image/video, graphics, etc.

Building blocks for speech based HCI

– Speech recognition and speaker identification

– Natural language understanding

– (Speech synthesis)

– Examples: voice command, dictation; Question-and-Answer for intelligent customer service, voice-based info. retrieval, call routing, ……

Enhance speech-based HCI with graphics: “talking head”

=> See more in Design Project and try it out

Summary

Speech production and analysis
– Spectrogram; Pitch, Formant
– Linear prediction model

Speech coding
– Basic compression tools

Speech Synthesis

This week’s Lab session:
– Design project #1 on Speech

Next lecture: speech recognition

Assignments

“The Past, Present, and Future of Speech Processing”

“Talk to the Machine”
