LTU 020508: Speech coding overview

Uen Rev PA1 2002-05-07

Jonas Svedberg

AWARE, Advanced Wireless Algorithm Research and Experiments

Multimedia Technologies, Ericsson Research

• Who am I

• M.Sc. in Engineering Physics (LTU, 1993)
• Research Engineer at Ericsson Research (1993- )
• Still working with speech coding
– Development
– Verification
– Standardization (Europe/Japan/USA, ETSI/ARIB/3GPP2)

Basic paradigm: trade bits for processing/transmission power

Compression
• Lossless coding: Zip, shorten, …
• Lossy coding: speech (CELP), audio (mp3, AAC), video (MPEG-1, MPEG-2)

• Compression -> fewer bits to transmit
• Fewer bits to transmit -> less channel interference
• Less channel interference -> more users in one cell
• More users in one cell -> money to the operator

[Figure: interference limit]

• Compression -> fewer bits to transmit
• Fewer bits to transmit -> speech sent on smaller channels
• Smaller channels -> more channels
• More channels -> more traffic
• More traffic -> more money to the operator

Example: Multiplexing gains
• PSTN: 64 kbps PCM -> 8 x 8 kbps ITU-T G.729, 8 x revenue
• Cellular: 13 kbps GSM FR -> 2 x [email protected], 2 x revenue
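The PSTN multiplexing gain above is a plain division of channel rate by codec rate. A minimal sketch follows; the function name `multiplex_gain` is illustrative and does not come from the slides.

```python
def multiplex_gain(channel_kbps: float, codec_kbps: float) -> int:
    """Number of coded speech streams that fit into one original channel."""
    return int(channel_kbps // codec_kbps)


# PSTN example from the slide: one 64 kbps PCM channel, 8 kbps G.729 streams.
pstn_gain = multiplex_gain(64, 8)
print(pstn_gain)
```

The integer division reflects that only whole speech streams can be carried; any leftover capacity is ignored in this simplified view.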

• Use a mixture of modeling and waveform compression to re-create the original sound as accurately as possible without introducing artifacts, using as few bits as possible.
• Audio coding (complex input signals): mainly uses characteristics of the auditory system
• Speech coding: uses model characteristics from the speech production organs and auditory system characteristics

• Conversational speech is delay sensitive
– Speech coding algorithms use small buffers, typically 20-30 ms
• Conversational speech is by nature a real-time service
– Speech coding is (was) complexity limited
• Neither of these constraints applies to the normal audio coding application

• First generation (64-32 kbit/s), 1972-1986: waveform coding, very low complexity, wireline quality
• Second generation (16-8 kbps), 1986-1992: first cellular codecs, low complexity, communications quality
• Third generation (13-4.0 kbps), 1993- : second generation cellular/wireline, medium to high complexity, wireline to communications quality
• Fourth generation (8.0-2.0 kbps): to be developed


• Model the speech production apparatus for efficient compression
• Only code and transmit perceptually relevant information
• Hide quantization noise as much as possible
• Combine algorithms to keep artifacts at an absolute minimum
• Cellular algorithms:
– Also need to provide robustness against errors, which limits the use of error-sensitive backward prediction


• Input signal
– Talker + acoustic interference (noise, strange non-speech inputs, tones, music)
• Channel
– Unreliable transport -> frame and bit errors
• Decoded speech
– Encoded by PCM or output through a small (lousy) speaker

• Technology evolution makes new standards desirable
• Codecs have been optimized for a given transport channel
• Not all applications have the same quality demands
– E.g. military communication needs intelligibility and encryption
– Satellite communication needs channels
• Political reasons (lower licensing costs)
• One standards organization per continent

• Speech modeling
– Short-term prediction
– Long-term prediction
• Analysis by synthesis
• Error weighting
• Codebooks
– Adaptive
– Fixed
• Post filtering
• Error concealment
• The GSM EFR algorithm


• The vocal tract is physically modeled as concatenated tubes with varying cross-sections
• Mathematically, it is modeled as a lattice filter
• The lattice filter may be converted to an all-pole (IIR) model, H(z) = 1/A(z)
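The link between the lattice filter and the all-pole model can be illustrated with the Levinson-Durbin recursion, which produces both the reflection (lattice) coefficients and the direct-form predictor coefficients from an autocorrelation sequence. This is a minimal sketch, not the method of any particular codec; it assumes the sign convention A(z) = 1 - sum_k a[k] z^-k, and the function name is illustrative.

```python
def levinson_durbin(r, order):
    """Solve the LPC normal equations from autocorrelation values r[0..order].

    Returns (a, reflection, err) where a[k-1] multiplies x[n-k] in the
    predictor, reflection holds one lattice coefficient per stage, and
    err is the final prediction error energy.
    Assumed convention: A(z) = 1 - sum_k a[k] z^-k.
    """
    a = [0.0] * (order + 1)  # a[0] is unused
    reflection = []
    err = r[0]
    for m in range(1, order + 1):
        # Correlation not yet explained by the order m-1 predictor.
        acc = r[m] - sum(a[j] * r[m - j] for j in range(1, m))
        k = acc / err
        reflection.append(k)
        new_a = a[:]
        new_a[m] = k
        for j in range(1, m):
            new_a[j] = a[j] - k * a[m - j]
        a = new_a
        err *= 1.0 - k * k
    return a[1:], reflection, err


# Autocorrelation of a first-order process with coefficient 0.5.
r = [1.0, 0.5, 0.25]
coeffs, refl, err = levinson_durbin(r, 2)
print(coeffs, refl, err)
```

A 10th order predictor would call the same function with `order=10` on the autocorrelation of a windowed speech frame; the sketch omits windowing and any guard against `err` reaching zero.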

• 10th order STP predictor: H(z) = 1/A(z)

• LPC analyzes the speech signal by estimating the formants
• The LPC parameters are transmitted and used as input to LPC synthesis at the receiver end
• Because speech signals vary over time, LPC is done in short frames, normally 25 to 100 frames per second

[Figure: two panels. Waveform; LPC and FFT spectra, amplitude (dB) versus frequency (Hz)]

[Figure: two panels. LPC spectrum, amplitude (dB) versus frequency (Hz); Waveform]

• The vocal cords produce the signal, which is characterized by its intensity (loudness) and frequency (pitch).
• Long-term correlation is represented by the lag. The lag is the number of samples between successive long-term (pitch) periods in a continuous signal.
• Lag values range between 16 and 145 samples, corresponding to the frequency range 500-55 Hz (at 8 kHz sampling).
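The lag range maps to the stated frequency range if the sampling rate is 8 kHz, which is assumed here (8000/16 = 500 Hz, 8000/145 ≈ 55 Hz). The sketch below converts lags to frequencies and runs a simple open-loop lag search by normalized autocorrelation; the function names and the test tone are illustrative only.

```python
import math

FS_HZ = 8000  # assumption: 8 kHz narrowband sampling


def lag_to_pitch_hz(lag, fs_hz=FS_HZ):
    """Convert a pitch lag in samples to a fundamental frequency in Hz."""
    return fs_hz / lag


def find_lag(x, min_lag=16, max_lag=145):
    """Return the lag in [min_lag, max_lag] with the highest normalized correlation."""
    best_lag, best_score = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        num = sum(x[n] * x[n - lag] for n in range(lag, len(x)))
        den = sum(x[n - lag] ** 2 for n in range(lag, len(x)))
        score = num / math.sqrt(den) if den > 0.0 else 0.0
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# A 100 Hz tone has a period of 80 samples at 8 kHz.
tone = [math.sin(2 * math.pi * 100 * n / FS_HZ) for n in range(400)]
lag = find_lag(tone)
print(lag, lag_to_pitch_hz(lag))
```

Real speech is far less regular than a pure tone, so practical searches add measures against pitch doubling and halving that this sketch leaves out.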

[Figure: waveform, amplitude versus time (ms)]

• Different coder structures exist
– Waveform coding (may code almost any input signal)
– Vocoding (speech specific and parametric)
– Hybrid coding (speech specific but with waveform matching capabilities)
– Parametric + hybrid (very low rate codecs switch between optimal structures)
• Currently linear prediction analysis-by-synthesis (LPAS) coders offer the best performance

• DPCM (Differential Pulse Code Modulation)

[Figure: DPCM block diagram with quantizer Q, inverse quantizers Q^-1, predictors P(z), prediction error e, index sent over the channel, input and output]
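The DPCM structure can be sketched with a first-order predictor and a uniform quantizer. This is an illustrative simplification: the predictor coefficient `a`, the step size and the function names are assumptions, and P(z) is reduced to a single tap.

```python
import math


def dpcm_encode(samples, step=0.1, a=0.9):
    """Quantize the difference between each sample and its prediction.

    The prediction uses the previously reconstructed sample, so the
    encoder tracks exactly what the decoder will compute.
    """
    indices = []
    prev = 0.0
    for s in samples:
        pred = a * prev
        idx = round((s - pred) / step)  # quantizer Q
        indices.append(idx)
        prev = pred + idx * step        # local decoder: Q^-1 plus prediction
    return indices


def dpcm_decode(indices, step=0.1, a=0.9):
    """Rebuild the signal from the transmitted quantizer indices."""
    out = []
    prev = 0.0
    for idx in indices:
        prev = a * prev + idx * step
        out.append(prev)
    return out


signal = [math.sin(0.1 * n) for n in range(50)]
decoded = dpcm_decode(dpcm_encode(signal))
max_error = max(abs(s - d) for s, d in zip(signal, decoded))
print(max_error)
```

Because the encoder predicts from reconstructed samples rather than from the input, quantization errors do not accumulate, and the reconstruction error stays within half a quantizer step.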

[Figure: block diagram. Noise source feeding the synthesis filter 1/A(z)]

• Binary model -> sensitive to excitation source selection
– Pulse train model not generic enough
– Noise model only good for unvoiced sounds

[Figure: block diagram. Noise codebook (Noise-CB), pitch delay z^-Lag, synthesis filter 1/A(z)]

[Figure: block diagram. Noise-like excitation, pitch delay z^-Lag with subframe delay, synthesis filter 1/A(z)]

This pitch model is normally referred to as an adaptive codebook (adaptive CB).

• Use a codec with various model parameters
• Try synthesizing signals with different parameter values
• Analyze which are the best parameter values
• Hence the name Linear Prediction Analysis-by-Synthesis
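The try-and-compare loop can be sketched as an exhaustive codebook search: each candidate excitation is passed through the synthesis filter, scaled by its optimal gain, and compared with the target. This is a toy illustration assuming A(z) = 1 - sum_k a[k] z^-k and a zero filter state; the codebook and function names are made up for the example.

```python
def synthesize(excitation, a):
    """All-pole synthesis 1/A(z), zero initial state, A(z) = 1 - sum a[k] z^-k."""
    out = []
    for n, e in enumerate(excitation):
        y = e + sum(a[k] * out[n - 1 - k] for k in range(len(a)) if n - 1 - k >= 0)
        out.append(y)
    return out


def search_codebook(target, codebook, a):
    """Return (index, gain, squared_error) of the best code vector."""
    best = None
    for index, vec in enumerate(codebook):
        synth = synthesize(vec, a)
        energy = sum(s * s for s in synth)
        if energy == 0.0:
            continue
        # Gain that minimizes the squared error for this candidate.
        gain = sum(t * s for t, s in zip(target, synth)) / energy
        err = sum((t - gain * s) ** 2 for t, s in zip(target, synth))
        if best is None or err < best[2]:
            best = (index, gain, err)
    return best


codebook = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, -1, 0]]
lpc = [0.5]
target = [2.0 * s for s in synthesize(codebook[2], lpc)]
print(search_codebook(target, codebook, lpc))
```

Only the winning index and gain need to be transmitted, since the decoder holds the same codebook and filter.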

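[Figure: analysis-by-synthesis loop. A signal generator produces the output, which is compared with the input; the error is minimized (MSE)]

[Figure: analysis-by-synthesis encoder. Noise excitation, pitch delay z^-LAG and synthesis filter 1/A(z) give the synthesis output; the error against the input is minimized; the LP index, pitch/ACB index and innovation/FCB index are sent over the channel]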

• All parameters (synthesis filter, pitch delay, gains) are time-variant
• All parameters are quantized before transmission
• The mean squared error (MSE) in the output is minimized
• Some preprocessing may be needed
– Bandwidth analysis (the codec has bandwidth limitations)
– Speech enhancement (noise suppression)

• Noise is not perceived uniformly
– But the MSE criterion gives a flat quantization noise spectrum
• Rule of thumb: frequency bands with more energy tolerate more error (masking)
• This can be represented as a time-dependent frequency weighting (= filtering)
• W(z) may be implemented using the STP as W(z) = A(z)/A(z/γ)
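A weighting filter built from the short-term predictor can be sketched in the general form W(z) = A(z/γ1)/A(z/γ2); setting γ1 = 1 gives A(z)/A(z/γ2). Replacing z by z/γ scales the k-th coefficient by γ^k. The code assumes A(z) = 1 - sum_k a[k] z^-k; the default values 0.9 and 0.6 are the ones listed in the mode table later in the slides, and the function names are illustrative.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): the k-th coefficient is scaled by gamma**k."""
    return [ak * gamma ** (k + 1) for k, ak in enumerate(a)]


def weighting_filter(x, a, gamma1=0.9, gamma2=0.6):
    """Filter x through W(z) = A(z/gamma1) / A(z/gamma2), zero initial state."""
    num = bandwidth_expand(a, gamma1)
    den = bandwidth_expand(a, gamma2)
    y = []
    for n in range(len(x)):
        v = x[n]
        for k in range(len(a)):
            if n - 1 - k >= 0:
                v -= num[k] * x[n - 1 - k]  # FIR part, A(z/gamma1)
                v += den[k] * y[n - 1 - k]  # IIR part, 1/A(z/gamma2)
        y.append(v)
    return y


impulse_response = weighting_filter([1.0, 0.0, 0.0], [0.8])
print(impulse_response)
```

With γ1 equal to γ2 the numerator and denominator cancel and the filter passes the signal unchanged, which is a quick sanity check for an implementation.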

W(z) = A(z/γ1) / A(z/γ2), 0 < γ2 < γ1 ≤ 1

[Figure: spectra of 1/A(z) and W(z) over 0-4000 Hz, annotated "Higher distortion allowed" and "Error to be reduced here"]

[Figure: the same spectra over 0-4000 Hz with the weighted curves added. Distortion will be moved to formants/high energy regions]

[Figure: analysis-by-synthesis encoder with preprocessing and error weighting W(z). Noise excitation, pitch delay z^-LAG and synthesis filter 1/A(z); the weighted error (WMSE) is minimized]

• WMSE, the weighted mean square error, is minimized
– Weighting is performed in advance
– I.e. the inner minimization loop is still MSE, which allows for fast DSP algorithms
• To avoid glitches, the synthesis and weighting filters have memory (they operate continuously)
• Frames are divided into smaller subframes
– Pitch parameters need to be updated every 2-5 ms
– Complexity is reduced

[Figure: encoder block diagram with preprocessing, LPC analysis, residual via A(z), pitch and FCB branches each passing through 1/A(z) and W(z), weighted input W(z), error and output]

Note: 5 pulse EFR used for sound

• To further enhance quality, post filtering is employed
• Formant post filtering
– Enhances formant structure
– Coding noise is hidden in formant regions
• Pitch structure post filtering
– Encoding noise is hidden by pitch fine structure peaks
• Drawbacks with post filtering:
– Distorts the signal
– Signal is destroyed in tandeming scenarios

To enhance quality in the presence of errors, error concealment is employed

• Frame errors
– Extrapolate LPC model
– Extrapolate excitation memory from adaptive codebook
– Provide some innovation for signal continuity
– Update decoder quantizer states to avoid clicks and pops
• Bit errors (very low BER)
– Stabilize excitation in stationary segments
• Drawbacks with error concealment
– Does not work for onsets
– Only efficient for 30-60 ms error bursts


Note: No pitch contribution is used, to make the signal noisy in this presentation; good loudspeakers are required.

A stronger post filter is needed for a really good demo.


LPC = prev_LPC;
Lag = prev_lag;
g_p = max(prev_g_p, .95);
if (g_p > 0.5) {
    g_c = 0;
} else {
    c(n) = noise;
    g_c = prev_g_c;
}

[Figures: block diagrams of the GSM EFR algorithm]

• Improved speech quality using a multimode approach
• Continuously trade channel and source coding
• Possibility to trade speech quality and capacity smoothly and flexibly by codec mode adaptation
• A link adaptation mechanism is required for measuring channel quality and selecting speech codec modes; it solves problems created by the GSM slow power control loop
• Speech compression improvements
– ACELP based on the GSM EFR and IS-641 codecs
– Lower rates use phase dispersion post filtering
– Noise encoding is improved through WMSE + energy criteria
– LTP parameter encoding improved
– 'Optimal' sorting of speech bits for bit error robustness

[Figure: curves for the AMR-FR envelope, AMR-HR envelope, EFR and HR]

Introduce improvements where they are needed
– Low C/I in FR mode
– High C/I in HR mode

AMR codec modes: MR475, MR515, MR59, MR67, MR74, MR795, MR102, MR122
Cells spanning several modes are listed once, in mode order.

LPC: 23 bit | 26 bit | 27 bit | 26 bit | 38 bit
W. filt: γ1=0.94, γ2=0.6 | γ1=0.9, γ2=0.6
OL-LTP: 1/20 ms | 1 search/10 ms
Search range: 20..143 1) | 18..143
Fractional search: 1/3 | 1/6
CL-LTP: 8-4-4-4 2) | 8-4-8-4 2) | 8-5-8-5 2) | 8-6-8-6 2) | 8-5-8-5 2) | 9-6-9-6 2)
Alg. CB: 2p, 9b/5ms | 2p, 11b | 3p, 14b | 4p, 17b | 8p, 31b | 10p, 35b
CB gain Q: 5b/5ms 3) | 5b/5ms
LTP gain Q: 4b/5ms 3) | 4b/5ms
Gain VQ: 8b/10ms | 6 bit/5ms | 7 bit/5ms | 7b/5ms
APD: Yes | No | Yes mod | No | No
Post filt: IS-641 | EFR
Post HP: IS-641 | IS-641 4)
Bits/frame: 95 | 103 | 118 | 134 | 148 | 159 | 204 | 244

1) MR102 has a different search criterion
2) Bits per 5 ms subframe; high number: absolute coding, low number: delta coding
3) Modified search criterion
4) In case of adaptation
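The bits-per-frame row translates directly into the bit rate of each mode. The sketch assumes 20 ms frames, which is consistent with the four 5 ms subframes in the table; the names are illustrative.

```python
FRAME_MS = 20  # assumption: 20 ms speech frames (four 5 ms subframes)
BITS_PER_FRAME = [95, 103, 118, 134, 148, 159, 204, 244]


def bit_rate_kbps(bits_per_frame, frame_ms=FRAME_MS):
    """Bits per millisecond equals kbit/s."""
    return bits_per_frame / frame_ms


rates = [bit_rate_kbps(bits) for bits in BITS_PER_FRAME]
for bits, rate in zip(BITS_PER_FRAME, rates):
    print(f"{bits} bits/frame -> {rate:.2f} kbps")
```

The resulting rates run from 4.75 kbps for the lowest mode to 12.2 kbps for the highest, matching the three-digit mode names such as MR102 for 10.2 kbps.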


1. EFR (female+male+female+male)

2. AMR (female+male+female+male)

[Figure: C/(I+N) profile es11. Channel C/(I+N) [dB] versus time [s], for DL and UL]

[Figure: sound pressure level versus frequency (Hz). Hearing threshold function of the human auditory system, with NB-speech (200-3600 Hz) and WB-speech (50-7000 Hz) bands]


Bandwidth extension derives
– the high band (e.g. 3400-7000 Hz)
– the low band (e.g. 50-300 Hz)
from the information in the narrowband speech (e.g. 300-3400 Hz).

[Figure: spectrum plot, 0-8000 Hz]

• Applicable to old standards
• Enhanced intelligibility
• Nice feature in phones

• Encoder side input modifications
– Make the signal easier to model/encode
• Decoder side output modification
– Use signal modifications to improve average signal quality
• Variable rate wideband speech
• Extremely low rate coding (>2 kbps)

• "A Practical Handbook of Speech Coders", Goldberg and Riek
• "Speech Coding: A Tutorial Review", Andreas S. Spanias
• comp.speech
• http://www.data-compression.com/
• Ericsson Research, Multimedia Technologies, Jonas Svedberg, [email protected]
