Uen Rev PA1 2002-05-07 1
������������ ����������
Jonas Svedberg
AWARE, Advanced Wireless Algorithm Research and Experiments
Multimedia Technologies, Ericsson Research
• Who am I
Who am I
• M Sc in Engineering Physics (LTU, 1993)
• Research Engineer at Ericsson Research (1993- )
• Still working with speech coding
  – Development
  – Verification
  – Standardization (Europe/Japan/USA, ETSI/ARIB/3GPP2)
Basic paradigm: trade bits for processing/transmission power
• Lossless coding: Zip, shorten, …
• Lossy coding: speech (CELP), audio (mp3, AAC), video (MPEG-1, MPEG-2)
• Compression -> fewer bits to transmit
• Fewer bits to transmit -> less channel interference
• Less channel interference -> more users in one cell
• More users in one cell -> money to the operator
[Figure: diagram marking the interference limit]
• Compression -> fewer bits to transmit
• Fewer bits to transmit -> speech sent on smaller channels
• Smaller channels -> more channels
• More channels -> more traffic
• More traffic -> more money to the operator

Example: Multiplexing gains
• PSTN: 64 kbps PCM -> 8 * 8 kbps ITU-T G.729, 8 x revenue
• Cellular: 13 kbps GSM FR -> 2*[email protected], 2 x revenue
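The PSTN multiplexing gain can be checked with one line of arithmetic. This is only a sketch: the function name `multiplexing_gain` is made up for illustration, and it ignores any signalling or framing overhead.

```python
def multiplexing_gain(channel_kbps: float, codec_kbps: float) -> int:
    """Number of coded speech streams that fit into one channel."""
    return int(channel_kbps // codec_kbps)

# One 64 kbps PCM slot carrying 8 kbps G.729 streams.
pstn_gain = multiplexing_gain(64.0, 8.0)
print(pstn_gain)
```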
• Use a mixture of modeling and waveform compression to re-create the original sound as accurately as possible without introducing artifacts, using as few bits as possible.
Audio coding: complex input signals; mainly uses characteristics of the auditory system
Speech coding: uses model characteristics from the speech production organs and auditory system characteristics
• Conversational speech is delay sensitive
  – Speech coding algorithms use small buffers, typically 20-30 ms
• Conversational speech is by nature a real-time service
  – Speech coding is (was) complexity limited
• Neither of these constraints exists for the normal audio coding application
• First generation (64-32 kbit/s), 1972-1986: waveform coding, very low complexity, wireline quality
• Second generation (16-8 kbit/s), 1986-1992: first cellular codecs, low complexity, communications quality
• Third generation (13-4.0 kbit/s), 1993- : second-generation cellular/wireline, medium to high complexity, wireline to communications quality
• Fourth generation (8.0-2.0 kbit/s): to be developed
• Model the speech production apparatus for efficient compression
• Only code and transmit perceptually relevant information
• Hide quantization noise as much as possible
• Combine algorithms to keep artifacts at an absolute minimum
• Cellular algorithms:
  – Also need to provide robustness against errors, which limits the use of error-sensitive backward prediction
• Input signal
  – Talker + acoustic interference (noise, strange non-speech inputs, tones, music)
• Channel
  – Unreliable transport -> frame and bit errors
• Decoded speech
  – Encoded by PCM or output through a small (lousy) speaker
• Technology evolution makes new standards desirable
• Codecs have been optimized for a given transport channel
• Not all applications have the same quality demands
  – E.g. military communications need intelligibility and encryption
  – Satellite communications need channels
• Political reasons (lower licensing costs)
• One standards organization per continent
• Speech modeling
  – Short-term prediction
  – Long-term prediction
• Analysis by synthesis
• Error weighting
• Codebooks
  – Adaptive
  – Fixed
• Post filtering
• Error concealment
• The GSM EFR algorithm
• The vocal tract is physically modeled as concatenated tubes with varying cross-sections
• Mathematically, it is modeled as a lattice filter
• The lattice filter may be converted to an all-pole (IIR) model: $H(z) = 1/A(z)$, $A(z) = 1 - \sum_{i=1}^{p} a_i z^{-i}$
• 10th-order STP predictor, $H(z) = 1/A(z)$
• LPC analyzes the speech signal by estimating the formants
• The LPC parameters are transmitted and used as input to LPC synthesis at the receiver end
• Because speech signals vary over time, LPC analysis is done in short frames, normally 25 to 100 frames per second
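Frame-wise LPC analysis can be sketched with the autocorrelation method and the Levinson-Durbin recursion. This is a minimal illustration and not the analysis of any particular standard. It makes several assumptions: the sign convention $A(z) = 1 - \sum_i a_i z^{-i}$, an 8 kHz sampling rate, no windowing, and a synthetic test frame. All function names are made up for the example.

```python
import random

def autocorrelation(frame, max_lag):
    """r[k] = sum over n of frame[n] * frame[n - k], for k = 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i - k] for i in range(k, n))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve the LPC normal equations.

    Returns (a, err): a[i-1] is the coefficient a_i of the predictor
    s_hat(n) = sum_i a_i * s(n - i), so A(z) = 1 - sum_i a_i z^-i
    (an assumed sign convention); err is the final prediction error energy.
    """
    a = [0.0] * order
    err = r[0]
    for m in range(order):
        acc = r[m + 1] - sum(a[j] * r[m - j] for j in range(m))
        k = acc / err                      # reflection coefficient
        prev = a[:]
        a[m] = k
        for j in range(m):
            a[j] = prev[j] - k * prev[m - 1 - j]
        err *= 1.0 - k * k
    return a, err

# Demo: one 20 ms frame (160 samples at an assumed 8 kHz) of a noise-driven
# second-order resonator, analysed with a 10th-order predictor.
rng = random.Random(0)
frame = []
for n in range(160):
    s1 = frame[n - 1] if n >= 1 else 0.0
    s2 = frame[n - 2] if n >= 2 else 0.0
    frame.append(1.3 * s1 - 0.8 * s2 + rng.gauss(0.0, 1.0))
coeffs, residual_energy = levinson_durbin(autocorrelation(frame, 10), 10)
```

Real codecs typically window the frame and apply bandwidth expansion before the recursion; both are left out here to keep the sketch short.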
[Figure: speech waveform (top) and "LPC and FFT" spectra, amplitude (dB) vs. frequency (Hz), 0-4000 Hz (bottom)]
[Figure: LPC spectrum, amplitude (dB) vs. frequency (Hz), 0-4000 Hz (top) and waveform (bottom)]
• The vocal cords produce the signal, which is characterized by its intensity (loudness) and frequency (pitch).
• Long-term correlation is represented by the lag: the number of samples between successive long-term periods in a continuous signal.
• Lag values range from 16 to 145 samples, corresponding to the frequency range 500-55 Hz.
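The lag search can be sketched as a normalized autocorrelation maximization over the stated lag range. This is an open-loop illustration only. It assumes an 8 kHz sampling rate (which is what makes lags 16 and 145 correspond to 500 Hz and about 55 Hz), and the names `estimate_lag` and `lag_to_hz` are hypothetical.

```python
import math

def estimate_lag(signal, min_lag=16, max_lag=145):
    """Return the lag in [min_lag, max_lag] with the highest normalized correlation."""
    best_lag, best_score = min_lag, float("-inf")
    n = len(signal)
    for lag in range(min_lag, max_lag + 1):
        cur = signal[lag:]          # current samples
        old = signal[:n - lag]      # samples one lag earlier
        num = sum(c * o for c, o in zip(cur, old))
        den = math.sqrt(sum(c * c for c in cur) * sum(o * o for o in old))
        score = num / den if den > 0.0 else float("-inf")
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def lag_to_hz(lag, sample_rate_hz=8000.0):
    """Pitch frequency corresponding to a lag in samples (8 kHz is assumed)."""
    return sample_rate_hz / lag

# Demo: a 100 Hz tone sampled at 8 kHz has a period of 80 samples.
tone = [math.sin(2.0 * math.pi * n / 80.0) for n in range(320)]
demo_lag = estimate_lag(tone)
```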
[Figure: amplitude vs. time (ms), 0-50 ms]
• Different coder structures exist
  – Waveform coding (may code almost any input signal)
  – Vocoding (speech specific and parametric)
  – Hybrid coding (speech specific but with waveform matching capabilities)
  – Parametric + hybrid: very low rate codecs switch between optimal structures
• Currently, Linear Prediction Analysis-by-Synthesis (LPAS) coders offer the best performance
• DPCM (Differential Pulse Code Modulation)
[Figure: DPCM block diagram: the prediction from P(z) is subtracted from the input to give the error e, which is quantized (Q); the index is sent over the channel; inverse quantizers (Q-1) and predictors P(z) rebuild the output]
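A DPCM encoder and decoder pair can be sketched in a few lines. The first-order predictor, the uniform quantizer, and the parameter values below are assumptions chosen for illustration, not taken from the slides. The key point is that the encoder predicts from its own reconstructed signal, so encoder and decoder stay in step.

```python
import math

def dpcm_encode(samples, step=0.1, a=0.9):
    """Quantize the prediction error; the predictor runs on reconstructed samples."""
    indices = []
    recon_prev = 0.0
    for x in samples:
        pred = a * recon_prev            # P(z): first-order prediction (assumed)
        e = x - pred                     # prediction error
        idx = round(e / step)            # Q: uniform quantizer
        indices.append(idx)
        recon_prev = pred + idx * step   # local decoder (Q^-1 plus predictor)
    return indices

def dpcm_decode(indices, step=0.1, a=0.9):
    """Rebuild the signal from the quantizer indices."""
    out = []
    recon_prev = 0.0
    for idx in indices:
        recon_prev = a * recon_prev + idx * step
        out.append(recon_prev)
    return out

# Demo: the reconstruction error never exceeds half a quantizer step.
signal = [math.sin(2.0 * math.pi * n / 40.0) for n in range(200)]
decoded = dpcm_decode(dpcm_encode(signal))
max_error = max(abs(x - y) for x, y in zip(signal, decoded))
```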
[Figure: noise source driving the synthesis filter 1/A(z)]
• Binary model -> sensitive to excitation source selection
  – Pulse train model not generic enough
  – Noise model only good for unvoiced sounds
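The binary excitation model can be illustrated with a small synthesis routine: a voiced/unvoiced switch selects either a pulse train or noise, and the result is passed through the all-pole filter 1/A(z). The sketch assumes $A(z) = 1 - \sum_i a_i z^{-i}$; the function name and parameter values are hypothetical.

```python
import random

def vocoder_synthesize(a, voiced, n_samples, pitch_lag=80, gain=1.0, seed=0):
    """Binary-excitation LPC synthesis: pulse train (voiced) or noise (unvoiced)."""
    rng = random.Random(seed)
    out = []
    for n in range(n_samples):
        if voiced:
            exc = gain if n % pitch_lag == 0 else 0.0   # one pulse per pitch period
        else:
            exc = gain * rng.gauss(0.0, 1.0)            # white noise
        s = exc
        for i, coef in enumerate(a):                    # all-pole filter 1/A(z)
            if n - 1 - i >= 0:
                s += coef * out[n - 1 - i]
        out.append(s)
    return out

voiced_demo = vocoder_synthesize([0.5], voiced=True, n_samples=6, pitch_lag=4)
unvoiced_demo = vocoder_synthesize([0.5], voiced=False, n_samples=6)
```

Because each frame gets exactly one of the two sources, a wrong voiced/unvoiced decision changes the character of the whole frame, which is the sensitivity noted above.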
[Figure: noise codebook (Noise-CB) and pitch delay (Z-Lag) feeding the synthesis filter 1/A(z)]
[Figure: noise-like excitation added to the excitation delayed by the pitch lag (Z-Lag, subframe delay); the sum drives 1/A(z)]
This pitch model is normally referred to as an adaptive codebook (adaptive CB).
• Use a codec with various model parameters
• Try synthesizing signals with different parameter values
• Analyze which are the best parameter values
• Hence the name Linear Prediction Analysis-by-Synthesis
[Figure: analysis-by-synthesis loop: the output of a signal generator is compared with the input, and the error is minimized in the MSE sense]
[Figure: analysis-by-synthesis encoder: noise and pitch (Z-LAG) excitation through 1/A(z) gives the synthesis output, which is compared with the input; the error is minimized; LP, pitch/ACB and innovation/FCB indices are sent over the channel]
• All parameters (synthesis filter, pitch delay, gains) are time-variant
• All parameters are quantized before transmission
• Mean squared error (MSE) in the output is minimized
• Some preprocessing may be needed
  – Bandwidth analysis (the codec has bandwidth limitations)
  – Speech enhancement (noise suppression)
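The MSE minimization can be sketched as an exhaustive codebook search: each candidate excitation is synthesized through 1/A(z), the optimal gain is computed in closed form, and the entry with the smallest squared error wins. This is a toy illustration with zero filter memory and a made-up three-entry codebook; real codecs use structured codebooks and fast search methods.

```python
def synthesize(excitation, a):
    """Zero-state all-pole synthesis 1/A(z), with A(z) = 1 - sum_i a_i z^-i (assumed)."""
    out = []
    for n, e in enumerate(excitation):
        s = e
        for i, coef in enumerate(a):
            if n - 1 - i >= 0:
                s += coef * out[n - 1 - i]
        out.append(s)
    return out

def search_codebook(target, codebook, a):
    """Return (index, gain, error) of the entry minimizing the squared error."""
    best = None
    for k, code in enumerate(codebook):
        y = synthesize(code, a)
        energy = sum(v * v for v in y)
        if energy == 0.0:
            continue
        gain = sum(t * v for t, v in zip(target, y)) / energy   # optimal gain
        err = sum((t - gain * v) ** 2 for t, v in zip(target, y))
        if best is None or err < best[2]:
            best = (k, gain, err)
    return best

# Demo: the target is entry 2 synthesized and scaled by 2, so the search
# should recover that entry and gain.
toy_codebook = [[1, 0, 0, 0], [0, 1, 0, 0], [1, 0, -1, 0]]
toy_lpc = [0.5]
toy_target = [2.0 * v for v in synthesize(toy_codebook[2], toy_lpc)]
best_index, best_gain, best_error = search_codebook(toy_target, toy_codebook, toy_lpc)
```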
• Noise is not perceived uniformly
  – But the MSE criterion gives a flat quantization noise spectrum
• Rule of thumb: frequency bands with more energy tolerate more error (masking)
• This can be represented as a time-dependent frequency weighting (= filtering)
• W(z) may be implemented using the STP as $A(z)/A(z/\gamma)$
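A weighting filter built from the short-term predictor can be sketched as follows. The sketch assumes the commonly used two-parameter form $W(z) = A(z/\gamma_1)/A(z/\gamma_2)$ with $A(z) = 1 - \sum_i a_i z^{-i}$; setting $\gamma_1 = 1$ gives $A(z)/A(z/\gamma_2)$. Replacing $z$ by $z/\gamma$ simply scales coefficient $a_i$ by $\gamma^i$. The default values 0.94 and 0.6 are example settings, and the function names are hypothetical.

```python
def bandwidth_expand(a, gamma):
    """Coefficients of A(z/gamma): a_i is scaled by gamma**i (i starts at 1)."""
    return [c * gamma ** (i + 1) for i, c in enumerate(a)]

def weighting_filter(x, a, gamma1=0.94, gamma2=0.6):
    """Filter x through W(z) = A(z/gamma1) / A(z/gamma2), zero initial state."""
    num = bandwidth_expand(a, gamma1)
    den = bandwidth_expand(a, gamma2)
    y = []
    for n in range(len(x)):
        acc = x[n]
        for i in range(len(a)):
            if n - 1 - i >= 0:
                acc -= num[i] * x[n - 1 - i]   # FIR part, A(z/gamma1)
                acc += den[i] * y[n - 1 - i]   # IIR part, 1/A(z/gamma2)
        y.append(acc)
    return y

# Demo: impulse response for a first-order predictor with gamma1 = 1.
impulse_response = weighting_filter([1.0, 0.0, 0.0], [0.5], gamma1=1.0, gamma2=0.5)
```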
$W(z) = \dfrac{A(z/\gamma_1)}{A(z/\gamma_2)}, \quad 0 < \gamma_2 < \gamma_1 \le 1$
[Figure: spectra of $1/A(z)$ and $W(z)$, 0-4000 Hz, in dB, with the labels "Higher distortion allowed" and "Error to be reduced here"]
[Figure: the same plot with further curves; distortion will be moved to formants/high-energy regions]
[Figure: analysis-by-synthesis loop with preprocessing and weighting W(z): noise and pitch (Z-LAG) excitation through 1/A(z); the weighted error (WMSE) is minimized]
• WMSE, the weighted mean square error, is minimized
  – Weighting is performed in advance
  – I.e. the inner minimization loop is still MSE, which allows for fast DSP algorithms
• To avoid glitches, the synthesis and weighting filters have memory (they operate continuously)
• Frames are divided into smaller subframes
  – Pitch parameters need to be updated every 2-5 ms
  – Complexity is reduced
[Figure: encoder block diagram: preprocessing, LPC analysis, residual via A(z); the pitch and FCB contributions are each filtered through 1/A(z) and W(z) and subtracted from the weighted input to form the error]
Note: 5-pulse EFR used for sound
• To further enhance quality, post filtering is employed
• Formant post filtering
  – Enhances the formant structure
  – Coding noise is hidden in formant regions
• Pitch structure post filtering
  – Encoding noise is hidden by pitch fine structure peaks
• Drawbacks with post filtering:
  – Distorts the signal
  – The signal becomes destroyed in tandeming scenarios
To enhance quality in the presence of errors, error concealment is employed
• Frame errors
  – Extrapolate the LPC model
  – Extrapolate the excitation memory from the adaptive codebook
  – Provide some innovation for signal continuity
  – Update decoder quantizer states to avoid clicks and pops
• Bit errors (very low BER)
  – Stabilize the excitation in stationary segments
• Drawbacks with error concealment
  – Does not work for onsets
  – Only efficient for 30-60 ms error bursts
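Frame-error handling can be sketched as parameter extrapolation: when a frame is lost, the decoder reuses the previous LPC model and lag and attenuates the gain, so a long burst fades out instead of repeating forever. The `FrameParams` fields and the attenuation factor are assumptions made for illustration; they are not the concealment rules of any specific codec.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class FrameParams:
    lpc: tuple      # short-term predictor coefficients
    lag: int        # pitch lag in samples
    gain: float     # excitation gain

def conceal(prev, attenuation=0.7):
    """Extrapolate a lost frame: keep LPC and lag, attenuate the gain (assumed rule)."""
    return replace(prev, gain=prev.gain * attenuation)

def decode_parameters(frames, initial, attenuation=0.7):
    """frames holds FrameParams for good frames and None for lost ones."""
    out, prev = [], initial
    for frame in frames:
        cur = frame if frame is not None else conceal(prev, attenuation)
        out.append(cur)
        prev = cur
    return out

# Demo: two consecutive lost frames after one good frame.
good = FrameParams(lpc=(0.9,), lag=60, gain=1.0)
decoded_frames = decode_parameters([good, None, None], initial=good, attenuation=0.5)
```

With 20 ms frames, two or three lost frames already span the 30-60 ms range mentioned above, after which the extrapolated signal has little to do with what was actually spoken.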
Note: no pitch contribution is used, to make the signal noisy in this presentation; good loudspeakers are required.
A stronger post filter is needed for a really good demo.
Lag
LPC=prev_LPC
Lag=prev_lag;
��=max(prev_��,.95)
If(��>0.5) {
��=0
} else {
�%�' = noise;
��= prev_��
}
[Figures: GSM EFR (GSM 06.60, 12.2 kbit/s) block diagrams]
AMR (Adaptive Multi-Rate)
• Improved speech quality using a multimode approach
• Continuously trade channel and source coding
• Possibility to trade speech quality and capacity smoothly and flexibly by codec mode adaptation
• A link adaptation mechanism is required for measuring channel quality and selecting speech codec modes; it solves problems created by the slow GSM power control loop
• Speech compression improvements
  – ACELP based on the GSM EFR and IS-641 codecs
  – Lower rates use phase dispersion postfiltering
  – Noise encoding is improved through WMSE + energy criteria
  – LTP parameter encoding improved
  – 'Optimal' sorting of speech bits for bit error robustness
[Figure: AMR-FR envelope, AMR-HR envelope, EFR and HR curves]
Introduce improvements where they are needed:
  – Low C/I in FR mode
  – High C/I in HR mode
AMR codec parameters per mode (values that span several modes are listed once, in column order):
LPC: 23 bit | 26 bit | 27 bit | 26 bit | 38 bit
W. filt: γ1=0.94, γ2=0.6 | γ1=0.9, γ2=0.6
OL-LTP: 1/20ms | 1 search/10ms
Search range: 20..143 1) | 18..143
Fractional search: 1/3 | 1/6
CL-LTP: 8-4-4-4 2) | 8-4-8-4 2) | 8-5-8-5 2) | 8-6-8-6 2) | 8-5-8-5 2) | 9-6-9-6 2)
Alg. CB: 2p, 9b/5ms | 2p, 11b | 3p, 14b | 4p, 17b | 8p, 31b | 10p, 35b
CB gainQ: 5b/5ms 3) | 5b/5ms
LTP gainQ: VQ 8b/10ms | VQ 6bit/5ms | VQ 7bit/5ms | 4b/5ms 3) | VQ 7b/5ms | 4b/5ms
APD: Yes | No | Yes mod | No | No
Post filt: IS-641 | EFR
Post HP: IS-641 | IS-641 4)
Bits/frame: 95 | 103 | 118 | 134 | 148 | 159 | 204 | 244
1) MR102 has a different search criterion
2) Bits for each 5 ms subframe; the high number is absolute coding, the low number delta coding
3) Modified search criterion
4) In case of adaptation
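The bits-per-frame row maps directly onto the source bit rates once a frame length is fixed. The sketch below assumes the 20 ms AMR speech frame; the function name is made up for the example.

```python
FRAME_MS = 20  # assumed speech frame length in milliseconds
BITS_PER_FRAME = [95, 103, 118, 134, 148, 159, 204, 244]

def bitrate_kbps(bits_per_frame, frame_ms=FRAME_MS):
    """Bits per millisecond is the same number as kbit/s."""
    return bits_per_frame / frame_ms

RATES_KBPS = [bitrate_kbps(bits) for bits in BITS_PER_FRAME]
print(RATES_KBPS)
```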
1. EFR (female+male+female+male)
2. AMR (female+male+female+male)
[Figure: C/(I+N) profile es11: channel C/(I+N) [dB] vs. time [s], 0-60 s, for DL and UL]
[Figure: sound pressure level vs. frequency (Hz), 100-10000 Hz: hearing threshold function of the human auditory system, with the NB-speech (200-3600 Hz) and WB-speech (50-7000 Hz) bands marked]
Bandwidth extension is
  – the reconstruction of the missing high band (e.g. 3400-7000 Hz) and
  – the reconstruction of the missing low band (e.g. 50-300 Hz)
from the information in the narrowband speech (e.g. 300-3400 Hz).
[Figure: spectrum, 0-8000 Hz]
• Applicable to old standards
• Enhanced intelligibility
• Nice feature in phones
• Encoder-side input modifications
  – Make the signal easier to model/encode
• Decoder-side output modification
  – Use signal modifications to improve average signal quality
• Variable-rate wideband speech
• Extremely low rate coding (>2 kbps)
• "A Practical Handbook of Speech Coders", Goldberg and Riek
• "Speech Coding: A Tutorial Review", Andreas S. Spanias
• comp.speech
• http://www.data-compression.com/
• Ericsson Research, Multimedia Technologies, Jonas Svedberg, [email protected]