Speech & Audio Coding TSBK01 Image Coding and Data Compression Lecture 11, 2003 Jörgen Ahlberg.



Outline

• Part I - Speech

– Speech

– History of speech synthesis & coding

– Speech coding methods

• Part II – Audio

– Psychoacoustic models

– MPEG-4 Audio

Speech Production

• The human vocal apparatus consists of:

– lungs

– trachea (wind pipe)

– larynx

• contains 2 folds of skin called vocal cords which blow apart and flap together as air is forced through

– oral tract

– nasal tract


The Speech Signal

Elements of the speech signal:

• spectral resonance (formants, moving)

• periodic excitation (voicing, pitched) + pitch contour

• noise excitation (fricatives, unvoiced, no pitch)

• transients (stop-release bursts)

• amplitude modulation (nasals, approximants)

• timing

The Speech Signal

Vowels are characterised by formants and are generally voiced; the tongue and lips give the effect of rounding. Examples of vowels: a, e, i, o, u, ah, oh. Vibration of the vocal cords: male 50 - 250 Hz, female up to 500 Hz. Vowels have on average a much longer duration than consonants, and most of the acoustic energy of a speech signal is carried by vowels.

F1-F2 chart: formant positions

History of Speech Coding

• 1939 - Channel vocoder - first analysis-by-synthesis system, developed by Homer Dudley of AT&T's Bell Labs - VODER.

• 1926 - PCM - first conceived by Paul M. Rainey and independently by Alec Reeves (AT&T Paris) in 1937. Deployed in the US PSTN in 1962.

VODER - the architecture

OVE formant synthesis (Gunnar Fant, KTH), 1953

History of Speech Coding

• 1957 - μ-law encoding proposed (standardised for the telephone network in 1972 as G.711)

• 1952 - delta modulation proposed, differential PCM invented

• 1974 - ADPCM developed

• 1984 - CELP vocoder proposed (majority of coding standards for speech signal today use a variation on CELP)

Source-filter Model of Speech Production

• A signal from a source is filtered by a time-varying filter with resonant properties similar to those of the vocal tract.

• The gain controls Av and AN determine the intensity of the voiced and unvoiced excitation.

• The higher formants are attenuated by -12 dB/octave (due to the nature of our speech organs).

• This is an oversimplified model of speech production, but it is very often adequate for understanding the basic principles.
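The source-filter idea can be sketched in a few lines of Python. All parameter values here are made up for illustration: one resonance stands in for the whole vocal tract, where a real model would use around ten.

```python
import numpy as np

fs = 8000                       # sampling rate (Hz), as in telephone speech
n = np.arange(fs // 2)          # half a second of samples

# Source: an impulse train at the pitch rate models voiced excitation;
# white noise would model unvoiced (fricative) excitation instead.
pitch_period = 66               # samples, ~121 Hz (male range)
source = np.zeros(len(n))
source[::pitch_period] = 1.0

# Filter: a single resonance ("formant") as a 2-pole IIR section.
f0, bw = 700.0, 100.0           # assumed formant centre frequency and bandwidth (Hz)
r = np.exp(-np.pi * bw / fs)    # pole radius from the bandwidth
a1 = 2 * r * np.cos(2 * np.pi * f0 / fs)
a2 = -r * r

speech = np.zeros(len(source))
for i in range(len(source)):    # y[n] = x[n] + a1*y[n-1] + a2*y[n-2]
    speech[i] = source[i]
    if i >= 1:
        speech[i] += a1 * speech[i - 1]
    if i >= 2:
        speech[i] += a2 * speech[i - 2]
```

The spectrum of the output shows pitch harmonics shaped by the resonance envelope, which is exactly the structure the coders below exploit.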

Speech Coding Strategies

1. PCM

• Invented 1926, deployed 1962.

• The speech signal is sampled at 8 kHz.

• Uniform quantization requires >10 bits/sample.

• Non-uniform quantization (G.711, 1972)

• Quantizing y to 8 bits -> 64 kbit/s.
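The non-uniform quantization can be sketched as follows. Note this uses the continuous μ-law formula; G.711 actually standardises a piecewise-linear 8-segment approximation of this curve, and the uniform 8-bit grid in the companded domain is a simplification.

```python
import numpy as np

MU = 255.0   # companding constant used by G.711 mu-law

def compress(x, mu=MU):
    """Map x in [-1, 1] through the logarithmic mu-law curve."""
    return np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)

def expand(y, mu=MU):
    """Inverse of compress()."""
    return np.sign(y) * ((1.0 + mu) ** np.abs(y) - 1.0) / mu

x = np.linspace(-1.0, 1.0, 2001)
y8 = np.round(compress(x) * 127) / 127   # quantize the companded value to 8 bits
x_hat = expand(y8)
```

Mapped back to the signal domain, the quantizer step is much finer for quiet samples than for loud ones, which matches the amplitude statistics of speech.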

Speech Coding Strategies

2. Adaptive DPCM

• Example: G.726 (1974)

• Adaptive predictor based on six previous differences.

• Gain-adaptive quantizer with 15 levels -> 32 kbit/s.
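The core DPCM idea - quantize the prediction error, not the sample - can be sketched as below. G.726 adapts both the predictor and the quantizer step; this sketch fixes both (the step, predictor coefficient, and test tone are assumed values), showing only the closed-loop structure.

```python
import numpy as np

STEP, A = 0.05, 0.9   # fixed quantizer step and predictor coefficient (assumed)

def dpcm_encode(x):
    """Quantize the prediction error instead of the sample itself.
    The encoder predicts from its own reconstruction, so it stays in
    lock-step with the decoder and quantization errors do not accumulate."""
    recon, codes = 0.0, []
    for sample in x:
        pred = A * recon                         # predict from reconstruction
        code = int(round((sample - pred) / STEP))
        codes.append(code)
        recon = pred + code * STEP               # what the decoder will have
    return codes

def dpcm_decode(codes):
    recon, out = 0.0, []
    for code in codes:
        recon = A * recon + code * STEP
        out.append(recon)
    return np.array(out)

x = np.sin(2 * np.pi * 200 * np.arange(800) / 8000)   # 200 Hz test tone
x_hat = dpcm_decode(dpcm_encode(x))
```

Because the loop is closed, the reconstruction error stays within half a quantizer step per sample.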

Speech Coding Strategies

3. Model-based Speech Coding

• Advanced speech coders are based on models of how speech is produced:

Excitation source -> Vocal tract filter

An Excitation Source

(Figure: a noise generator and a pulse generator with pitch control)

Vocal Tract Filter 1: A Fixed Filter Bank

(Figure: a bank of band-pass filters BP with gains g1, g2, …, gn)

Vocal Tract Filter 2: A Controllable Filter

Linear Predictive Coding (LPC)

• The controllable filter is modelled as

y_n = Σ_i a_i y_{n-i} + G x_n

where x_n is the input (excitation) signal and y_n is the output.

• We need to estimate the vocal tract parameters (a_i and G) and the excitation parameters (pitch, v/uv).

• Typically the source signal is divided into short segments and the parameters are estimated for each segment.

• Example: the speech signal is sampled at 8 kHz and divided into segments of 180 samples (22.5 ms/segment).
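A numerical sanity check of the model equation (the coefficients are made-up values, and M = 2 for readability where a real coder uses around 10):

```python
import numpy as np

fs, frame_len = 8000, 180
frame_ms = frame_len / fs * 1000     # 22.5 ms per segment, as stated above

# Synthesize y from the model itself: y_n = sum_i a_i * y_{n-i} + G * x_n
a = [1.3, -0.8]                      # assumed vocal tract coefficients (M = 2)
G = 0.1
rng = np.random.default_rng(0)
x = rng.standard_normal(1000)        # excitation signal
y = np.zeros(len(x))
for n in range(len(x)):
    y[n] = G * x[n]
    for i, ai in enumerate(a, start=1):
        if n - i >= 0:
            y[n] += ai * y[n - i]

# With the true coefficients, the prediction residual is exactly the
# scaled excitation - this is what parameter estimation tries to achieve.
residual = y[2:] - (a[0] * y[1:-1] + a[1] * y[:-2])
```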

Typical Scheme of an LPC Coder

(Figure: a noise generator and a pulse generator, the latter with pitch control, feed the vocal tract filter; control parameters: v/uv, gain, filter coeffs)

Estimating the Parameters

• v/uv estimation

– Based on energy and frequency spectrum.

• Pitch-period estimation

– Look for periodicity, either via the acf or some other measure, for example the average magnitude difference function

D(p) = Σ_n | y_n - y_{n-p} |

which gives a minimum value when p equals the pitch period.

– Typical pitch periods: 20 - 160 samples.
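A minimal pitch-period search over that lag range might look as follows (the test signal and its period are made-up values; real coders also guard against pitch doubling and halving, which this naive minimum search does not):

```python
import numpy as np

def amdf_pitch(y, p_min=20, p_max=160):
    """Average magnitude difference function D(p) = mean |y[n] - y[n-p]|;
    D dips toward zero when the lag p matches the pitch period."""
    best_p, best_d = p_min, np.inf
    for p in range(p_min, p_max + 1):
        d = np.mean(np.abs(y[p:] - y[:-p]))
        if d < best_d:
            best_p, best_d = p, d
    return best_p

fs = 8000
period = 100                       # 80 Hz fundamental, in the male pitch range
n = np.arange(720)                 # four 180-sample segments
y = np.sin(2 * np.pi * n / period) + 0.3 * np.sin(4 * np.pi * n / period)
```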

Estimating the Parameters

• Vocal tract filter estimation

– Find the filter coefficients a_i that minimize the squared prediction error

ε² = Σ_n ( y_n - Σ_i a_i y_{n-i} )²

– Compare to the computation of optimal predictors (Lecture 7).

Estimating the Parameters

• Assuming a stationary signal, the minimizing coefficients are given by the normal equations

R a = p

where R and p contain acf values.

• This is called the autocorrelation method.
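The autocorrelation method can be sketched directly from the normal equations (a direct solve is used here for clarity; the Levinson-Durbin recursion is the fast way to exploit the Toeplitz structure, and the test coefficients are assumed values):

```python
import numpy as np

def lpc_autocorrelation(y, order):
    """Solve R a = p, where R is the Toeplitz matrix of autocorrelation
    values r(0)..r(order-1) and p = [r(1), ..., r(order)]."""
    r = np.array([np.dot(y[:len(y) - k], y[k:]) for k in range(order + 1)])
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    return np.linalg.solve(R, r[1:order + 1])

# Check against a signal generated with known coefficients:
a_true = [1.3, -0.8]
rng = np.random.default_rng(1)
x = rng.standard_normal(4000)
y = np.zeros(4000)
for n in range(4000):
    y[n] = x[n]
    if n >= 1:
        y[n] += a_true[0] * y[n - 1]
    if n >= 2:
        y[n] += a_true[1] * y[n - 2]

a_hat = lpc_autocorrelation(y, order=2)   # should be close to a_true
```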

Estimating the Parameters

• Alternatively, in case of a non-stationary signal, solve

Φ a = ψ

where the entries φ(i, k) = Σ_n y_{n-i} y_{n-k} are covariances computed over the analysis segment only.

• This is called the autocovariance method.

Example

• Coding of parameters using LPC10 (1984):

v/uv             1 bit
Pitch            6 bits
Voiced filter    46 bits
Unvoiced filter  46 bits
Synchronization  1 bit

Sum: 54 bits -> 2.4 kbit/s
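A quick check of the arithmetic: 54 bits per 22.5 ms frame (each frame carries either the voiced or the unvoiced filter set, not both) works out to 2.4 kbit/s.

```python
fs = 8000                        # samples per second
frame_len = 180                  # samples per frame (22.5 ms)
bits_per_frame = 1 + 6 + 46 + 1  # v/uv + pitch + one filter set + sync = 54

frame_seconds = frame_len / fs             # 0.0225 s
bitrate = bits_per_frame / frame_seconds   # 2400 bit/s
```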

The Vocal Tract Filter

• Different representations:

– LPC parameters

– PARCOR (Partial Correlation Coefficients)

– LSF (Line Spectrum Frequencies)

Code Excited Linear Prediction Coding (CELP)

Encoding:

• LPC analysis -> V(z).

• Define a perceptual weighting filter. This permits more noise at formant frequencies, where it will be masked by the speech.

• Synthesise speech using each codebook entry in turn as the input to V(z).

• Calculate the optimum gain to minimise the perceptually weighted error energy in the speech frame.

• Select the codebook entry that gives the lowest error.

• Transmit the LPC parameters and the codebook index.

Decoding:

• Receive the LPC parameters and codebook index.

• Re-synthesise the speech using V(z) and the codebook entry.

Performance:

• 16 kbit/s: MOS = 4.2, delay = 1.5 ms, 19 MIPS

• 8 kbit/s: MOS = 4.1, delay = 35 ms, 25 MIPS

• 2.4 kbit/s: MOS = 3.3, delay = 45 ms, 20 MIPS
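The analysis-by-synthesis loop can be sketched as below. The codebook, filter coefficients, and frame length are made-up illustrative values, and the perceptual weighting filter is omitted for brevity.

```python
import numpy as np

def synthesise(excitation, a):
    """Run an excitation vector through the all-pole filter V(z) = 1/A(z)."""
    y = np.zeros(len(excitation))
    for n in range(len(excitation)):
        y[n] = excitation[n]
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                y[n] += ai * y[n - i]
    return y

def celp_search(target, codebook, a):
    """Synthesise from every codebook entry, compute the least-squares
    optimal gain for each, and keep the entry with the lowest error energy."""
    best_idx, best_gain, best_err = -1, 0.0, np.inf
    for i, code in enumerate(codebook):
        y = synthesise(code, a)
        gain = np.dot(target, y) / np.dot(y, y)   # optimal gain for this entry
        err = np.sum((target - gain * y) ** 2)
        if err < best_err:
            best_idx, best_gain, best_err = i, gain, err
    return best_idx, best_gain, best_err

rng = np.random.default_rng(0)
codebook = rng.standard_normal((32, 40))   # 32 random 40-sample excitations
a = [1.3, -0.8]                            # assumed synthesis filter coefficients

# Build a target frame from entry 7 with gain 0.5; the search recovers it.
target = 0.5 * synthesise(codebook[7], a)
idx, gain, err = celp_search(target, codebook, a)
```

Real CELP coders additionally weight the error perceptually and use structured (e.g. algebraic) codebooks so the search is far cheaper than this exhaustive loop.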

Examples

• G.728

– V(z) is chosen as a large FIR filter (M ≈ 50).

– The gain and FIR parameters are estimated recursively from previously received samples.

– The code book contains 127 sequences.

• GSM

– The code book contains regular pulse trains with variable frequency and amplitudes.

• MELP - Mixed Excitation Linear Prediction

– The code book is combined with a noise generator.

Other Variations

• SELP – Self Excited Linear Prediction

• MPLP – Multi-Pulse Excited Linear Prediction

• MBE – Multi-Band Excitation Coding

Quality Levels

Quality level           Bandwidth        Bitrate
Broadcast quality       10 kHz           >64 kbit/s
Network (toll) quality  300 - 3400 Hz    16 - 64 kbit/s
Communication quality                    4 - 16 kbit/s
Synthetic quality                        <4 kbit/s

• MOS (Mean Opinion Score): the result of averaging opinion scores from a set of 20 - 60 untrained subjects.

• They rate the quality from 1 to 5 (1-bad, 2-poor, 3-fair, 4-good, 5-excellent).

• A MOS of 4 or higher defines good or toll quality (network quality) - the reconstructed signal is generally indistinguishable from the original.

• A MOS between 3.5 - 4.0 defines communication quality - telephone communications.

• MOS between 2.5 – 3.5 implies synthetic quality

• In digital communications speech quality is classified into four general categories, namely: broadcast, network or toll, communications, and synthetic.

• Broadcast wideband speech – high quality ”commentary” speech – generally achieved at rates above 64 kbits/s.

Subjective Assessment

• DRT (Diagnostic Rhyme Test): listeners must recognise one of two possible words in a set of rhyming pairs (e.g. meat/heat).

• DAM (Diagnostic Acceptability Measure): trained listeners judge various factors, e.g. muffledness, buzziness, intelligibility.

Quality versus data rate (8 kHz sampling rate)

Subjective Assessment