
Speech & Audio Coding TSBK01 Image Coding and Data Compression Lecture 11, 2003 Jörgen Ahlberg.

Page 1: Speech & Audio Coding

TSBK01 Image Coding and Data Compression

Lecture 11, 2003

Jörgen Ahlberg

Page 2: Outline

• Part I - Speech

– Speech

– History of speech synthesis & coding

– Speech coding methods

• Part II – Audio

– Psychoacoustic models

– MPEG-4 Audio

Page 3: Speech Production

• The human vocal apparatus consists of:

– lungs

– trachea (windpipe)

– larynx

• contains two folds of skin called vocal cords, which blow apart and flap together as air is forced through

– oral tract

– nasal tract

Page 4: The Speech Signal

Page 5: The Speech Signal

Page 6: The Speech Signal

Elements of the speech signal:

• spectral resonance (formants, moving)

• periodic excitation (voicing, pitched) + pitch contour

• noise excitation (fricatives, unvoiced, no pitch)

• transients (stop-release bursts)

• amplitude modulation (nasals, approximants)

• timing

Page 7: The Speech Signal

Vowels are characterised by formants and are generally voiced. Tongue and lips: effect of rounding. Examples of vowels: a, e, i, o, u, a, ah, oh. Vibration of vocal cords: male 50 – 250 Hz, female up to 500 Hz. Vowels have on average a much longer duration than consonants. Most of the acoustic energy of a speech signal is carried by vowels.

[Figures: F1-F2 chart; formant positions]

Page 8: History of Speech Coding

• 1939 - Channel vocoder - first analysis-by-synthesis system, developed by Homer Dudley of AT&T labs - VODER

• 1926 - PCM - first conceived by Paul M. Rainey and independently by Alec Reeves (ITT, Paris) in 1937. Deployed in the US PSTN in 1962

[Figure: VODER – the architecture]

Page 9: History of Speech Coding

• 1939 - Channel vocoder - first analysis-by-synthesis system, developed by Homer Dudley of AT&T labs - VODER

• 1926 - PCM - first conceived by Paul M. Rainey and independently by Alec Reeves (ITT, Paris) in 1937. Deployed in the US PSTN in 1962

Page 10: OVE formant synthesis (Gunnar Fant, KTH), 1953

Page 11: History of Speech Coding

• 1939 - Channel vocoder - first analysis-by-synthesis system, developed by Homer Dudley of AT&T labs - VODER

• 1926 - PCM - first conceived by Paul M. Rainey and independently by Alec Reeves (ITT, Paris) in 1937. Deployed in the US PSTN in 1962

• 1957 - μ-law encoding proposed (standardised for the telephone network in 1972 as G.711)

• 1952 - delta modulation proposed, differential PCM invented

• 1974 - ADPCM developed

• 1984 - CELP vocoder proposed (the majority of coding standards for speech signals today use a variation on CELP)

Page 12: Source-filter Model of Speech Production

• The signal from a source is filtered by a time-varying filter with resonant properties similar to those of the vocal tract.

• The gain controls A_V and A_N determine the intensity of the voiced and unvoiced excitation.

• The higher formants are attenuated by 12 dB/octave (due to the nature of our speech organs).

• This is an oversimplified model of speech production. However, it is very often adequate for understanding the basic principles.

Page 13: Speech Coding Strategies

1. PCM

• Invented 1926, deployed 1962.

• The speech signal is sampled at 8 kHz.

• Uniform quantization requires >10 bits/sample.

• Non-uniform quantization (G.711, 1972): each sample x is compressed to a value y before uniform quantization.

• Quantizing y to 8 bits/sample -> 8 kHz × 8 bits = 64 kbit/s.
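The non-uniform quantization step can be illustrated with a small sketch. It assumes the continuous μ-law curve with μ = 255 and a plain uniform 8-bit quantizer; the actual G.711 codec uses a piecewise-linear approximation of this curve, so this is not a bit-exact implementation. All function names are made up for the illustration.

```python
import math

MU = 255.0  # assumption: the mu value commonly associated with G.711 mu-law


def mu_law_compress(x: float, mu: float = MU) -> float:
    """Map x in [-1, 1] to y in [-1, 1] with logarithmic compression."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mu_law_expand(y: float, mu: float = MU) -> float:
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)


def quantize_uniform(y: float, bits: int = 8) -> int:
    """Index of the uniform quantization cell that contains y in [-1, 1]."""
    levels = 2 ** bits
    return min(int((y + 1.0) / 2.0 * levels), levels - 1)


def dequantize_uniform(index: int, bits: int = 8) -> float:
    """Midpoint of the quantization cell with the given index."""
    levels = 2 ** bits
    return (index + 0.5) / levels * 2.0 - 1.0


def companded_codec(x: float) -> float:
    """Compress, quantize to 8 bits, then expand again."""
    y = mu_law_compress(x)
    return mu_law_expand(dequantize_uniform(quantize_uniform(y)))


def uniform_codec(x: float) -> float:
    """Quantize x directly to 8 bits, without companding."""
    return dequantize_uniform(quantize_uniform(x))


BITRATE = 8000 * 8  # samples per second times bits per sample, in bit/s
```

For a quiet sample such as x = 0.01, `companded_codec` lands closer to x than `uniform_codec` does, which is the reason for spending the 8 bits on y rather than on x directly.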

Page 14: Speech Coding Strategies

2. Adaptive DPCM

• Example: G.726 (1974)

• Adaptive predictor based on six previous differences.

• Gain-adaptive quantizer with 15 levels (4 bits/sample at 8 kHz) -> 32 kbit/s.
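The idea of predicting each sample and quantizing only the difference with an adaptive step size can be sketched as follows. This is a deliberately simplified illustration and not G.726: the predictor is just the previous reconstructed sample, and the step-size rule (grow on large codes, shrink on small ones) and its constants are assumptions chosen for readability. Only the 15 quantizer levels (codes -7 to 7) are taken from the slide.

```python
import math


def adapt_step(step: float, code: int) -> float:
    """Hypothetical gain adaptation: widen after large codes, narrow after small ones."""
    if abs(code) >= 6:
        step *= 1.6
    elif abs(code) <= 1:
        step *= 0.8
    return min(max(step, 1e-4), 0.5)


def adpcm_encode(samples, step: float = 0.02):
    """Quantize the prediction error to one of 15 levels (codes -7..7)."""
    codes, pred = [], 0.0
    for x in samples:
        code = max(-7, min(7, round((x - pred) / step)))
        codes.append(code)
        pred += code * step          # same reconstruction the decoder will compute
        step = adapt_step(step, code)
    return codes


def adpcm_decode(codes, step: float = 0.02):
    """Rebuild the signal from the codes, tracking the same step size as the encoder."""
    out, pred = [], 0.0
    for code in codes:
        pred += code * step
        out.append(pred)
        step = adapt_step(step, code)
    return out


# A 200 Hz tone sampled at 8 kHz as a stand-in for speech.
signal = [0.5 * math.sin(2 * math.pi * 200 * n / 8000) for n in range(400)]
codes = adpcm_encode(signal)
decoded = adpcm_decode(codes)
max_error = max(abs(a - b) for a, b in zip(signal, decoded))
```

Because the step size is adapted from the transmitted codes alone, the decoder can follow the encoder without any side information, which is what keeps the rate at 4 bits per sample.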

Page 15: Speech Coding Strategies

3. Model-based Speech Coding

• Advanced speech coders are based on models of how speech is produced:

[Block diagram: excitation source -> vocal tract]

Page 16: An Excitation Source

[Block diagram: noise generator; pulse generator controlled by the pitch]

Page 17: Vocal Tract Filter 1: A Fixed Filter Bank

[Block diagram: bank of band-pass (BP) filters with gains g1, g2, ..., gn]

Page 18: Vocal Tract Filter 2: A Controllable Filter

Page 19: Linear Predictive Coding (LPC)

• The controllable filter is modelled as

$y_n = \sum_{i=1}^{M} a_i y_{n-i} + G\varepsilon_n$

where $\varepsilon_n$ is the input signal and $y_n$ is the output.

• We need to estimate the vocal tract parameters ($a_i$ and $G$) and the excitation parameters (pitch, v/uv).

• Typically the source signal is divided into short segments and the parameters are estimated for each segment.

• Example: The speech signal is sampled at 8 kHz and divided into segments of 180 samples (22.5 ms/segment).
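The filter model can be run directly as a recursion: each output sample is a weighted sum of earlier outputs plus the scaled input. The sketch below synthesises one 180-sample segment from a pulse train (voiced) and from noise (unvoiced). The coefficients, gains and pitch period are hypothetical values picked only to give a stable filter; they are not taken from the lecture.

```python
import random


def lpc_synthesize(a, gain, excitation):
    """y[n] = sum_i a[i-1] * y[n-i] + gain * excitation[n], starting from silence."""
    y = []
    for n, e in enumerate(excitation):
        acc = gain * e
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc += ai * y[n - i]
        y.append(acc)
    return y


def pulse_train(length, period):
    """Voiced excitation: a unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]


def white_noise(length, seed=0):
    """Unvoiced excitation: Gaussian noise from a seeded generator."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]


FS = 8000          # sampling rate in Hz
SEGMENT = 180      # samples per segment
segment_ms = 1000.0 * SEGMENT / FS   # segment duration in milliseconds

a = [1.3, -0.8]    # hypothetical 2nd-order filter with poles inside the unit circle
voiced = lpc_synthesize(a, 1.0, pulse_train(SEGMENT, 80))
unvoiced = lpc_synthesize(a, 0.1, white_noise(SEGMENT))
```

A real coder would use a higher filter order and update the coefficients for every segment; the recursion itself stays the same.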

Page 20: Typical Scheme of an LPC Coder

[Block diagram: noise generator and pulse generator (controlled by the pitch) feeding the vocal tract filter; parameters: v/uv, gain, filter coefficients]

Page 21: Estimating the Parameters

• v/uv estimation

– Based on energy and frequency spectrum.

• Pitch-period estimation

– Look for periodicity, either via the ACF or some other measure, for example one that gives a minimum value when p equals the pitch period.

– Typical pitch periods: 20 - 160 samples.
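One measure that has a minimum at the pitch period is the average magnitude difference function (AMDF). The slide's own formula did not survive in this transcript, so using the AMDF here is an assumption; the search range of 20 to 160 samples is the one quoted above. The test signal is a synthetic sawtooth, not speech.

```python
def amdf(y, p):
    """Average magnitude difference between the signal and itself shifted by p."""
    n = len(y) - p
    return sum(abs(y[k + p] - y[k]) for k in range(n)) / n


def estimate_pitch_period(y, p_min=20, p_max=160):
    """Return the lag in [p_min, p_max] with the smallest AMDF.

    On ties (for example at multiples of the true period) the smallest lag wins,
    because min() keeps the first minimum it encounters.
    """
    return min(range(p_min, p_max + 1), key=lambda p: amdf(y, p))


# Synthetic test signal: a sawtooth that repeats every 57 samples.
PERIOD = 57
sawtooth = [(n % PERIOD) - 28 for n in range(400)]
estimated = estimate_pitch_period(sawtooth)
pitch_hz = 8000 / estimated   # pitch frequency at 8 kHz sampling
```

On real speech the minimum is not exactly zero, and a coder would typically combine this with the v/uv decision before trusting the estimate.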

Page 22: Estimating the Parameters

• Vocal tract filter estimation

– Find the filter coefficients that minimize the error

$e_n^2 = \left( y_n - \sum_{i=1}^{M} a_i y_{n-i} - G\varepsilon_n \right)^2$

– Compare to the computation of optimal predictors (Lecture 7).

Page 23: Estimating the Parameters

• Assuming a stationary signal:

$\mathbf{a} = \mathbf{R}^{-1}\mathbf{p}$

where R and p contain ACF values.

• This is called the autocorrelation method.
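Because R is built from ACF values only, entry (i, j) depends just on |i - j|, and the linear system relating R, p and the coefficients can be solved without a general matrix inversion. The sketch below uses the Levinson-Durbin recursion for this. That choice is an assumption of this example; the slides only state that R and p contain ACF values. As a by-product the recursion yields the partial correlation (PARCOR) coefficients.

```python
def autocorr(y, max_lag):
    """Biased ACF estimate r[0..max_lag] of one segment."""
    n = len(y)
    return [sum(y[k] * y[k - lag] for k in range(lag, n)) / n
            for lag in range(max_lag + 1)]


def levinson_durbin(r, order):
    """Solve R a = p with R[i][j] = r[|i - j|] and p[i] = r[i + 1].

    Returns (a, parcor, err): predictor coefficients a[0..order-1] (a[i] is a_{i+1}),
    the PARCOR coefficients, and the final prediction error power.
    """
    a = [0.0] * order
    parcor = []
    err = r[0]
    for m in range(order):
        # Correlation not yet explained by the order-m predictor.
        acc = r[m + 1] - sum(a[i] * r[m - i] for i in range(m))
        k = acc / err
        parcor.append(k)
        new_a = a[:]
        new_a[m] = k
        for i in range(m):
            new_a[i] = a[i] - k * a[m - 1 - i]
        a = new_a
        err *= 1.0 - k * k
    return a, parcor, err


# ACF of a first-order process with correlation 0.5 between neighbouring samples.
r = [1.0, 0.5, 0.25]
coeffs, parcor, err = levinson_durbin(r, 2)
```

For this ACF the second coefficient comes out as zero: one previous sample already carries all the predictable information.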

Page 24: Estimating the Parameters

• Alternatively, in case of a non-stationary signal, the ACF values are replaced by sums of products $y_{n-i} y_{n-j}$ taken over the current segment.

• This is called the autocovariance method.

Page 25: Example

• Coding of parameters using LPC10 (1984):

v/uv               1 bit
Pitch              6 bits
Voiced filter      46 bits
Unvoiced filter    46 bits
Synchronization    1 bit
Sum:               54 bits -> 2.4 kbit/s
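The sum and the bit rate can be checked by hand. The working below assumes that a segment carries either the voiced or the unvoiced filter field, not both, and that a segment lasts 22.5 ms (180 samples at 8 kHz, as in the LPC example earlier in the lecture).

```latex
\[
\underbrace{1}_{\text{v/uv}} + \underbrace{6}_{\text{pitch}}
+ \underbrace{46}_{\text{filter}} + \underbrace{1}_{\text{sync}}
= 54\ \text{bits per segment},
\qquad
\frac{54\ \text{bits}}{22.5\ \text{ms}} = 2400\ \text{bit/s} = 2.4\ \text{kbit/s}
\]
```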

Page 26: The Vocal Tract Filter

• Different representations:

– LPC parameters

– PARCOR (Partial Correlation Coefficients)

– LSF (Line Spectrum Frequencies)

Page 27: Code Excited Linear Prediction Coding (CELP)

Encoding:

• LPC analysis -> V(z)

• Define perceptual weighting filter. This permits more noise at formant frequencies, where it will be masked by the speech

• Synthesise speech using each codebook entry in turn as the input to V(z)

• Calculate optimum gain to minimise perceptually weighted error energy in speech frame

• Select codebook entry that gives lowest error

• Transmit LPC parameters and codebook index

Decoding:

• Receive LPC parameters and codebook index

• Re-synthesise speech using V(z) and codebook entry

Performance:

• 16 kbit/s: MOS=4.2, Delay=1.5 ms, 19 MIPS

• 8 kbit/s: MOS=4.1, Delay=35 ms, 25 MIPS

• 2.4 kbit/s: MOS=3.3, Delay=45 ms, 20 MIPS
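The codebook search in the encoder can be sketched in a few lines. This is a minimal illustration under several assumptions: the perceptual weighting filter is left out (the plain squared error is minimised instead), V(z) is a first-order all-pole filter with a made-up coefficient, and the three-entry codebook is hypothetical. For each entry the optimum gain is the projection of the target frame onto the synthesised entry.

```python
def synthesize(a, excitation):
    """Run the excitation through the all-pole filter V(z) defined by coefficients a."""
    y = []
    for n, e in enumerate(excitation):
        acc = e
        for i, ai in enumerate(a, start=1):
            if n - i >= 0:
                acc += ai * y[n - i]
        y.append(acc)
    return y


def celp_search(target, a, codebook):
    """Try every codebook entry and return (index, gain, error) of the best one."""
    best = None
    for index, entry in enumerate(codebook):
        s = synthesize(a, entry)
        energy = sum(v * v for v in s)
        if energy == 0.0:
            continue  # an all-zero entry cannot match anything
        # Gain that minimises sum((target - gain * s)^2) for this entry.
        gain = sum(t * v for t, v in zip(target, s)) / energy
        error = sum((t - gain * v) ** 2 for t, v in zip(target, s))
        if best is None or error < best[2]:
            best = (index, gain, error)
    return best


def celp_decode(a, codebook, index, gain):
    """Decoder side: re-synthesise the frame from the index and the gain."""
    return [gain * v for v in synthesize(a, codebook[index])]


a = [0.5]  # hypothetical LPC coefficient
codebook = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, -1.0], [1.0, 1.0, 1.0, 1.0]]
target = [2.0 * v for v in synthesize(a, codebook[1])]  # frame to be encoded
index, gain, error = celp_search(target, a, codebook)
decoded = celp_decode(a, codebook, index, gain)
```

The exhaustive loop over the codebook is what makes CELP encoders computationally heavy compared with their decoders, which only perform a single synthesis.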

Page 28: Examples

• G.728

– V(z) is chosen as a large FIR filter (M ≈ 50).

– The gain and FIR parameters are estimated recursively from previously received samples.

– The code book contains 127 sequences.

• GSM

– The code book contains regular pulse trains with variable frequency and amplitudes.

• MELP

– Mixed excitation linear prediction

– The code book is combined with a noise generator.

Page 29: Other Variations

• SELP – Self Excited Linear Prediction

• MPLP – Multi-Pulse Excited Linear Prediction

• MBE – Multi-Band Excitation Coding

Page 30: Quality Levels

Quality level             Bandwidth        Bitrate
Broadcast quality         10 kHz           >64 kbit/s
Network (toll) quality    300 – 3400 Hz    16 – 64 kbit/s
Communication quality                      4 – 16 kbit/s
Synthetic quality                          <4 kbit/s

Page 31: Subjective Assessment

• MOS (Mean Opinion Score): the result of averaging the opinion scores of a set of 20 – 60 untrained subjects.

• They rate the quality from 1 to 5 (1-bad, 2-poor, 3-fair, 4-good, 5-excellent).

• A MOS of 4 or higher defines good or toll quality (network quality): the reconstructed signal is generally indistinguishable from the original.

• A MOS between 3.5 – 4.0 defines communication quality (telephone communications).

• A MOS between 2.5 – 3.5 implies synthetic quality.

• In digital communications, speech quality is classified into four general categories: broadcast, network or toll, communications, and synthetic.

• Broadcast wideband speech (high-quality "commentary" speech) is generally achieved at rates above 64 kbit/s.
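Computing a MOS and mapping it to the quality categories is simple arithmetic, sketched below. The slide's ranges touch at 3.5 and 4.0, so assigning a boundary value to the higher category is an assumption of this example, as are the function names and the label for scores below 2.5.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average of listener ratings, each an integer from 1 (bad) to 5 (excellent)."""
    if not ratings or any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)


def quality_category(mos):
    """Map a MOS to the categories listed on the slide."""
    if mos >= 4.0:
        return "network (toll) quality"
    if mos >= 3.5:
        return "communication quality"
    if mos >= 2.5:
        return "synthetic quality"
    return "below synthetic quality"


panel = [4, 5, 4, 3, 4]            # hypothetical ratings from five listeners
panel_mos = mean_opinion_score(panel)
```

With these thresholds, a coder rated MOS = 4.2 falls in the toll-quality band and one rated MOS = 3.3 in the synthetic band.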

Page 32: Subjective Assessment

• DRT (Diagnostic Rhyme Test): listeners should recognise one of the two possible words in a set of rhyming pairs (e.g. meat/heat)

• DAM (Diagnostic Acceptability Measure): trained listeners judge various factors, e.g. muffledness, buzziness, intelligibility

[Figure: quality versus data rate (8 kHz sampling rate)]

