Speech & Audio Coding
TSBK01 Image Coding and Data Compression
Lecture 11, 2003
Jörgen Ahlberg
Outline
• Part I - Speech
– Speech
– History of speech synthesis & coding
– Speech coding methods
• Part II – Audio
– Psychoacoustic models
– MPEG-4 Audio
Speech Production
• The human vocal apparatus consists of:
– lungs
– trachea (wind pipe)
– larynx
• contains 2 folds of skin called vocal cords which blow apart and flap together as air is forced through
– oral tract
– nasal tract
The Speech Signal
Elements of the speech signal:
• spectral resonance (formants, moving)
• periodic excitation (voicing, pitched) + pitch contour
• noise excitation (fricatives, unvoiced, no pitch)
• transients (stop-release bursts)
• amplitude modulation (nasals, approximants)
• timing
The Speech Signal
• Vowels - characterised by formants; generally voiced; tongue & lips - effect of rounding.
• Examples of vowels: a, e, i, o, u, ah, oh.
• Vibration of vocal cords: male 50 - 250 Hz, female up to 500 Hz.
• Vowels have on average a much longer duration than consonants.
• Most of the acoustic energy of a speech signal is carried by vowels.
[Figure: F1-F2 chart of formant positions]
[Figure: VODER – the architecture]
History of Speech Coding
OVE formant synthesis (Gunnar Fant, KTH), 1953
History of Speech Coding
• 1939 - Channel vocoder - first analysis-by-synthesis system, developed by Homer Dudley of AT&T labs - VODER
• 1926 - PCM - first conceived by Paul M. Rainey and independently by Alec Reeves (ITT Paris) in 1937. Deployed in US PSTN in 1962
• 1957 - μ-law encoding proposed (standardised for telephone network in 1972 (G.711))
• 1952 - delta modulation proposed, differential PCM invented
• 1974 - ADPCM developed
• 1984 - CELP vocoder proposed (majority of coding standards for speech signal today use a variation on CELP)
Source-filter Model of Speech Production
• The signal from a source is filtered by a time-varying filter with resonant properties similar to those of the vocal tract.
• The gain controls A_V and A_N determine the intensity of the voiced and unvoiced excitation.
• The higher formants are attenuated by 12 dB/octave (due to the nature of our speech organs).
• This is an oversimplified model of speech production. However, it is very often adequate for understanding the basic principles.
Speech Coding Strategies
1. PCM
• Invented 1926, deployed 1962.
• The speech signal is sampled at 8 kHz.
• Uniform quantization requires >10 bits/sample.
• Non-uniform quantization (G.711, 1972)
• Quantizing the companded signal y to 8 bits/sample -> 8 kHz × 8 bits = 64 kbit/s.
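The PCM steps above can be sketched as follows. This is only an illustration under stated assumptions: it uses the continuous μ-law curve with μ = 255 rather than the segmented approximation that G.711 actually specifies, and the function names are hypothetical.

```python
import math

MU = 255.0  # assumption: the usual mu-law constant; G.711 itself uses a segmented curve

def mu_law_compress(x, mu=MU):
    """Map x in [-1, 1] to y in [-1, 1]; small amplitudes get finer resolution."""
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)

def mu_law_expand(y, mu=MU):
    """Inverse of mu_law_compress."""
    return math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)

def quantize_uniform(y, bits=8):
    """Uniform quantizer on [-1, 1]; returns an integer code in 0 .. 2**bits - 1."""
    levels = 2 ** bits
    code = int((y + 1.0) / 2.0 * levels)
    return min(max(code, 0), levels - 1)

def dequantize_uniform(code, bits=8):
    """Reconstruct the midpoint of the quantization cell."""
    levels = 2 ** bits
    return (code + 0.5) / levels * 2.0 - 1.0

def encode(x, bits=8):
    """Non-uniform quantization: compress, then quantize uniformly."""
    return quantize_uniform(mu_law_compress(x), bits)

def decode(code, bits=8):
    return mu_law_expand(dequantize_uniform(code, bits))

SAMPLE_RATE_HZ = 8000
BITS_PER_SAMPLE = 8
bit_rate = SAMPLE_RATE_HZ * BITS_PER_SAMPLE  # 64000 bit/s
```

For a quiet sample such as x = 0.01, `decode(encode(x))` lands closer to x than plain 8-bit uniform quantization does, which is why 8 companded bits can stand in for more than 10 uniform bits.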
Speech Coding Strategies
2. Adaptive DPCM
• Example: G.726 (1974)
• Adaptive predictor based on six previous differences.
• Gain-adaptive quantizer with 15 levels ⇒ 32 kbit/s.
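The idea behind adaptive DPCM can be illustrated with a toy coder. This is not G.726: the predictor here is simply the previous reconstructed sample, and the step-size rule (grow by 1.5, shrink by 0.8) is an assumption chosen for readability. Only the 15-level quantizer (codes -7..7, sent in 4 bits) mirrors the slide.

```python
def _clamp(value, low, high):
    return max(low, min(high, value))

def _adapt(pred, step, code, max_code):
    """State update shared by encoder and decoder so both stay in lockstep."""
    pred += code * step              # reconstruct the sample from the quantized difference
    if abs(code) >= max_code - 1:    # quantizer close to overload: enlarge the step
        step *= 1.5
    elif abs(code) <= 1:             # quantizer barely used: shrink the step
        step *= 0.8
    return pred, _clamp(step, 1e-4, 0.5)

def adpcm_encode(samples, bits=4, step=0.02):
    max_code = 2 ** (bits - 1) - 1   # 7 for 4 bits, i.e. 15 levels
    pred, codes = 0.0, []
    for x in samples:
        code = _clamp(round((x - pred) / step), -max_code, max_code)
        codes.append(code)
        pred, step = _adapt(pred, step, code, max_code)
    return codes

def adpcm_decode(codes, bits=4, step=0.02):
    max_code = 2 ** (bits - 1) - 1
    pred, out = 0.0, []
    for code in codes:
        pred, step = _adapt(pred, step, code, max_code)
        out.append(pred)
    return out

bit_rate = 8000 * 4  # 8 kHz sampling, 4 bits per difference: 32000 bit/s
```

Because the step size is adapted from the transmitted codes only, the decoder can repeat the adaptation without any side information.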
Speech Coding Strategies
3. Model-based Speech Coding
• Advanced speech coders are based on models of how speech is produced:
[Figure: an excitation source feeding the vocal tract]
An Excitation Source
[Figure: a noise generator and a pulse generator controlled by the pitch]
Vocal Tract Filter 1: A Fixed Filter Bank
[Figure: bank of n band-pass (BP) filters with gains g1, g2, ..., gn]
Vocal Tract Filter 2: A Controllable Filter
Linear Predictive Coding (LPC)
• The controllable filter is modelled as
$y_n = \sum_{i=1}^{M} a_i y_{n-i} + G \epsilon_n$
where $\epsilon_n$ is the input signal and $y_n$ is the output.
• We need to estimate the vocal tract parameters ($a_i$ and $G$) and the excitation parameters (pitch, v/uv).
• Typically the source signal is divided into short segments and the parameters are estimated for each segment.
• Example: The speech signal is sampled at 8 kHz and divided into segments of 180 samples (22.5 ms/segment).
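A minimal sketch of the controllable filter, assuming the all-pole recursion y[n] = sum of a[i]·y[n-i] plus G times the excitation, with zero initial state. The helper names and the one-coefficient example filter are hypothetical and only serve to show how a voiced (pulse train) or unvoiced (noise) excitation drives the filter.

```python
import random

def lpc_synthesize(a, gain, excitation):
    """All-pole synthesis: y[n] = sum_{i=1..M} a[i-1] * y[n-i] + gain * excitation[n]."""
    order = len(a)
    y = []
    for n, e in enumerate(excitation):
        acc = gain * e
        for i in range(1, order + 1):
            if n - i >= 0:           # samples before the segment are taken as zero
                acc += a[i - 1] * y[n - i]
        y.append(acc)
    return y

def pulse_train(length, period):
    """Voiced excitation: one unit pulse every `period` samples."""
    return [1.0 if n % period == 0 else 0.0 for n in range(length)]

def white_noise(length, seed=0):
    """Unvoiced excitation: Gaussian noise with a fixed seed for repeatability."""
    rng = random.Random(seed)
    return [rng.gauss(0.0, 1.0) for _ in range(length)]

SEGMENT = 180                          # samples per segment
segment_ms = 1000.0 * SEGMENT / 8000   # 22.5 ms at 8 kHz

# Hypothetical one-pole filter, just to make the recursion visible.
voiced = lpc_synthesize([0.9], 1.0, pulse_train(SEGMENT, 80))
unvoiced = lpc_synthesize([0.9], 0.1, white_noise(SEGMENT))
```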
Typical Scheme of an LPC Coder
[Figure: block diagram with noise generator, pulse generator (pitch), v/uv switch, gain, and vocal tract filter (filter coeffs)]
Estimating the Parameters
• v/uv estimation
– Based on energy and frequency spectrum.
• Pitch-period estimation
– Look for periodicity, either via the autocorrelation function (a.c.f.) or some other measure, for example one that gives a minimum value when p equals the pitch period.
– Typical pitch-periods: 20 - 160 samples.
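One measure with the property described above is the average magnitude difference function (AMDF). Whether the original slide used exactly this measure is an assumption. The sketch searches the typical range of 20 to 160 samples and returns the smallest lag with the minimum value.

```python
def amdf(y, p):
    """Average magnitude difference between the signal and itself delayed by p samples."""
    n_terms = len(y) - p
    return sum(abs(y[n] - y[n - p]) for n in range(p, len(y))) / n_terms

def estimate_pitch_period(y, p_min=20, p_max=160):
    """Return the lag in [p_min, p_max] with the smallest AMDF.

    min() keeps the first minimum it sees, so for a perfectly periodic signal
    the fundamental period wins over its multiples. Requires len(y) > p_max.
    """
    return min(range(p_min, p_max + 1), key=lambda p: amdf(y, p))
```

For a sawtooth that repeats every 50 samples, `estimate_pitch_period` returns 50 because the AMDF is exactly zero at that lag.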
Estimating the Parameters
• Vocal tract filter estimation
– Find the filter coefficients that minimize the error
$e_n^2 = \left( y_n - \left( \sum_{i=1}^{M} a_i y_{n-i} + G \epsilon_n \right) \right)^2$
– Compare to the computation of optimal predictors (Lecture 7).
Estimating the Parameters
• Assuming a stationary signal:
$R a = p$, i.e. $a = R^{-1} p$
where R and p contain acf values.
• This is called the autocorrelation method.
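A sketch of the autocorrelation method, assuming the normal equations take the usual form R a = p with R the Toeplitz matrix of acf values r(0)..r(M-1) and p = (r(1), ..., r(M)). The Levinson-Durbin recursion is used here because it exploits the Toeplitz structure. It is one common way to solve the system, not necessarily the one intended in the lecture.

```python
def autocorr(y, max_lag):
    """Unnormalised acf estimates r(0)..r(max_lag) for one segment."""
    return [sum(y[n] * y[n - k] for n in range(k, len(y)))
            for k in range(max_lag + 1)]

def levinson_durbin(r, order):
    """Solve sum_i a_i r(|i-j|) = r(j), j = 1..order.

    Returns (a, reflection, err): the predictor coefficients a_1..a_order,
    the reflection coefficients of each recursion step, and the final
    prediction error energy. Assumes r[0] > 0 (a non-silent segment).
    """
    a, reflection = [], []
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] - sum(a[i] * r[m - 1 - i] for i in range(m - 1))
        k = acc / err
        # Update the lower-order coefficients, then append the new one.
        a = [a[i] - k * a[m - 2 - i] for i in range(m - 1)] + [k]
        reflection.append(k)
        err *= 1.0 - k * k
    return a, reflection, err
```

With acf values r = [1, 0.5, 0.25], which correspond to a first-order process with coefficient 0.5, the recursion returns a = [0.5, 0.0].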
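Estimating the Parameters
• Alternatively, in case of a non-stationary signal:
$C a = c$
where $C = [c_{ij}]$, $c = [c_{i0}]$ and $c_{ij} = \sum_n y_{n-i} y_{n-j}$, summed over the samples of the current segment.
• This is called the autocovariance method.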
Example
• Coding of parameters using LPC10 (1984):
Parameter          Bits
v/uv               1
Pitch              6
Voiced filter      46
Unvoiced filter    46
Synchronization    1
Sum                54 bits ⇒ 2.4 kbit/s
• Only one of the two filter descriptions (voiced or unvoiced) is sent per segment, so the sum is 1 + 6 + 46 + 1 = 54 bits.
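The bit rate follows directly from the frame length. A short check, assuming the 180-sample segments at 8 kHz mentioned earlier and one 46-bit filter description per segment (the dictionary keys are illustrative names only):

```python
SAMPLE_RATE_HZ = 8000
FRAME_SAMPLES = 180  # 22.5 ms per segment

# Bits sent for one segment; only one filter description is included.
frame_layout = {"v/uv": 1, "pitch": 6, "filter": 46, "sync": 1}

frame_bits = sum(frame_layout.values())                   # 54
bit_rate = frame_bits * SAMPLE_RATE_HZ / FRAME_SAMPLES    # 2400.0 bit/s
```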
The Vocal Tract Filter
• Different representations:
– LPC parameters
– PARCOR (Partial Correlation Coefficients)
– LSF (Line Spectrum Frequencies)
Code Excited Linear Prediction Coding (CELP)
Encoding:
• LPC analysis ⇒ V(z)
• Define perceptual weighting filter. This permits more noise at formant frequencies where it will be masked by the speech
• Synthesise speech using each codebook entry in turn as the input to V(z)
• Calculate optimum gain to minimise perceptually weighted error energy in speech frame
• Select codebook entry that gives lowest error
• Transmit LPC parameters and codebook index
Decoding:
• Receive LPC parameters and codebook index
• Re-synthesise speech using V(z) and codebook entry
Performance:
• 16 kbit/s: MOS=4.2, Delay=1.5 ms, 19 MIPS
• 8 kbit/s: MOS=4.1, Delay=35 ms, 25 MIPS
• 2.4 kbit/s: MOS=3.3, Delay=45 ms, 20 MIPS
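The CELP encoding and decoding steps can be sketched as an exhaustive codebook search. Several simplifications are assumed here: the perceptual weighting filter is omitted (plain squared error is used instead), V(z) is an all-pole filter started from zero state for every frame, and the codebook is random Gaussian. All function names are hypothetical.

```python
import random

def synthesize(a, excitation):
    """All-pole filter V(z): y[n] = sum_i a[i-1] * y[n-i] + excitation[n]."""
    y = []
    for n, e in enumerate(excitation):
        acc = e + sum(a[i - 1] * y[n - i]
                      for i in range(1, len(a) + 1) if n - i >= 0)
        y.append(acc)
    return y

def make_codebook(size, length, seed=1):
    """Illustrative random codebook; real coders use designed codebooks."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(length)] for _ in range(size)]

def celp_search(target, a, codebook):
    """Try every entry, compute its optimal gain, keep the lowest-error one."""
    best_index, best_gain, best_error = None, 0.0, float("inf")
    for index, code in enumerate(codebook):
        synth = synthesize(a, code)
        energy = sum(s * s for s in synth)
        if energy == 0.0:
            continue
        # Least-squares gain for this entry.
        gain = sum(t * s for t, s in zip(target, synth)) / energy
        error = sum((t - gain * s) ** 2 for t, s in zip(target, synth))
        if error < best_error:
            best_index, best_gain, best_error = index, gain, error
    return best_index, best_gain

def celp_decode(a, codebook, index, gain):
    """Decoder: re-synthesise from the received index, gain and LPC parameters."""
    return [gain * s for s in synthesize(a, codebook[index])]
```

The encoder only has to send the LPC parameters, the index and the gain, since the decoder holds the same codebook.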
Examples
• G.728
– V(z) is chosen as a large FIR-filter (M = 50).
– The gain and FIR-parameters are estimated recursively from previously received samples.
– The code book contains 127 sequences.
• GSM
– The code book contains regular pulse trains with variable frequency and amplitudes.
• MELP
– Mixed excitation linear prediction
– The code book is combined with a noise generator.
Other Variations
• SELP – Self Excited Linear Prediction
• MPLP – Multi-Pulse Excited Linear Prediction
• MBE – Multi-Band Excitation Coding
Quality Levels
Quality level             Bandwidth        Bitrate
Broadcast quality         10 kHz           >64 kbit/s
Network (toll) quality    300 – 3400 Hz    16 – 64 kbit/s
Communication quality                      4 – 16 kbit/s
Synthetic quality                          <4 kbit/s
• In digital communications, speech quality is classified into four general categories, namely: broadcast, network or toll, communications, and synthetic.
• Broadcast wideband speech – high quality "commentary" speech – generally achieved at rates above 64 kbit/s.
Subjective Assessment
• MOS (Mean Opinion Score): the result of averaging the opinion scores of a set of between 20 and 60 untrained subjects.
• They rate the quality from 1 to 5 (1-bad, 2-poor, 3-fair, 4-good, 5-excellent).
• A MOS of 4 or higher defines good or toll quality (network quality) - the reconstructed signal is generally indistinguishable from the original.
• A MOS between 3.5 and 4.0 defines communication quality – telephone communications
• A MOS between 2.5 and 3.5 implies synthetic quality
• DRT (Diagnostic Rhyme Test): listeners should recognise one of the two possible words in a set of rhyming pairs (e.g. meat/heat)
• DAM (Diagnostic Acceptability Measure) - trained listeners judge various factors e.g. muffledness, buzziness, intelligibility
Subjective Assessment
[Figure: quality versus data rate (8 kHz sampling rate)]