SPEECH SIGNAL PROCESSING
KERALA UNIVERSITY M-TECH 1ST SEMESTER

Lizy Abraham
Assistant Professor
[email protected] | +919495123331
Department of ECE
LBS Institute of Technology for Women
(A Govt. of Kerala Undertaking)
Poojappura
Trivandrum - 695012
Kerala, India
SYLLABUS: TSC 1004 SPEECH SIGNAL PROCESSING 3-0-0-3

Speech Production :- Acoustic theory of speech production (Excitation, Vocal tract model for speech analysis, Formant structure, Pitch). Articulatory Phonetics (Articulation, Voicing, Articulatory model). Acoustic Phonetics (Basic speech units and their classification).

Speech Analysis :- Short-Time Speech Analysis, Time domain analysis (Short time energy, short time zero crossing rate, ACF). Frequency domain analysis (Filter banks, STFT, Spectrogram, Formant estimation & analysis). Cepstral Analysis.

Parametric representation of speech :- AR model, ARMA model. LPC analysis (LPC model, Autocorrelation method, Covariance method, Levinson-Durbin algorithm, Lattice form). LSF, LAR, MFCC, Sinusoidal model, GMM, HMM.

Speech coding :- Phase vocoder, LPC, Sub-band coding, Adaptive transform coding, Harmonic coding, Vector quantization based coders, CELP.

Speech processing :- Fundamentals of speech recognition, Speech segmentation, Text-to-speech conversion, Speech enhancement, Speaker verification, Language identification, Issues of voice transmission over the Internet.
REFERENCES

1. Douglas O'Shaughnessy, Speech Communications: Human & Machine, 2nd edition, IEEE Press, 1999. ISBN 0780334493.
2. Nelson Morgan and Ben Gold, Speech and Audio Signal Processing: Processing and Perception of Speech and Music, John Wiley & Sons, July 1999. ISBN 0471351547.
3. Rabiner and Schafer, Digital Processing of Speech Signals, Prentice Hall, 1978.
4. Rabiner and Juang, Fundamentals of Speech Recognition, Prentice Hall, 1994.
5. Thomas F. Quatieri, Discrete-Time Speech Signal Processing: Principles and Practice, 1st edition, Prentice Hall. ISBN 013242942X.
6. Donald G. Childers, Speech Processing and Synthesis Toolboxes, John Wiley & Sons, September 1999. ISBN 0471349593.
For the end-semester exam (100 marks), the question paper shall have six questions of 20 marks each covering the entire syllabus, out of which any five shall be answered. It shall have 75% problems and 25% theory. For the internal marks of 50: two tests of 20 marks each, and 10 marks for assignments (minimum two) / term project.
Speech processing means the processing of discrete-time speech signals.
SPEECH PROCESSING

Speech processing draws on several fields:
• Signal processing: Fourier transforms, discrete-time filters, AR(MA) models, statistical SP, stochastic models
• Information theory: entropy, communication theory, rate-distortion theory
• Phonetics: speech production
• Acoustics: psychoacoustics, room acoustics
• Algorithms (programming)
HOW IS SPEECH PRODUCED?

Speech can be defined as "a pressure acoustic signal that is articulated in the vocal tract".

Speech is produced when air is forced from the lungs through the vocal cords and along the vocal tract. This air flow is referred to as the "excitation signal". The excitation signal causes the vocal cords to vibrate and propagate the energy to excite the oral and nasal openings, which play a major role in shaping the sound produced.

Vocal tract components:
– Oral tract: from the lips to the vocal cords.
– Nasal tract: from the velum to the nostrils.
• Larynx: the source of speech
• Vocal cords (folds): the two folds of tissue in the larynx. They
can open and shut like a pair of fans.
• Glottis: the gap between the vocal cords. As air is forced
through the glottis the vocal cords will start to vibrate and
modulate the air flow.
• The frequency of vibration determines the pitch of the voice (for a male, 50-200 Hz; for a female, up to 500 Hz).
SPEECH PRODUCTION MODEL
Places of articulation: labial, dental, alveolar, post-alveolar/palatal, velar, uvular, pharyngeal, laryngeal/glottal.
CLASSES OF SPEECH SOUNDS

Voiced sounds: the vocal cords vibrate open and closed, producing quasi-periodic pulses of air. The rate of the opening and closing gives the pitch.

Unvoiced sounds: produced by forcing air at high velocities through a constriction, giving noise-like turbulence. They show little long-term periodicity, though short-term correlations are still present. E.g. "S", "F".

Plosive sounds: a complete closure in the vocal tract; air pressure is built up and released suddenly. E.g. "B", "P".
Speech Model
SPEECH SOUNDS

Coarse classification is done with phonemes. A phone is the acoustic realization of a phoneme. Allophones are context-dependent realizations of a phoneme.
PHONEME HIERARCHY

Speech sounds are language dependent; there are about 50 phonemes in English.

• Vowels: iy, ih, ae, aa, ah, ao, ax, eh, er, ow, uh, uw
• Diphthongs: ay, ey, oy, aw
• Consonants:
  – Plosive: p, b, t, d, k, g
  – Nasal: m, n, ng
  – Fricative: f, v, th, dh, s, z, sh, zh, h
  – Retroflex liquid: r
  – Lateral liquid: l
  – Glide: w, y
Sounds like /SH/ and /S/ look like (spectrally shaped) random noise, while the vowel sounds /UH/, /IY/, and /EY/ are highly structured and quasi-periodic. These differences result from the distinctively different ways that these sounds are produced.
VOWEL CHART

[Vowel chart: vowels arranged by tongue position, front-center-back across and high-mid-low down; e.g. high front i, ɪ; high back u, ʊ; mid e, ə, ʌ, o; low front ɛ, æ; low central a.]
SPEECH WAVEFORM CHARACTERISTICS

• Loudness
• Voiced/unvoiced
• Pitch
• Fundamental frequency
• Spectral envelope
• Formants
ACOUSTIC CHARACTERISTICS OF SPEECH

Pitch: the signal within each voiced interval is periodic, and the period T is called the pitch period. The pitch depends on the vowel being spoken and changes in time (T is about 70 samples in this example). f0 = 1/T is the fundamental frequency (not to be confused with the formant frequencies).
FORMANTS

Formants can be recognized in the frequency content of a signal segment. They are best described as high-energy peaks in the frequency spectrum of a speech sound.
The resonant frequencies of the vocal tract are called formant frequencies, or simply formants. The peaks of the spectrum of the vocal tract response correspond approximately to its formants. Under the linear time-invariant all-pole assumption, each vocal tract shape is characterized by a collection of formants.
Because the vocal tract is assumed stable with poles inside the unit circle, the vocal tract transfer function can be expressed either in product or partial fraction expansion form.
A detailed acoustic theory must consider the effects of the following:
• Time variation of the vocal tract shape
• Losses due to heat conduction and viscous friction at the vocal tract walls
• Softness of the vocal tract walls
• Radiation of sound at the lips
• Nasal coupling
• Excitation of sound in the vocal tract

Let us begin by considering a simple case of a lossless tube:
MULTI-TUBE APPROXIMATION OF THE VOCAL TRACT

We can represent the vocal tract as a concatenation of N lossless tubes, each with area A_k and equal length ∆x = l/N. The wave propagation time through each tube is τ = ∆x/c = l/(Nc).
Consider an N-tube model as in the previous figure. Each tube has length l_k and cross-sectional area A_k.

Assume:
• No losses
• Planar wave propagation

The wave equations hold for each section k over 0 ≤ x ≤ l_k.
SOUND PROPAGATION IN THE CONCATENATED TUBE MODEL

Boundary conditions follow from the physical principle of continuity: pressure and volume velocity must be continuous both in time and in space everywhere in the system. At the kth/(k+1)st junction we have:
ANALOGY WITH ELECTRICAL CIRCUIT TRANSMISSION LINE
PROPAGATION OF SOUND IN A UNIFORM TUBE

The vocal tract transfer function of volume velocities is
Using the boundary conditions U(0, s) = U_G(s) and P(-l, s) = 0 (derivation in the Quatieri text, pages 122-125):
The poles of the transfer function T(jΩ) occur where cos(Ωl/c) = 0. (See Quatieri, pages 119-124; the derivation of eqn. 4.18 is important.)
PROPAGATION OF SOUND IN A UNIFORM TUBE (CONT'D)

For c = 34,000 cm/s and l = 17 cm, the natural frequencies (also called the formants) are at 500 Hz, 1500 Hz, 2500 Hz, ...

The transfer function of a tube with no side branches, excited at one end with the response measured at the other, has only poles. The formant frequencies will have finite bandwidth when vocal tract losses are considered (e.g., radiation, walls, viscosity, heat). The length of the vocal tract, l, corresponds to λ1/4, 3λ2/4, 5λ3/4, ..., where λi is the wavelength of the ith natural frequency.
UNIFORM TUBE MODEL

Example: consider a uniform tube of length l = 35 cm. If the speed of sound is 350 m/s, calculate its resonances in Hz, and compare them with those of a tube of length l = 17.5 cm.

Ω_k = kπc/(2l), k = 1, 3, 5, ...  ⇒  f_k = Ω_k/(2π) = kc/(4l)

For l = 35 cm: f = 350/(4 × 0.35) × k = 250k = 250, 750, 1250, ... Hz
UNIFORM TUBE MODEL

For the 17.5 cm tube:

f_k = kc/(4l) = 350/(4 × 0.175) × k = 500k, k = 1, 3, 5, ...

f = 500, 1500, 2500, ... Hz
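As a quick check of these numbers, here is a minimal Python sketch (not part of the original slides; the function name is illustrative) that evaluates f = kc/(4l) for odd k:

def tube_resonances(length_m, c=350.0, n=3):
    # First n resonances of a lossless uniform tube, closed at one end
    # and open at the other: f = k*c/(4*l) for k = 1, 3, 5, ...
    return [(2 * k - 1) * c / (4.0 * length_m) for k in range(1, n + 1)]

print(tube_resonances(0.35))   # [250.0, 750.0, 1250.0] Hz
print(tube_resonances(0.175))  # [500.0, 1500.0, 2500.0] Hz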
APPROXIMATING VOCAL TRACT SHAPES
VOWELS

Modeled as a tube closed at one end and open at the other:
• the closure is a membrane with a slit in it
• the tube has uniform cross-sectional area
• the membrane represents the source of energy (the vocal folds)
• the energy travels through the tube
• the tube generates no energy of its own
• the tube represents an important class of resonators, with the odd quarter-wavelength relationship F_n = (2n - 1)c/4l
VOWELS

Filter characteristics for vowels:
• the vocal tract is a dynamic filter
• it is frequency dependent
• it has, theoretically, an infinite number of resonances
• each resonance has a center frequency, an amplitude and a bandwidth
• for speech, these resonances are called formants
• formants are numbered in succession from the lowest: F1, F2, F3, etc.
FRICATIVES

Modeled as a tube with a very severe constriction:
• the air exiting the constriction is turbulent
• because of the turbulence, there is no periodicity unless accompanied by voicing
• when a fricative constriction is tapered, the back cavity is involved; this resembles a tube closed at both ends, with F_n = nc/2l
• such a situation occurs primarily in articulation disorders
Introduction to Digital Speech Processing (Rabiner & Schafer), pages 20-23.
Rabiner & Schafer, pages 98-105.
SOUND SOURCE: VOCAL FOLD VIBRATION

Modeled as a volume velocity source at the glottis, U_G(jΩ).
SHORT-TIME SPEECH ANALYSIS

Segments (or frames, or vectors) are typically of length 20 ms. Over such an interval, speech characteristics are approximately constant, which allows for relatively simple modeling. Often overlapping segments are extracted.
SHORT-TIME ANALYSIS OF SPEECH
The system is an all-pole system with a system function of the form H(z) = G / (1 - Σ_{k=1}^{p} a_k z^(-k)). For all-pole linear systems, the input and output are related by a difference equation of the form s[n] = Σ_{k=1}^{p} a_k s[n - k] + G e[n].
The operator T defines the nature of the short-time analysis function, and w[n̂ - m] represents a time-shifted window sequence.
SHORT-TIME ENERGY

Simple to compute, and useful for estimating properties of the excitation function in the model. In this case the operator T simply squares the windowed samples.
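A minimal NumPy sketch of this computation (illustrative, not from the slides): square the windowed samples and sum, hopping the window along the signal.

import numpy as np

def short_time_energy(x, win, hop):
    # E[n] = sum_m (x[m] w[n - m])^2, evaluated every `hop` samples
    L = len(win)
    return np.array([np.sum((x[i:i + L] * win) ** 2)
                     for i in range(0, len(x) - L + 1, hop)])

# e.g. a 25 ms Hamming window with a 10 ms hop at fs = 16 kHz:
# energy = short_time_energy(speech, np.hamming(400), 160)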
SHORT-TIME ZERO-CROSSING RATE

A weighted average of the number of times the speech signal changes sign within the time window. Representing this operator in terms of linear filtering leads to

Z_n̂ = Σ_m (1/2) |sgn x[m] - sgn x[m - 1]| w[n̂ - m]
Since (1/2)|sgn x[m] - sgn x[m - 1]| equals 1 if x[m] and x[m - 1] have different algebraic signs and 0 if they have the same sign, Z_n̂ is a weighted sum of all the instances of alternating sign (zero-crossings) that fall within the support region of the shifted window w[n̂ - m].
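A matching sketch for the zero-crossing rate (illustrative; a rectangular window is assumed here, where the weighted form above would multiply by w[n̂ - m] instead of averaging):

import numpy as np

def short_time_zcr(x, wlen, hop):
    # 0.5*|sgn x[m] - sgn x[m-1]| is 1 at each sign change, 0 otherwise
    changes = 0.5 * np.abs(np.diff(np.sign(x)))
    return np.array([changes[i:i + wlen - 1].mean()   # crossings per sample
                     for i in range(0, len(x) - wlen + 1, hop)])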
The figure shows an example of the short-time energy and zero-crossing rate for a segment of speech with a transition from unvoiced to voiced speech. In both cases, the window is a Hamming window of duration 25 ms (equivalent to 401 samples at a 16 kHz sampling rate). Thus, both the short-time energy and the short-time zero-crossing rate are outputs of a lowpass filter whose frequency response is as shown.
The short-time energy and zero-crossing rate functions are slowly varying compared to the time variations of the speech signal, and therefore they can be sampled at a much lower rate than the original speech signal. For finite-length windows like the Hamming window, this reduction of the sampling rate is accomplished by moving the window position n̂ in jumps of more than one sample.
During the unvoiced interval, the zero-crossing rate is relatively high compared to the zero-crossing rate in the voiced interval. Conversely, the energy is relatively low in the unvoiced region compared to the energy in the voiced region.
SHORT-TIME AUTOCORRELATION FUNCTION (STACF)

The autocorrelation function is often used as a means of detecting periodicity in signals, and it is also the basis for many spectrum analysis methods. The STACF is defined as the deterministic autocorrelation function of the sequence x_n̂[m] = x[m] w[n̂ - m] that is selected by the window shifted to time n̂.
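In code, the STACF of one windowed segment might look like this sketch (illustrative names; a symmetric window is assumed so that x[m]w[n̂ - m] reduces to windowing the segment starting at the chosen sample):

import numpy as np

def st_autocorr(x, start, win, max_lag):
    # R[k] = sum_m x_w[m] x_w[m + k] for the windowed segment x_w
    seg = x[start:start + len(win)] * win
    return np.array([np.sum(seg[:len(seg) - k] * seg[k:])
                     for k in range(max_lag + 1)])

# A voiced segment shows peaks in R[k] at multiples of the pitch period.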
Here e[n] is the excitation to the linear system with impulse response h[n]. A well known, and easily proved, property of the autocorrelation function is that the autocorrelation function of s[n] = e[n] ∗ h[n] is the convolution of the autocorrelation functions of e[n] and h[n].
SHORT-TIME FOURIER TRANSFORM (STFT)

The expression for the discrete-time STFT at time n is given below, where w[n] is assumed to be non-zero only in the interval [0, N_w - 1] and is referred to as the analysis window, or sometimes as the analysis filter.
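A frame-based sketch of the discrete-time STFT (illustrative; it computes each frame's DFT, which matches the sliding definition up to a per-frame linear phase factor):

import numpy as np

def stft(x, win, hop, nfft):
    # one row per analysis time n = 0, hop, 2*hop, ...; one column per DFT bin k
    L = len(win)
    frames = [x[i:i + L] * win for i in range(0, len(x) - L + 1, hop)]
    return np.array([np.fft.fft(f, nfft) for f in frames])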
FILTERING VIEW
SHORT-TIME SYNTHESIS

The problem is to obtain a sequence back from its discrete-time STFT. The equation below represents a synthesis equation for the discrete-time STFT.
FILTER BANK SUMMATION (FBS) METHOD

The discrete STFT is considered to be the set of outputs of a bank of filters. The output of each filter is modulated with a complex exponential, and these modulated filter outputs are summed at each instant of time to obtain the corresponding time sample of the original sequence. That is, given a discrete STFT X(n, k), the FBS method synthesizes a sequence y[n] satisfying

y[n] = (1/(N w[0])) Σ_{k=0}^{N-1} X(n, k) e^(j2πkn/N)
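A sketch of the FBS sum (illustrative; it assumes X was computed at every sample n, i.e. hop 1, with the sliding-phase convention X(n, k) = Σ_m x[m]w[n-m]e^(-j2πkm/N), a window shorter than N, and w[0] ≠ 0):

import numpy as np

def fbs_synthesis(X, w0, N):
    # y[n] = 1/(N w[0]) * sum_k X[n, k] exp(j 2 pi k n / N)
    n = np.arange(X.shape[0])[:, None]
    k = np.arange(N)[None, :]
    return np.real(np.sum(X * np.exp(2j * np.pi * k * n / N), axis=1)) / (N * w0)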
OVERLAP-ADD (OLA) METHOD

Just as the FBS method was motivated by the filtering view of the STFT, the OLA method is motivated by the Fourier transform view of the STFT. In this method, for each fixed time, we take the inverse DFT of the corresponding frequency function. Instead of dividing out the analysis window from each of the resulting short-time sections, we perform an overlap-and-add operation between the short-time sections.
Given a discrete STFT X(n, k), the OLA method synthesizes a sequence y[n] given by the expression below.
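A sketch of OLA synthesis (illustrative; dividing by the running sum of shifted windows makes the reconstruction exact wherever that sum is non-zero):

import numpy as np

def ola_synthesis(spectra, win, hop, nfft, out_len):
    y = np.zeros(out_len)
    wsum = np.zeros(out_len)
    for r, X in enumerate(spectra):                  # one spectrum per frame time rL
        seg = np.real(np.fft.ifft(X, nfft))[:len(win)]
        i = r * hop
        y[i:i + len(win)] += seg                     # overlap and add the sections
        wsum[i:i + len(win)] += win                  # running sum of shifted windows
    wsum[wsum == 0] = 1.0
    return y / wsum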
Furthermore, if the discrete STFT has been decimated in time by a factor L, it can be similarly shown that exact reconstruction holds if the shifted copies of the analysis window sum to a constant, i.e., Σ_r w[rL - n] = constant.
DESIGN OF DIGITAL FILTER BANKS

See Rabiner & Schafer, pages 282-297.
USING IIR FILTER
USING FIR FILTER
FILTER BANK ANALYSIS AND SYNTHESIS
FBS synthesis results in multiple copies of the
input:
PHASE VOCODER

The Fourier series is computed over a sliding window of a single pitch period duration and provides a measure of the amplitude and frequency trajectories of the musical tones.
This can be interpreted as a real sinewave that is amplitude- and phase-modulated by the STFT, the "carrier" of the latter being the kth filter's center frequency. The STFT of a continuous-time signal is defined as:
where the initial phase serves as the initial condition. The signal is likewise referred to as the instantaneous amplitude for each channel. The resulting filter-bank output is a sinewave with, in general, a time-varying amplitude and frequency modulation. An alternative expression is,
which is the time-domain counterpart to the
frequency-domain phase derivative.
We can sample the continuous-time STFT with sampling interval T to obtain the discrete-time STFT.
SPEECH MODIFICATION
HOMOMORPHIC (CEPSTRAL) SPEECH ANALYSIS

We use the short-time cepstrum as a representation of speech and as a basis for estimating the parameters of the speech generation model. The cepstrum of a discrete-time signal is defined below.
That is, the complex cepstrum operator transforms convolution into addition. This property is what makes the cepstrum useful for speech analysis, since the model for speech production involves convolution of the excitation with the vocal tract impulse response, and our goal is often to separate the excitation signal from the vocal tract signal.
The key issue in the definition and computation of the complex cepstrum is the computation of the complex logarithm, i.e., the computation of the phase angle arg[X(e^jω)], which must be done so as to preserve an additive combination of phases for two signals combined by convolution.
THE SHORT-TIME CEPSTRUM

The short-time cepstrum is a sequence of cepstra of windowed finite-duration segments of the speech waveform.
RECURSIVE COMPUTATION OF THE COMPLEX CEPSTRUM

Another approach to computing the complex cepstrum applies only to minimum-phase signals, i.e., signals having a z-transform whose poles and zeros are inside the unit circle. An example would be the impulse response of an all-pole vocal tract model with system function H(z).
In this case, all the poles c_k must be inside the unit circle for stability of the system.
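For such a minimum-phase sequence the complex cepstrum can be computed recursively, without any phase unwrapping. A sketch (illustrative), using the standard recursion ĉ[0] = log x[0], ĉ[n] = x[n]/x[0] - Σ_{k=1}^{n-1} (k/n) ĉ[k] x[n-k]/x[0]:

import numpy as np

def minphase_cepstrum(x, n_out):
    # assumes x is minimum phase with x[0] > 0
    c = np.zeros(n_out)
    c[0] = np.log(x[0])
    for n in range(1, n_out):
        acc = x[n] / x[0] if n < len(x) else 0.0
        for k in range(1, n):
            if n - k < len(x):
                acc -= (k / n) * c[k] * x[n - k] / x[0]
        c[n] = acc
    return c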
SHORT-TIME HOMOMORPHIC FILTERING OF SPEECH (page 63, Rabiner & Schafer)
The low-quefrency part of the cepstrum is expected to be representative of the slow variations (with frequency) in the log spectrum, while the high-quefrency components correspond to the more rapid fluctuations of the log spectrum.
The spectrum for the voiced segment has a structure of periodic ripples due to the harmonic structure of the quasi-periodic segment of voiced speech. This periodic structure in the log spectrum manifests itself in the cepstrum as a peak at a quefrency of about 9 ms. The existence of this peak in the quefrency range of expected pitch periods strongly signals voiced speech. Furthermore, the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval. The autocorrelation function also displays an indication of periodicity, but not nearly as unambiguously as the cepstrum. The rapid variations of the unvoiced spectra, by contrast, appear random with no periodic structure; as a result, there is no strong peak indicating periodicity as in the voiced case.
These slowly varying log spectra clearly retain the general spectral shape, with peaks corresponding to the formant resonance structure for the segment of speech under analysis.
APPLICATION TO PITCH DETECTION

The cepstrum was first applied in speech processing to determine the excitation parameters for the discrete-time speech model. The successive spectra and cepstra are for 50 ms segments obtained by moving the window in steps of 12.5 ms (100 samples at a sampling rate of 8000 samples/sec).
For positions 1 through 5, the window includes only unvoiced speech; for positions 6 and 7, the signal within the window is partly voiced and partly unvoiced; for positions 8 through 15, the window includes only voiced speech. The rapid variations of the unvoiced spectra appear random with no periodic structure, while the spectra for voiced segments have a structure of periodic ripples due to the harmonic structure of the quasi-periodic segment of voiced speech.
The cepstrum peak at a quefrency of about 11-12 ms strongly signals voiced speech, and the quefrency of the peak is an accurate estimate of the pitch period during the corresponding speech interval. The presence of a strong peak implies voiced speech, and the quefrency location of the peak gives the estimate of the pitch period.
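A compact sketch of a cepstral pitch detector (illustrative; the 50-500 Hz search range is an assumption, not from the slides):

import numpy as np

def cepstral_pitch(frame, fs, fmin=50.0, fmax=500.0):
    w = np.hamming(len(frame))
    spec = np.fft.fft(frame * w)
    ceps = np.real(np.fft.ifft(np.log(np.abs(spec) + 1e-12)))  # real cepstrum
    qmin, qmax = int(fs / fmax), int(fs / fmin)   # quefrency range in samples
    q = qmin + np.argmax(ceps[qmin:qmax])
    return fs / q, ceps[q]  # f0 estimate; a large peak value implies voiced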
MEL-FREQUENCY CEPSTRUM COEFFICIENTS (MFCC)

The idea is to compute a frequency analysis based upon a filter bank with approximately critical-band spacing of the filters and bandwidths. For a 4 kHz bandwidth, approximately 20 filters are used. A short-time Fourier analysis is done first, resulting in a DFT X_n̂[k] for analysis time n̂. Then the DFT values are grouped together in critical bands and weighted by a triangular weighting function.
The bandwidths are constant for center frequencies below 1 kHz and then increase exponentially up to half the sampling rate of 4 kHz, resulting in a total of 22 filters. The mel-frequency spectrum at analysis time n̂ is defined for r = 1, 2, ..., R as shown below.
The leading factor is a normalizing factor for the rth mel-filter. For each frame, a discrete cosine transform of the log of the magnitude of the filter outputs is computed to form the function mfcc_n̂[m].
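A condensed MFCC sketch (illustrative; the 2595·log10(1 + f/700) mel mapping and the exact filter-bank sizes are common choices, not prescribed by the slides):

import numpy as np

def mfcc(frame, fs, n_filt=20, n_ceps=13, nfft=512):
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), nfft))
    edges = imel(np.linspace(mel(0.0), mel(fs / 2.0), n_filt + 2))
    bins = np.floor((nfft + 1) * edges / fs).astype(int)
    fb = np.zeros((n_filt, len(spec)))
    for r in range(n_filt):                       # triangular weighting per band
        a, b, c = bins[r], bins[r + 1], bins[r + 2]
        fb[r, a:b] = np.linspace(0.0, 1.0, b - a, endpoint=False)
        fb[r, b:c] = np.linspace(1.0, 0.0, c - b, endpoint=False)
    logE = np.log(fb @ spec + 1e-12)              # log mel-filter outputs
    m = np.arange(n_ceps)[:, None] * (np.arange(n_filt)[None, :] + 0.5)
    return np.cos(np.pi * m / n_filt) @ logE      # DCT -> cepstral coefficients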
The figure shows the result of MFCC analysis of a frame of voiced speech in comparison with the short-time Fourier spectrum, the LPC spectrum, and a homomorphically smoothed spectrum. All these spectra are different, but they have in common peaks at the formant resonances. At higher frequencies, the reconstructed mel-spectrum has more smoothing due to the structure of the filter bank.
THE SPEECH SPECTROGRAM

The spectrogram is simply a display of the magnitude of the STFT. Specifically, the images in the figure are plots of the STFT log magnitude, where the plot axes are labeled in terms of analog time and frequency through the relations t_r = rRT and f_k = k/(NT), where T is the sampling period of the discrete-time signal x[n] = x_a(nT).
To make the image smooth, R is usually quite small compared to both the window length L and the number of samples in the frequency dimension, N, which may be much larger than the window length L. Such a function of two variables can be plotted on a two-dimensional surface as either a gray-scale or a color-mapped image. The bars on the right calibrate the color map (in dB).
If the analysis window is short, the spectrogram is called a wide-band spectrogram, which is characterized by good time resolution and poor frequency resolution. When the window length is long, the spectrogram is a narrow-band spectrogram, which is characterized by good frequency resolution and poor time resolution.
THE SPECTROGRAM

• A classic analysis tool.
• Consists of DFTs of overlapping, windowed frames.
• Displays the distribution of energy in time and frequency; typically 10 log10 |X_m(f)|^2 is displayed.
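A plotting sketch (illustrative; uses NumPy and matplotlib, with typical window choices):

import numpy as np
import matplotlib.pyplot as plt

def spectrogram(x, fs, wlen, hop, nfft=1024):
    win = np.hamming(wlen)
    F = [np.fft.rfft(x[i:i + wlen] * win, nfft)
         for i in range(0, len(x) - wlen + 1, hop)]
    S = 10 * np.log10(np.abs(np.array(F)) ** 2 + 1e-12)   # 10 log10 |X|^2
    plt.imshow(S.T, origin="lower", aspect="auto", cmap="gray_r",
               extent=[0, len(F) * hop / fs, 0, fs / 2])  # t_r = rRT, f_k = k/(NT)
    plt.xlabel("time (s)"); plt.ylabel("frequency (Hz)")
    plt.colorbar(label="dB")

# short window (e.g. 5 ms) -> wide-band; long window (e.g. 25 ms) -> narrow-band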
THE SPECTROGRAM (CONT'D)
Note the three broad peaks in the spectrum slice at time t_r = 430 ms, and observe that similar slices would be obtained at other times around t_r = 430 ms. These large peaks are representative of the underlying resonances of the vocal tract at the corresponding time in the production of the speech signal.
The lower spectrogram is not as sensitive to rapid time variations, but the resolution in the frequency dimension is much better. This window length is on the order of several pitch periods of the waveform during voiced intervals. As a result, the spectrogram no longer displays vertically oriented striations, since several periods are included in the window.
[Figure: short-time ACF for the sounds /m/, /ow/, /s/.]
CEPSTRUM

Speech wave (S) = excitation (E) · filter (H): the glottal excitation E from the vocal cords (glottis) drives the vocal tract filter H to produce the speech S.
(Source: http://home.hib.no/al/engelsk/seksjon/SOFF-MASTER/ill061.gif)
CEPSTRAL ANALYSIS

The signal s is the convolution (∗) of the glottal excitation e and the vocal tract filter h:

s(n) = e(n) ∗ h(n), where n is the time index.

After the Fourier transform, convolution becomes multiplication (·), with w the frequency variable:

S(w) = E(w) · H(w)

Taking the magnitude of the spectrum and then the logarithm:

|S(w)| = |E(w)| · |H(w)|
log10 |S(w)| = log10 |E(w)| + log10 |H(w)|

Ref: http://iitg.vlab.co.in/?sub=59&brch=164&sim=615&cnt=1
CEPSTRUM

C(n) = IDFT[log10 |S(w)|] = IDFT[log10 |E(w)| + log10 |H(w)|]

Processing chain: s(n) → windowing → DFT → S(w) → log|S(w)| → IDFT → C(n)

In C(n), the excitation and vocal-tract components appear at two different quefrency positions. Application: useful for (i) glottal excitation and (ii) vocal tract filter analysis. (n = time index, w = frequency, IDFT = inverse discrete Fourier transform.)
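A liftering sketch of the separation step (illustrative; the cutoff of 30 quefrency samples and the FFT size are assumptions): keeping only the low-quefrency part of C(n) and transforming back gives a smoothed log|S(w)| that approximates log|H(w)|, while the discarded high-quefrency remainder carries the excitation.

import numpy as np

def cepstral_envelope(frame, n_lift=30, nfft=1024):
    C = np.real(np.fft.ifft(np.log(np.abs(np.fft.fft(frame, nfft)) + 1e-12)))
    keep = np.zeros(nfft)
    keep[:n_lift] = 1.0
    keep[-(n_lift - 1):] = 1.0            # symmetric negative quefrencies
    return np.real(np.fft.fft(C * keep))  # smoothed log spectrum ~ log|H(w)|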
EXAMPLE OF CEPSTRUM

Sampling frequency: 22.05 kHz.
SUB-BAND CODING
The time-decimated subband outputs are quantized and encoded, then decoded at the receiver. In subband coding, a small number of filters with wide and overlapping bandwidths are chosen, and each bandpass filter output is quantized individually. Although the bandpass filters are wide and overlapping, careful design of the filters results in a cancellation of the quantization noise that leaks across bands.
Quadrature mirror filters are one such filter class; the figure shows an example of a two-band subband coder using two overlapping quadrature mirror filters. Quadrature mirror filters can be further subdivided from high to low by splitting the full band into two, then the resulting lower band into two, and so on.
This octave-band splitting, together with the iterative decimation, can be shown to yield a perfect reconstruction filter bank. Such octave-band filter banks, and their conditions for perfect reconstruction, are closely related to wavelet analysis/synthesis structures.
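A two-band QMF sketch (illustrative; h0 is any suitable lowpass prototype, e.g. a Johnston QMF filter, and the synthesis choice G0 = H0, G1 = -H1 is what cancels the aliasing introduced by the decimation):

import numpy as np
from scipy.signal import lfilter

def qmf_two_band(x, h0):
    h1 = h0 * (-1.0) ** np.arange(len(h0))        # H1(z) = H0(-z), mirror highpass
    lo = lfilter(h0, 1.0, x)[::2]                 # analysis: filter, decimate by 2
    hi = lfilter(h1, 1.0, x)[::2]
    up = lambda v: np.kron(v, [1.0, 0.0])         # upsample by 2 (insert zeros)
    y = 2 * lfilter(h0, 1.0, up(lo)) - 2 * lfilter(h1, 1.0, up(hi))
    return lo, hi, y                              # y ~ delayed x for a good h0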
LINEAR PREDICTION (INTRODUCTION)

The object of linear prediction is to estimate the output sequence from a linear combination of input samples, past output samples, or both:

ŷ(n) = Σ_{j=0}^{q} b(j) x(n - j) - Σ_{i=1}^{p} a(i) y(n - i)

The factors a(i) and b(j) are called predictor coefficients.
LINEAR PREDICTION (INTRODUCTION)

Many systems of interest to us are describable by a linear, constant-coefficient difference equation:

Σ_{i=0}^{p} a(i) y(n - i) = Σ_{j=0}^{q} b(j) x(n - j)

If Y(z)/X(z) = H(z), where H(z) is a ratio of polynomials N(z)/D(z), then

N(z) = Σ_{j=0}^{q} b(j) z^(-j)  and  D(z) = Σ_{i=0}^{p} a(i) z^(-i)

Thus the predictor coefficients give us immediate access to the poles and zeros of H(z).
LINEAR PREDICTION (TYPES OF SYSTEM MODEL)

There are two important variants:

• All-pole model (in statistics, the autoregressive (AR) model): the numerator N(z) is a constant.
• All-zero model (in statistics, the moving-average (MA) model): the denominator D(z) is equal to unity.

The mixed pole-zero model is called the autoregressive moving-average (ARMA) model.
LINEAR PREDICTION (DERIVATION OF LP EQUATIONS)

Given a zero-mean signal y(n), in the AR model:

ŷ(n) = -Σ_{i=1}^{p} a(i) y(n - i)

The error is:

e(n) = y(n) - ŷ(n) = Σ_{i=0}^{p} a(i) y(n - i)   (with a(0) = 1)

To derive the predictor we use the orthogonality principle, which states that the desired coefficients are those which make the error orthogonal to the samples y(n-1), y(n-2), ..., y(n-p).
LINEAR PREDICTION (DERIVATION OF LP EQUATIONS)

Thus we require that

⟨y(n - j) e(n)⟩ = 0  for j = 1, 2, ..., p

or

⟨y(n - j) Σ_{i=0}^{p} a(i) y(n - i)⟩ = 0

Interchanging the operations of averaging and summing, and representing ⟨·⟩ by summing over n, we have

Σ_{i=0}^{p} a(i) Σ_n y(n - i) y(n - j) = 0,   j = 1, ..., p

The required predictors are found by solving these equations.
LINEAR PREDICTION (DERIVATION OF LP EQUATIONS)

The orthogonality principle also states that the resulting minimum error is given by

E = ⟨e²(n)⟩ = ⟨e(n) y(n)⟩

or

Σ_{i=0}^{p} a(i) Σ_n y(n - i) y(n) = E

We can minimize the error over all time:

Σ_{i=0}^{p} a(i) r_i = E,   where  r_i = Σ_{n=-∞}^{∞} y(n) y(n - i)

Σ_{i=0}^{p} a(i) r_{i-j} = 0,   j = 1, 2, ..., p
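These normal equations have Toeplitz structure and are usually solved with the Levinson-Durbin recursion. A sketch (illustrative; sign convention a(0) = 1 and e(n) = Σ a(i) y(n-i), as above):

import numpy as np

def levinson_durbin(r, p):
    # solves sum_{i=0}^{p} a(i) r(i-j) = 0, j = 1..p, with a(0) = 1
    a = np.zeros(p + 1); a[0] = 1.0
    E = r[0]                                      # prediction error energy
    for m in range(1, p + 1):
        k = -np.dot(a[:m], r[m:0:-1]) / E         # reflection coefficient
        a[1:m + 1] = a[1:m + 1] + k * a[m - 1::-1][:m]
        E *= 1.0 - k * k
    return a, E

# r[i] = sum_n y(n) y(n-i) of a windowed frame, e.g.:
# r = np.array([np.dot(y[:len(y)-i], y[i:]) for i in range(p + 1)])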
LINEAR PREDICTION (APPLICATIONS)

Autocorrelation matching: we have a signal y(n) with known autocorrelation r_yy(n). We model it with the AR system below, in which the excitation e(n) with gain σ drives the all-pole filter

H(z) = σ / A(z),   A(z) = 1 - Σ_{i=1}^{p} a_i z^(-i)

so that the output y(n) has the prescribed autocorrelation.
LINEAR PREDICTION (ORDER OF LINEAR PREDICTION)

The choice of predictor order depends on the analysis bandwidth BW. The rule of thumb is

p = 2·BW/1000 + c

For a normal vocal tract, there is an average of about one formant per kilohertz of bandwidth. One formant requires two complex-conjugate poles; hence for every formant we require two predictor coefficients, i.e., two coefficients per kilohertz of bandwidth (plus a small constant c).
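As a quick numeric check of the rule (illustrative; c = 2 extra coefficients is a common allowance, not prescribed by the slides):

def lpc_order(bw_hz, c=2):
    # two predictor coefficients per kHz of bandwidth, plus a small constant
    return int(2 * bw_hz / 1000) + c

print(lpc_order(4000))   # 10 for a 4 kHz analysis bandwidth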
LINEAR PREDICTION (AR MODELING OF SPEECH SIGNAL)

True model: for voiced speech, a DT impulse generator running at the pitch period drives a glottal filter G(z), producing the volume velocity U(n); for unvoiced speech, an uncorrelated noise generator is used instead. Each branch has its own gain, and a voiced/unvoiced (V/U) switch selects the excitation, which passes through the vocal tract filter H(z) and the lip radiation filter R(z) to produce the speech signal s(n).
Using LP analysis: the same structure is collapsed into a single all-pole (AR) filter. A DT impulse generator (voiced, with a pitch estimate and gain) or a white noise generator (unvoiced), selected by the V/U switch, drives the all-pole filter H(z) to produce an estimate of the speech signal s(n).