Extraction of Features, V, Zheng-Hua Tan 1
Extraction and Representation of Features, Spring 2011
Zheng-Hua Tan
Multimedia Information and Signal Processing Department of Electronic Systems
Aalborg University, Denmark [email protected]
Lecture 5: Speech and Audio Analysis
Feature extraction
A special form of dimensionality reduction, used when the input data is too large to be stored or processed, or redundant (much data, but not much information).
Data is transformed into a compact representation – a set of features.
Speech and audio signals have a lot in common.
Speech analysis
Most applications of speech processing must exploit the properties of speech signals.
Speech analysis: the process of extracting such properties from a speech signal.
[Diagram: speech → speech analysis (DSP) → representation of speech → applications, e.g. pitch, formants, boundary detection, speech and speaker recognition]
Speech analysis: Short-time analysis
Short-time speech analysis
Time-domain processing
Frequency-domain (spectral) processing
Linear predictive coding (LPC) analysis
Cepstral analysis
Filter bank analysis
Properties of speech signals
Speech is a time-varying signal: its excitation, pitch and amplitude all change over time.
Short-time processing solution
Assuming that speech has non-time-varying properties (fixed excitation and vocal tract) within short intervals
Processing short segments (frames) of the speech signal each time
$f_x(n, m) = x(m)\, w(n - m)$
Frame-by-frame processing
Frames often overlap one another
The frame-based analysis yields a time-varying sequence as a new representation of the speech signal:
samples at 8000/sec → vectors at 100/sec
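The frame-by-frame idea above can be sketched as follows. This is a minimal illustration, not the lecture's own code: the 8000 samples/sec and 100 frames/sec rates come from the slide, while the 25 ms frame length and the function name `frame_signal` are assumptions.

```python
import numpy as np

def frame_signal(x, frame_len, hop):
    """Split x into overlapping frames, one frame per row."""
    n_frames = 1 + (len(x) - frame_len) // hop
    # Row i holds samples i*hop .. i*hop + frame_len - 1
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    return x[idx]

fs = 8000                        # samples per second, as on the slide
hop = fs // 100                  # 80 samples, i.e. 100 frames per second
frame_len = 200                  # assumed 25 ms frame, so consecutive frames overlap
x = np.arange(fs, dtype=float)   # one second of placeholder samples
frames = frame_signal(x, frame_len, hop)
print(frames.shape)
```

Each row would then be multiplied by a window and turned into one feature vector.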
Windows
Rectangular window:
$w[n] = 1, \quad 0 \le n \le N-1$
Hamming window:
$w[n] = 0.54 - 0.46 \cos\left(\frac{2\pi n}{N-1}\right), \quad 0 \le n \le N-1$
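The two window definitions translate directly into code. The sketch below uses illustrative function names (`rectangular`, `hamming`) and assumes N > 1.

```python
import numpy as np

def rectangular(N):
    """w[n] = 1 for 0 <= n <= N-1."""
    return np.ones(N)

def hamming(N):
    """w[n] = 0.54 - 0.46 cos(2 pi n / (N-1)) for 0 <= n <= N-1."""
    n = np.arange(N)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))

w_rect = rectangular(64)
w_hamm = hamming(64)
```

NumPy's built-in `np.hamming(N)` uses the same formula, so it can be used instead.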
Choice of window
Window type
  Bandwidth of the Hamming window is about twice the bandwidth of the rectangular window
  Attenuation outside the passband is more than 40 dB for Hamming, compared with 14 dB for rectangular
Window duration N
  Increasing N decreases the window bandwidth
  N should be larger than a pitch period, but smaller than a sound duration
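The bandwidth and attenuation figures can be checked numerically. The sketch below is one possible way to do so, under assumptions: it measures the main lobe up to the first spectral minimum and takes the highest level beyond it as the sidelobe level; N = 64 and the FFT size are arbitrary choices.

```python
import numpy as np

def main_lobe_and_sidelobe(w, nfft=8192):
    """Return (bin of first spectral minimum, peak sidelobe level in dB)."""
    mag = np.abs(np.fft.rfft(w, nfft))
    mag_db = 20 * np.log10(np.maximum(mag, 1e-12) / mag[0])
    k = 0
    while k < len(mag_db) - 1 and mag_db[k + 1] < mag_db[k]:
        k += 1                       # walk down the main lobe
    return k, mag_db[k:].max()       # highest level beyond the main lobe

N = 64
rect_edge, rect_sl = main_lobe_and_sidelobe(np.ones(N))
hamm_edge, hamm_sl = main_lobe_and_sidelobe(np.hamming(N))
print(rect_edge, rect_sl)
print(hamm_edge, hamm_sl)
```

The Hamming main lobe comes out roughly twice as wide as the rectangular one, which is the price paid for the lower sidelobes.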
Speech analysis: Time-domain
Short-time speech analysis
Time-domain speech processing
Frequency-domain (spectral) processing
Linear predictive coding (LPC) analysis
Cepstral analysis
Filter bank analysis
Time-domain parameters
Short-time energy
Short-time average magnitude
Short-time zero crossing rate
Short-time autocorrelation
Short-time average magnitude difference
Short-time energy
The long-term energy definition is not useful for time-varying signals:
$E = \sum_{m} x^2(m)$
Short-time energy of the weighted signal around n is defined as
$E_n = \sum_{m} [x(m)\, w(n-m)]^2$
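Since the short-time energy is the convolution of x²(m) with w²(m), it can be computed in one call. A minimal sketch, assuming a causal window of length N (non-zero for 0 ≤ n ≤ N-1); the function name is illustrative.

```python
import numpy as np

def short_time_energy(x, w):
    """E_n = sum_m [x(m) w(n-m)]^2 for n = 0 .. len(x)-1."""
    return np.convolve(x ** 2, w ** 2)[:len(x)]

x = np.ones(10)                          # placeholder signal
E = short_time_energy(x, np.ones(4))     # rectangular window, N = 4
print(E)
```

With a rectangular window this is simply a running sum of the last N squared samples.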
Examples of short-time energy
It can be used to detect voiced/unvoiced/silence segments.
Effects of window type and duration N (bandwidth), and why?
Uttered by a male speaker. Two plots converge as N increases.
Short-time magnitude
Less sensitive to large signal levels than the energy, where $x^2(n)$ terms are used.
$M_n = \sum_{m} |x(m)|\, w(n-m)$
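The difference in sensitivity can be illustrated by scaling a signal: doubling the amplitude doubles the short-time magnitude but quadruples the short-time energy. A small sketch with an assumed rectangular window and random placeholder data:

```python
import numpy as np

def short_time_magnitude(x, w):
    """M_n = sum_m |x(m)| w(n-m) for n = 0 .. len(x)-1."""
    return np.convolve(np.abs(x), w)[:len(x)]

def short_time_energy(x, w):
    """E_n = sum_m [x(m) w(n-m)]^2 for n = 0 .. len(x)-1."""
    return np.convolve(x ** 2, w ** 2)[:len(x)]

rng = np.random.default_rng(0)
x = rng.standard_normal(400)     # placeholder signal
w = np.ones(80)                  # assumed rectangular window
M1, M2 = short_time_magnitude(x, w), short_time_magnitude(2 * x, w)
E1, E2 = short_time_energy(x, w), short_time_energy(2 * x, w)
```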
Short-time average zero-crossing rate
A zero-crossing occurs if successive samples have different algebraic signs.
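It is a simple measure of the frequency content of a signal.
Definition:
$Z_n = \sum_{m} \left| \operatorname{sgn}[x(m)] - \operatorname{sgn}[x(m-1)] \right| w(n-m)$
where
$\operatorname{sgn}[x(n)] = \begin{cases} 1, & x(n) \ge 0 \\ -1, & x(n) < 0 \end{cases}$
and
$w(n) = \begin{cases} \frac{1}{2N}, & 0 \le n \le N-1 \\ 0, & \text{otherwise} \end{cases}$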
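A direct implementation of this definition is sketched below. Each sign change contributes 2 × 1/(2N) = 1/N, so Z_n is the number of zero crossings per sample in the window. The test tone (1 kHz at 8 kHz sampling, with a small phase offset so no sample is exactly zero) is an assumption for illustration.

```python
import numpy as np

def zero_crossing_rate(x, N):
    """Z_n with w(n) = 1/(2N) on 0..N-1, for n = 0 .. len(x)-1."""
    sgn = np.where(x >= 0, 1, -1)
    d = np.abs(np.diff(sgn, prepend=sgn[0]))   # 2 at each sign change, else 0
    w = np.full(N, 1.0 / (2 * N))
    return np.convolve(d, w)[:len(x)]

fs = 8000
t = np.arange(800) / fs
x = np.sin(2 * np.pi * 1000 * t + 0.1)   # assumed 1 kHz test tone
Z = zero_crossing_rate(x, N=80)          # 10 ms window
f_est = Z[-1] * fs / 2                   # two crossings per cycle
print(f_est)
```

For a pure tone, half the crossing rate times the sampling rate gives back the tone frequency.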
Zero-crossing rate distributions
A histogram of average zero-crossing rates (averaged over 10 msec) for both voiced and unvoiced speech
In different frequency bands
Example: 80 zero crossings per 10 ms correspond to (80/2)/10 ms = 4 kHz, since each cycle has two zero crossings.
Example of zero-crossing rate
Although the zero-crossing rate varies considerably, the voiced and unvoiced regions are quite prominent.
Short-time autocorrelation function
The autocorrelation function:
$\phi(k) = \sum_{m} x(m)\, x(m+k)$
The short-time autocorrelation function:
$R_n(k) = \sum_{m} x(m)\, w(n-m)\, x(m+k)\, w(n-k-m)$
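Writing f(m) = x(m) w(n-m), the short-time autocorrelation is the autocorrelation of the windowed segment, R_n(k) = Σ f(m) f(m+k). The sketch below assumes a causal window of length N and n ≥ N-1; the cosine test signal is illustrative.

```python
import numpy as np

def short_time_autocorr(x, w, n, max_lag):
    """R_n(k) for k = 0 .. max_lag, with w non-zero on 0 .. N-1."""
    N = len(w)
    m = np.arange(n - N + 1, n + 1)     # samples where w(n-m) is non-zero
    seg = x[m] * w[n - m]               # windowed segment f(m)
    return np.array([np.dot(seg[:N - k], seg[k:]) for k in range(max_lag + 1)])

x = np.cos(2 * np.pi * np.arange(400) / 50)    # period of 50 samples
R = short_time_autocorr(x, np.ones(200), n=299, max_lag=100)
print(R[0], R[50])
```

With a finite window, fewer products enter the sum as k grows, so the peaks at multiples of the period are tapered.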
Applications
Boundary detection: short-time energy, zero-crossing rate
Pitch estimation: short-time autocorrelation function
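As a rough illustration of pitch estimation, the sketch below picks the largest autocorrelation value within an assumed pitch range of 50 to 400 Hz. The synthetic two-harmonic frame, the range limits and the function name are all assumptions; practical pitch trackers need further safeguards.

```python
import numpy as np

def estimate_pitch(frame, fs, f_min=50.0, f_max=400.0):
    """Return fs / lag for the autocorrelation peak in the assumed range."""
    N = len(frame)
    lags = np.arange(int(fs / f_max), int(fs / f_min) + 1)
    R = np.array([np.dot(frame[:N - k], frame[k:]) for k in lags])
    return fs / lags[np.argmax(R)]

fs = 8000
m = np.arange(240)                                  # 30 ms frame
theta = 2 * np.pi * 100 / fs                        # 100 Hz fundamental
frame = np.sin(theta * m) + 0.5 * np.sin(2 * theta * m)
f0 = estimate_pitch(frame, fs)
print(f0)
```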
Speech analysis: Frequency-domain
Short-time speech analysis
Time-domain speech processing
Frequency-domain (spectral) processing
Linear predictive coding (LPC) analysis
Cepstral analysis
Filter bank analysis
Discrete-time Fourier transform
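$X(e^{j\omega}) = \sum_{n} x[n]\, e^{-j\omega n}$
$x[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} X(e^{j\omega})\, e^{j\omega n}\, d\omega$
Convolution and multiplication duality:
$y[n] = x[n] * h[n] \;\Leftrightarrow\; Y(e^{j\omega}) = X(e^{j\omega})\, H(e^{j\omega})$
$y[n] = x[n]\, w[n] \;\Leftrightarrow\; Y(e^{j\omega}) = \frac{1}{2\pi} \int_{-\pi}^{\pi} W(e^{j\theta})\, X(e^{j(\omega - \theta)})\, d\theta$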
Short-time Fourier transform
It is motivated by the need for a spectral representation to reflect the time-varying properties of the speech waveform
$X_n(e^{j\omega}) = \sum_{m} w[n-m]\, x[m]\, e^{-j\omega m}$
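The short-time Fourier transform can be evaluated directly from its definition at one time index n. The sketch below assumes a causal window of length N and an arbitrary frequency grid; in practice an FFT of each windowed frame would be used instead.

```python
import numpy as np

def stft_at(x, w, n, omegas):
    """X_n(e^{jw}) = sum_m w[n-m] x[m] e^{-jwm}, evaluated at each w in omegas."""
    N = len(w)
    m = np.arange(max(0, n - N + 1), min(n, len(x) - 1) + 1)
    seg = w[n - m] * x[m]
    return np.array([np.sum(seg * np.exp(-1j * om * m)) for om in omegas])

omegas = 2 * np.pi * np.arange(129) / 256      # assumed grid from 0 to pi
omega0 = omegas[32]                            # test tone at pi/4
x = np.cos(omega0 * np.arange(400))
X = stft_at(x, np.hamming(128), n=200, omegas=omegas)
peak = int(np.argmax(np.abs(X)))
print(peak)
```

Repeating this for successive values of n gives the time-varying spectrum.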
Spectra
Spectra of voiced/unvoiced sounds
Spectrogram
Spectrogram: the two-dimensional waveform (amplitude/time) is converted into a three-dimensional pattern (amplitude/frequency/time).
Wideband spectrogram: analyzed on 15 ms sections of the waveform with a step of 1 ms. Voiced regions show vertical striations due to the periodicity of the time waveform (each vertical line represents a pulse of the vocal folds), while unvoiced regions are solid/random, or ‘snowy’.
Narrowband spectrogram: analyzed on 50 ms sections. The pitch of voiced intervals appears as horizontal lines.
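The effect of the analysis length can be tried on a synthetic pulse train. The sketch below is an illustration under assumptions: a Hamming window, an 8 kHz sampling rate, a 100 Hz pulse train standing in for voiced speech, and the same 1 ms step for both spectrograms (the slide gives the step only for the wideband case).

```python
import numpy as np

def spectrogram(x, fs, win_ms, step_ms):
    """Magnitude spectrogram: one row per frame, one column per frequency bin."""
    N = int(round(fs * win_ms / 1000))
    hop = int(round(fs * step_ms / 1000))
    w = np.hamming(N)
    n_frames = 1 + (len(x) - N) // hop
    frames = np.stack([x[i * hop : i * hop + N] * w for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

fs = 8000
x = np.zeros(4000)
x[::80] = 1.0                                   # one pulse every 10 ms (100 Hz)
wide = spectrogram(x, fs, win_ms=15, step_ms=1)
narrow = spectrogram(x, fs, win_ms=50, step_ms=1)
print(wide.shape, narrow.shape)
```

In the narrowband case the bins are 20 Hz apart, so the harmonics at multiples of 100 Hz stand out as separate horizontal lines.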
Wide- and narrow-band spectrograms
[Figure: wideband spectrogram, waveform and narrowband spectrogram, with formants F1, F2 and F3 marked]
Speech analysis: LPC analysis
Short-time speech analysis
Time-domain speech processing
Frequency-domain (spectral) processing
Linear predictive coding (LPC) analysis
Cepstral analysis
Filter bank analysis
Discrete-time filter model for speech
Its philosophy is related to the speech model in which speech is modelled as the output of a linear, time-varying system excited by either quasi-periodic pulses or random noise.
The LPC provides a robust and accurate method for estimating the parameters of the time-varying system.
[Figure: discrete-time filter model for speech, with system function H(z)]
LPC analysis
For efficient coding, speech signals are often modelled using parameters of the vocal tract shape that generates them.
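Pole-zero model (ideal during a stationary frame):
$H(z) = \frac{S(z)}{U(z)} = G\, \frac{1 + \sum_{l=1}^{q} b_l z^{-l}}{1 - \sum_{k=1}^{p} a_k z^{-k}}$
All-pole model (simple): a matter of analytical necessity
$H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}$
One zero corresponds to multiple poles (for $|a z^{-1}| < 1$):
$1 - a z^{-1} = \frac{1}{\sum_{n=0}^{\infty} a^n z^{-n}}$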
All-pole model – the LPC model
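$H(z) = \frac{S(z)}{U(z)} = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}$
$S(z) = \frac{G}{1 - \sum_{k=1}^{p} a_k z^{-k}}\, U(z)$
$S(z) = \sum_{k=1}^{p} a_k z^{-k} S(z) + G\, U(z)$
$s(n) = \sum_{k=1}^{p} a_k s(n-k) + G\, u(n)$
where u(n) is a normalised excitation and G is the gain of the excitation.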
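The time-domain difference equation of the all-pole model can be run directly as a synthesis filter. A minimal sketch, assuming zero initial conditions; the function name and the one-pole example values are illustrative.

```python
import numpy as np

def lpc_synthesize(a, G, u):
    """s(n) = sum_{k=1}^{p} a_k s(n-k) + G u(n), with s(n) = 0 for n < 0."""
    p = len(a)
    s = np.zeros(len(u))
    for n in range(len(u)):
        for k in range(1, p + 1):
            if n - k >= 0:
                s[n] += a[k - 1] * s[n - k]
        s[n] += G * u[n]
    return s

u = np.zeros(5)
u[0] = 1.0                                  # unit impulse as excitation
s = lpc_synthesize(a=[0.5], G=2.0, u=u)     # one pole at z = 0.5
print(s)
```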
The LPC model
After excluding the excitation term, a given speech sample at time n, s(n), can be approximated as a linear combination of the past p speech samples:
$\tilde{s}(n) = \alpha_1 s(n-1) + \alpha_2 s(n-2) + \ldots + \alpha_p s(n-p)$
where the coefficients $\alpha_1, \alpha_2, \ldots, \alpha_p$ are assumed constant over the speech frame.
LPC analysis: to determine a set of predictor coefficients $\{\alpha_k\}$ directly from the speech signal.
LPC analysis equations
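Windowed speech:
$x(n) = s(n)\, w(n)$
Error of linear predictor, where $\hat{s}(n)$ is the predicted sample:
$e(n) = s(n) - \hat{s}(n) = s(n) - \sum_{k=1}^{p} \alpha_k s(n-k)$
Method: minimise mean-squared prediction error
Short-time average prediction error:
$E_n = \sum_{m} e_n^2(m) = \sum_{m} \left[ s_n(m) - \sum_{k=1}^{p} \alpha_k s_n(m-k) \right]^2$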
LPC analysis equations (cont’d)
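Find $\alpha_k$ such that $E_n$ is minimal, where
$E_n = \sum_{m} e_n^2(m) = \sum_{m} \left[ s_n(m) - \sum_{k=1}^{p} \alpha_k s_n(m-k) \right]^2$
$\frac{\partial E_n}{\partial \alpha_i} = 0 \quad \text{for } i = 1, 2, \ldots, p$
Resulting in
$\sum_{m} s_n(m-i)\, s_n(m) = \sum_{k=1}^{p} \hat{\alpha}_k \sum_{m} s_n(m-i)\, s_n(m-k)$
Define covariance
$\phi_n(i, k) = \sum_{m} s_n(m-i)\, s_n(m-k)$
Then
$\sum_{k=1}^{p} \hat{\alpha}_k\, \phi_n(i, k) = \phi_n(i, 0), \quad i = 1, 2, \ldots, p$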
Short-time LP analysis
To solve the following equation for the optimum predictor coefficients (the $\hat{\alpha}_k$s)
$\sum_{k=1}^{p} \hat{\alpha}_k\, \phi_n(i, k) = \phi_n(i, 0), \quad i = 1, 2, \ldots, p$
we have to compute $\phi_n(i, k)$ and then solve the resulting set of p equations.
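The set of p equations can be built and solved numerically. The sketch below is one possible realisation under assumptions: the sum over m runs from p to N-1 within the frame (the covariance-method choice, not stated on the slide), and the test frame is the impulse response of a known second-order all-pole filter.

```python
import numpy as np

def lpc_covariance(s, p):
    """Solve sum_k alpha_k phi(i,k) = phi(i,0), i = 1..p, for the frame s."""
    m = np.arange(p, len(s))                 # assumed range: p <= m <= N-1
    def phi(i, k):
        return np.dot(s[m - i], s[m - k])
    Phi = np.array([[phi(i, k) for k in range(1, p + 1)]
                    for i in range(1, p + 1)])
    psi = np.array([phi(i, 0) for i in range(1, p + 1)])
    return np.linalg.solve(Phi, psi)

# Test frame: s(n) = 1.3 s(n-1) - 0.6 s(n-2), started by a unit impulse
s = np.zeros(50)
s[0] = 1.0
for n in range(1, len(s)):
    s[n] = 1.3 * s[n - 1] - (0.6 * s[n - 2] if n >= 2 else 0.0)
alpha = lpc_covariance(s, p=2)
print(alpha)
```

Because the excitation is zero inside the summation range, the recovered coefficients match the filter that generated the frame.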
Speech analysis: Cepstral analysis
Short-time speech analysis
Time-domain speech processing
Frequency-domain (spectral) processing
Linear predictive coding (LPC) analysis
Cepstral analysis
Filter bank analysis
Homomorphic speech processing
Again, speech is modelled as the output of a linear, time-varying system (linear time-invariant (LTI) within short segments) excited by either quasi-periodic pulses or random noise.
The problem of speech analysis is to estimate the parameters of the speech model and to measure their variations with time.
Since the excitation and impulse response of an LTI system are combined in a convolutional manner, the problem of speech analysis can also be viewed as a problem of separating the components of a convolution, called "deconvolution".
$y[n] = x[n] * h[n]$
Homomorphic deconvolution
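Converts a convolution into a sum:
$y(n) = x(n) * h(n)$
$\hat{y}(n) = \hat{x}(n) + \hat{h}(n)$
Canonic form for system for homomorphic deconvolution:
$x(n) \rightarrow D_*[\;] \rightarrow \hat{x}(n) \rightarrow L[\;] \rightarrow \hat{y}(n) \rightarrow D_*^{-1}[\;] \rightarrow y(n)$
Input: $x(n) = x_1(n) * x_2(n)$
After $D_*[\;]$: $\hat{x}(n) = \hat{x}_1(n) + \hat{x}_2(n)$
After $L[\;]$: $\hat{y}(n) = \hat{y}_1(n) + \hat{y}_2(n)$
Output: $y(n) = y_1(n) * y_2(n)$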
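The characteristic system for homomorphic deconvolution
The characteristic system $D_*[\;]$:
$x(n) \rightarrow Z[\;] \rightarrow X(z) \rightarrow \log[\;] \rightarrow \hat{X}(z) \rightarrow Z^{-1}[\;] \rightarrow \hat{x}(n)$
(convolution at the input, multiplication after $Z[\;]$, addition after $\log[\;]$ and at the output)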
Cepstral analysis
Observation:
$x[n] = x_1[n] * x_2[n] \;\Rightarrow\; X(z) = X_1(z)\, X_2(z)$
Taking the logarithm of X(z), then
$\log\{X(z)\} = \log\{X_1(z)\} + \log\{X_2(z)\}$
i.e., $\hat{X}(z) = \hat{X}_1(z) + \hat{X}_2(z)$
So, the two convolved signals are additive:
$\hat{x}[n] = \hat{x}_1[n] + \hat{x}_2[n]$
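The additivity can be checked numerically with the real cepstrum, i.e. the log magnitude instead of the complex logarithm. This is a sketch under assumptions: two short made-up sequences with no spectral zeros on the unit circle, and an FFT length large enough for the linear convolution.

```python
import numpy as np

def real_cepstrum(x, nfft):
    """Inverse DFT of the log magnitude spectrum."""
    spectrum = np.fft.fft(x, nfft)
    return np.fft.ifft(np.log(np.abs(spectrum))).real

x1 = np.array([1.0, 0.5])           # illustrative sequences
x2 = np.array([1.0, -0.3, 0.2])
x = np.convolve(x1, x2)             # x = x1 * x2
nfft = 64
c_sum = real_cepstrum(x1, nfft) + real_cepstrum(x2, nfft)
c_conv = real_cepstrum(x, nfft)
print(np.max(np.abs(c_sum - c_conv)))
```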
Complex cepstrum and real cepstrum
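Complex cepstrum:
$\hat{x}[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \hat{X}(e^{j\omega})\, e^{j\omega n}\, d\omega = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log\{X(e^{j\omega})\}\, e^{j\omega n}\, d\omega$
Real cepstrum:
$c[n] = \frac{1}{2\pi} \int_{-\pi}^{\pi} \log|X(e^{j\omega})|\, e^{j\omega n}\, d\omega$
The real cepstrum $c[n]$ is the even part of $\hat{x}[n]$.
The term "cepstrum" was coined by reversing the first syllable in the word "spectrum".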
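The even-part relation can be verified on a simple example. The sketch below uses the sequence x = [1, a], chosen as an assumption because its phase never wraps, so the principal value of the complex logarithm suffices; general signals need phase unwrapping before the complex cepstrum can be computed this way.

```python
import numpy as np

a = 0.5
x = np.array([1.0, a])                         # X(z) = 1 + a z^{-1}
nfft = 1024
X = np.fft.fft(x, nfft)
x_hat = np.fft.ifft(np.log(X)).real            # complex cepstrum (no wrapping here)
c = np.fft.ifft(np.log(np.abs(X))).real        # real cepstrum
n = np.arange(nfft)
even_part = 0.5 * (x_hat + x_hat[(-n) % nfft]) # (x_hat[n] + x_hat[-n]) / 2
print(x_hat[1], c[1])
```

For this example the complex cepstrum is known in closed form, x̂[n] = (-1)^(n+1) aⁿ/n for n ≥ 1, which gives a quick sanity check.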
Mel-frequency cepstral coefficients (MFCC)
Mel-frequency filter bank
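The slide shows the filter bank only as a figure, so the sketch below rests entirely on common conventions that are assumptions here: the mel mapping mel(f) = 2595 log10(1 + f/700), triangular filters with peak value 1, and band edges equally spaced on the mel scale. Function names and parameter values are illustrative.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)   # assumed mel formula

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filter_bank(n_filters, nfft, fs):
    """Triangular filters, one per row, over the rfft frequency bins."""
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(fs / 2), n_filters + 2))
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    fb = np.zeros((n_filters, len(freqs)))
    for i in range(n_filters):
        lo, mid, hi = edges[i:i + 3]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        fb[i] = np.maximum(0.0, np.minimum(rising, falling))
    return fb

fb = mel_filter_bank(n_filters=20, nfft=256, fs=8000)
print(fb.shape)
```

Multiplying `fb` with a frame's power spectrum gives the filter-bank energies from which MFCCs are usually derived.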
Speech analysis: Filter bank analysis
Short-time speech analysis
Time-domain speech processing
Frequency-domain (spectral) processing
Linear predictive coding (LPC) analysis
Cepstral analysis
Filter bank analysis
Gammatone filters
Gammatone is a widely used model of auditory filters.
Impulse response (the product of a gamma distribution and sinusoidal tone):
$g(t) = a\, t^{n-1} e^{-2\pi b t} \cos(2\pi f t + \phi)$
where f is the frequency, φ is the phase of the carrier (tone), a is the amplitude, n is the filter's order, b is the filter's bandwidth, and t is time.
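A gammatone impulse response with these parameters can be sampled as below. The sketch assumes the commonly used form g(t) = a t^(n-1) e^(-2πbt) cos(2πft + φ) for t ≥ 0; the parameter values (order 4, 1 kHz centre frequency, 125 Hz bandwidth, 16 kHz sampling) are illustrative only.

```python
import numpy as np

def gammatone(t, f, b, n=4, a=1.0, phi=0.0):
    """Gamma-shaped envelope times a cosine carrier, for t >= 0."""
    envelope = a * t ** (n - 1) * np.exp(-2 * np.pi * b * t)
    return envelope * np.cos(2 * np.pi * f * t + phi)

fs = 16000
t = np.arange(0, 0.05, 1 / fs)
g = gammatone(t, f=1000.0, b=125.0)
env = gammatone(t, f=0.0, b=125.0)     # with f = 0 and phi = 0 only the envelope remains
t_peak = t[np.argmax(env)]
print(t_peak)
```

The envelope peaks at t = (n-1)/(2πb), so a larger bandwidth b gives a shorter impulse response.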
Gammatone filters
Summary
Short-time speech analysis
Time-domain processing
Frequency-domain (spectral) processing
Linear predictive coding (LPC) analysis
Cepstral analysis
Filter bank analysis