Audio signal classification
Contents:
- Introduction to pattern classification
- Features
- Feature selection
- Classification methods
Classification 2SGN-24006
1 Introduction to pattern classification
Refresher of the basic concepts of pattern classification:
- Introductory slides accompanying the book Pattern Classification by Duda, Hart and Stork, John Wiley & Sons, 2000.
Audio signal classification
- Concrete problems
  - musical genre classification, musical instrument recognition
  - speaker recognition, language recognition
  - audio context recognition (office vs. restaurant vs. street), video segmenting based on audio, sound effects retrieval
- Closely related to sound source recognition in humans
  - source recognition includes segmentation (perceptual sound separation) in polyphonic signals
- Many efficient methods have been developed in the speech / speaker recognition field
Example: musical instrument recognition
- Different acoustic properties of sound sources make them recognizable
  - properties result from the sound production mechanism
- But: even a single source produces varying sounds
  - we must find something characteristic of the source, not of individual sound events alone: source invariants
- Examples below: flute (left) and clarinet (right); what to measure? [Eronen & Klapuri, 2000]
Supervised classification system
- Simplified:
  - A. Feature extraction
  - B. Classification (using models)
- 1. Training phase: learn from examples, collect statistics for feature values
- 2. Classify new instances based on what was learnt in the training phase
[Block diagram; blocks: signal, segmentation, feature extraction, model training, models, classify, recognition result]
2 Feature extraction
- Feature extraction is inevitable
  - the time-domain signal as such contains too much irrelevant data to use it directly for classification
- Using appropriate features is crucial for successful classification
  - lousy features (with little discriminating power) can hardly [...]
2.1 Spectral features
= Features that characterize the short-time spectrum
- Usually the most successful features for audio classification
- Also the most general-purpose features for different problems
  - in contrast, temporal features are typically different for musical genre recognition, instrument recognition, or speaker recognition, for example
- In extracting spectral features
  - Phase spectrum is typically discarded
    - 50% reduction of information
  - Spectral fine structure is usually discarded (in most tasks)
    - even more reduction of irrelevant information
  - Retain only the coarse spectral energy distribution
    - effective for general audio classification
    - basis for speech and speaker recognition
Cepstral coefficients
- Cepstral coefficients c(k) are a very convenient way to model the spectral energy distribution
$$ c(k) = \mathrm{IDFT}\{ \log | \mathrm{DFT}\{ x(n) \} | \} $$
  where DFT denotes the discrete Fourier transform and IDFT its inverse
- In Matlab, e.g.: c = real(ifft(log(abs(fft(x)))))
  - only the real part, real( ), is taken because finite numerical precision produces an infinitesimally small imaginary part
- Cepstrum coefficients are calculated in short frames over time
  - can be further modeled by calculating e.g. the mean and variance of each coefficient over time
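The cepstrum definition c(k) = IDFT{log|DFT{x(n)}|} can be sketched for a single frame as follows. This is a minimal illustration using only the Python standard library with a naive O(N^2) DFT; the function names and the small `floor` constant (which guards against log(0)) are assumptions of this sketch, not part of the slides. A real system would use an FFT routine instead.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform of a sequence."""
    n_points = len(x)
    return [
        sum(x[n] * cmath.exp(-2j * math.pi * k * n / n_points) for n in range(n_points))
        for k in range(n_points)
    ]


def idft(spectrum):
    """Naive inverse DFT (includes the 1/N scaling)."""
    n_points = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * n / n_points) for k in range(n_points))
        / n_points
        for n in range(n_points)
    ]


def real_cepstrum(frame, floor=1e-12):
    """Cepstral coefficients of one frame: real(IDFT{log|DFT{x}|}).

    `floor` is an assumed safeguard that keeps log() away from zero.
    """
    log_magnitude = [math.log(max(abs(value), floor)) for value in dft(frame)]
    # The imaginary parts are only numerical noise, so keep the real parts.
    return [coefficient.real for coefficient in idft(log_magnitude)]
```

For features, one would keep only the first few values of `real_cepstrum(frame)` for each short frame and summarise them over time, for example with the mean and variance of each coefficient.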
Cepstral coefficients
- Only the first M cepstrum coefficients are used as features
  - all coefficients model the precise spectrum
  - coarse spectral shape is modeled by the first coefficients
  - precision is selected by the number of coefficients taken
  - the first coefficient (energy) is usually discarded
- Usually M = fs / (2000 Hz) is a good first guess for M
- Figure: piano spectrum (thin black line) and spectrum modeled with the first 50 coeffs (thick blue line) or 10 coeffs (red broken line). Axes: frequency (Hz) vs. log magnitude.
Cepstral coefficients
- A drawback of the cepstral coefficients: linear frequency scale
- Perceptually, the frequency ranges 100-200 Hz and 10-20 kHz are approximately equally important
  - the standard cepstral coefficients do not take this into account
  - a logarithmic frequency scale would be better
- Why mimic perception?
  - typically we want to classify sounds according to perceptual (dis)similarity
  - perceptually relevant features often lead to robust classification, too
- Desirable for features:
  - small change in feature vector → small perceptual change (and vice versa)
  - Mel-frequency cepstral coefficients fulfill this criterion
Frequency and magnitude warping
- Linear scale: usually hard to see anything
- Log-frequency: each octave is approximately equally important perceptually
- Log-magnitude: the perceived change from 50 dB to 60 dB is about the same as from 60 dB to 70 dB
Mel-frequency cepstral coefficients
- Improves over the standard cepstral coefficients
- Mel frequency scale:
$$ \mathrm{Mel}(f) = 2595 \log_{10}\left(1 + \frac{f}{700}\right) $$
- Block diagram (calculations for one analysis frame):
  Signal → Frame blocking, windowing → Discrete Fourier transform → Simulate Mel filterbank → Power at the output of each filter (the log of this power is taken before the DCT) → Discrete cosine transform → MFCC
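The Mel scale formula Mel(f) = 2595 log10(1 + f/700) and its use for placing filterbank bands can be sketched as below. The inverse mapping follows directly from the formula; the helper `mel_band_edges` and its parameter names are hypothetical and only illustrate one common way to space filters uniformly on the Mel scale.

```python
import math


def hz_to_mel(f_hz):
    """Mel(f) = 2595 * log10(1 + f / 700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(f_low, f_high, num_filters):
    """Edge frequencies in Hz (num_filters + 2 points) spaced uniformly in Mel.

    Filter i would span edges[i] .. edges[i + 2], centred at edges[i + 1].
    """
    mel_low, mel_high = hz_to_mel(f_low), hz_to_mel(f_high)
    step = (mel_high - mel_low) / (num_filters + 1)
    return [mel_to_hz(mel_low + i * step) for i in range(num_filters + 2)]
```

Because the spacing is uniform in Mel, the bands become wider in Hz towards high frequencies, which is the frequency warping the slides motivate perceptually.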
Mel-frequency cepstral coefficients
Some reasons why MFCCs are successful:
- Mel-frequency scale, log of power: large change in MFCC vector ⇔ large perceptual change
- Discrete cosine transform: drop spectral fine structure, decorrelate the features
- Delta MFCCs (ΔMFCCs) are usually concatenated to the feature vector to include information about the temporal variation of the features
  - let v_t be the feature vector at frame t; then the first-order delta feature is
$$ \Delta \mathbf{v}_t = \frac{\mathbf{v}_{t+1} - \mathbf{v}_{t-1}}{2} $$
  - the delta features are added after the static features in the feature vector
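A minimal sketch of the first-order delta features, (v_{t+1} - v_{t-1}) / 2, appended after the static features. How to treat the first and last frame is not stated in the slides; repeating the edge frame is an assumption made here, and the function names are illustrative.

```python
def delta_features(frames):
    """First-order deltas (v[t+1] - v[t-1]) / 2 for a list of feature vectors.

    Assumption: the first and last frames are repeated at the edges.
    """
    num_frames = len(frames)
    deltas = []
    for t in range(num_frames):
        previous_frame = frames[max(t - 1, 0)]
        next_frame = frames[min(t + 1, num_frames - 1)]
        deltas.append([(nxt - prv) / 2.0 for nxt, prv in zip(next_frame, previous_frame)])
    return deltas


def append_deltas(frames):
    """Concatenate the delta features after the static features of each frame."""
    return [static + delta for static, delta in zip(frames, delta_features(frames))]
```

With D static coefficients per frame, `append_deltas` returns vectors of length 2D.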
Other spectral features
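- Spectral centroid (correlates with brightness)
$$ SC = \frac{\sum_{k=1}^{K} k \, |X(k)|^2}{\sum_{k=1}^{K} |X(k)|^2} $$
  where |X(k)| is the magnitude of frequency component k
- Bandwidth
$$ BW = \sqrt{\frac{\sum_{k=1}^{K} (k - SC)^2 \, |X(k)|^2}{\sum_{k=1}^{K} |X(k)|^2}} $$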
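The two features can be sketched as below, assuming that the spectral centroid is the power-weighted mean of the bin index k and that the bandwidth is the square root of the power-weighted variance around the centroid (the square root is an assumption of this sketch). Results are in units of bin index; converting to Hz would require the sampling rate and DFT length.

```python
import math


def spectral_centroid_and_bandwidth(magnitudes):
    """Return (SC, BW) for magnitudes[k - 1] = |X(k)|, k = 1..K."""
    power = [m * m for m in magnitudes]
    total_power = sum(power)
    if total_power == 0.0:
        return 0.0, 0.0  # silent frame: both features are undefined, return zeros
    centroid = sum(k * p for k, p in enumerate(power, start=1)) / total_power
    variance = sum((k - centroid) ** 2 * p for k, p in enumerate(power, start=1)) / total_power
    return centroid, math.sqrt(variance)
```

A spectrum concentrated in a single bin gives zero bandwidth, and energy spread symmetrically around a bin leaves the centroid at that bin.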
2.2 Temporal features
- Characterize the temporal evolution of an audio signal
- Temporal features tend to be more task-specific
- Mid-level representation for feature extraction:
  - power envelope of the signal sampled at a 100 Hz...1 kHz rate
  - or: power envelopes of the signal at 3...40 subbands
- Note: these, too, drop the phase spectrum and all spectral fine structure!
Temporal features
- Musical instrument classification:
  - rise time: time interval between the onset and the instant of maximal amplitude
  - onset asynchrony at different frequencies
  - frequency modulation: amplitude and rate (vibrato: 4-8 Hz)
  - amplitude modulation: amplitude and rate
    - tremolo, roughness (fast amplitude modulation at 15-300 Hz rate)
- General audio classification
  - e.g. amplitude modulation
  - also ΔMFCCs can be seen as temporal features
2.3 Features calculated in the time domain
- Sometimes (computationally) ultra-light features are needed and the Fourier transform is avoided
- Zero-crossing rate
$$ ZCR = \frac{1}{N} \sum_{n=2}^{N} \left| \mathrm{sign}(x(n)) - \mathrm{sign}(x(n-1)) \right| $$
$$ \mathrm{sign}(x) = \begin{cases} 1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0 \end{cases} $$
  - Figure: ZCR correlates strongly with spectral centroid (~ brightness); curves: spectral centroid, zero-crossing rate [Peltonen, MSc thesis, 2001]
- Short-time energy
$$ STE = \frac{1}{N} \sum_{n=1}^{N} x(n)^2 $$
  - lousy feature as such, but different statistics of STE are useful
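Both time-domain features need only a few operations per sample. The sketch below follows the definitions with a three-valued sign function and a sum of absolute sign differences; the function names are illustrative.

```python
def sign(x):
    """Three-valued sign: 1 for x > 0, 0 for x == 0, -1 for x < 0."""
    return (x > 0) - (x < 0)


def zero_crossing_rate(frame):
    """ZCR = (1/N) * sum over n = 2..N of |sign(x(n)) - sign(x(n-1))|."""
    num_samples = len(frame)
    return sum(abs(sign(frame[n]) - sign(frame[n - 1])) for n in range(1, num_samples)) / num_samples


def short_time_energy(frame):
    """STE = (1/N) * sum over n of x(n)^2."""
    return sum(sample * sample for sample in frame) / len(frame)
```

Note that with this definition a full sign change from positive to negative contributes 2 to the sum, so a signal that alternates in sign at every sample reaches a ZCR close to 2.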
More features...
- Table: some features for musical instrument recognition [Eronen, Klapuri, 2000]
- One can measure many things...
3 Feature selection and transformation
(This section is based on lecture notes of Antti Eronen.)
- Let us consider methods to select a set of features from a larger set of available features
- The number of features at the disposal of the designer is usually very large (tens or even hundreds)
- Reasons for reducing the number of features to a sufficient minimum:
  - fewer features → simpler models → less training data needed (remember the "curse of dimensionality")
  - amount of training data: the higher the ratio of the number of training patterns N to the number of free classifier parameters, the better the generalisation properties
  - computational complexity, correlating (redundant) features
Feature selection
- Feature selection (or, reduction) problem:
  - reduce the dimension of the feature vectors while preserving as much class discriminatory information as possible
- Good features result in large between-class distance and small within-class variance
- Possible approaches:
  - examine features individually and discard those with little discrimination ability
  - a better way is to examine the features in combinations
  - linear (or nonlinear) transformation of the feature vector
3.1 Data normalization
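- Features with large values may have a larger influence on the cost function than features with small values (e.g. Euclidean distance)
- For N available data points of the dth feature (d = 1, 2, ..., D):
  mean:
$$ m_d = \frac{1}{N} \sum_{n=1}^{N} x_{nd} $$
  variance:
$$ \sigma_d^2 = \frac{1}{N-1} \sum_{n=1}^{N} (x_{nd} - m_d)^2 $$
  normalized feature:
$$ \hat{x}_{nd} = \frac{x_{nd} - m_d}{\sigma_d} $$
- The resulting normalized features have zero mean and unit variance
  - if desired, feature weighting can be performed separately after normalization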
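The normalization can be sketched as follows: per-feature mean, variance with the 1/(N-1) factor, and division by the standard deviation. The function name and the handling of constant features (left unscaled to avoid division by zero) are assumptions of this sketch. The means and standard deviations estimated from the training data are returned so that they can be reused for new data.

```python
import math


def normalize_features(data):
    """Zero-mean, unit-variance scaling of N feature vectors of dimension D.

    Returns (normalized_data, means, standard_deviations).
    """
    num_points = len(data)
    dim = len(data[0])
    means = [sum(row[d] for row in data) / num_points for d in range(dim)]
    stds = []
    for d in range(dim):
        variance = sum((row[d] - means[d]) ** 2 for row in data) / (num_points - 1)
        # Assumption: a constant feature keeps scale 1.0 instead of dividing by zero.
        stds.append(math.sqrt(variance) if variance > 0.0 else 1.0)
    normalized = [[(row[d] - means[d]) / stds[d] for d in range(dim)] for row in data]
    return normalized, means, stds
```

For example, two features with ranges 1..3 and 10..30 end up on the same scale, so neither dominates a Euclidean distance.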
Mean, variance, correlation
[Figure: scatter plot of two features, axes x1 and x2]
3.2 Principal component analysis
- Idea: find a linear transform A such that when applied to the feature vectors x, the resulting new features y are uncorrelated.
  - i.e. the covariance matrix of the feature vectors y is diagonal.
- Define the covariance matrix of x by (here we use the training data!)
$$ C_x = E\{ (\mathbf{x} - \mathbf{m})(\mathbf{x} - \mathbf{m})^T \} $$
- Because C_x is real and symmetric, it has D orthonormal eigenvectors, and hence it can be diagonalised:
$$ A C_x A^T = \Lambda $$
- Above, A is a D by D matrix whose rows are formed from the eigenvectors of C_x and Λ is a diagonal matrix with the corresponding eigenvalues on its diagonal
- Now the transform can be written as
$$ \mathbf{y} = A(\mathbf{x} - \mathbf{m}) $$
  where m is the mean of the feature vectors x.
Eigenvectors
- Recall the eigenvector equations:
$$ C \mathbf{e}_d = \lambda_d \mathbf{e}_d $$
$$ (C - \lambda_d I) \mathbf{e}_d = 0 $$
  where e_d are the eigenvectors, λ_d are the eigenvalues, and C is a square matrix
- In Matlab: the function eig computes the eigenvectors and eigenvalues
- The transform matrix A in y = A(x - m) consists of the eigenvectors of C_x: A = [e_1, e_2, ..., e_D]^T
  - Since C_x is real and symmetric, A is orthonormal (A A^T = I, A^T = A^{-1})
Transformed features
- The transformed feature vectors y = A(x - m) have zero mean and their covariance matrix (ignoring the mean terms) is
$$ C_y = E\{ A(\mathbf{x} - \mathbf{m}) (A(\mathbf{x} - \mathbf{m}))^T \} = A C_x A^T = \Lambda $$
  i.e. C_y is diagonal and the features are thus uncorrelated.
- Furthermore, let us denote by Λ^{1/2} the diagonal matrix whose elements are the square roots of the eigenvalues of C_x. Then the transformed vectors
$$ \mathbf{y} = \Lambda^{-1/2} A (\mathbf{x} - \mathbf{m}) $$
  have uncorrelated elements with unit variance:
$$ C_y = (\Lambda^{-1/2} A) \, C_x \, (\Lambda^{-1/2} A)^T = \ldots = I $$
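The intermediate steps of the unit-variance result can be written out as follows. The derivation uses only two facts: A C_x A^T = Λ (the diagonalisation of C_x by its eigenvectors) and the symmetry of the diagonal matrix Λ^{-1/2}. It assumes all eigenvalues are strictly positive so that Λ^{-1/2} exists.

```latex
\begin{aligned}
C_y &= (\Lambda^{-1/2} A) \, C_x \, (\Lambda^{-1/2} A)^T \\
    &= \Lambda^{-1/2} \, (A C_x A^T) \, (\Lambda^{-1/2})^T \\
    &= \Lambda^{-1/2} \, \Lambda \, \Lambda^{-1/2} \\
    &= I
\end{aligned}
```

If some eigenvalue is zero or numerically tiny, the corresponding dimension carries no variance and would normally be dropped before scaling.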
Transformed features
Mean removal: y = x - m
Transformed features
- Left: decorrelated features, basis vectors orthogonal: e_1^T e_2 = 0
- Right: variance scaling, basis vectors orthonormal: e_1^T e_1 = e_2^T e_2 = 1
Principal component analysis
- The transformation can be used to reduce the amount of needed data
  - order the eigenvectors so that the first row of A corresponds to the largest eigenvalue and the last row to the smallest eigenvalue
  - take only the first M principal components to create an M by D matrix A_M for the data projection
$$ \hat{\mathbf{x}} = A_M (\mathbf{x} - \mathbf{m}) $$
- We obtain an approximation for x, which essentially is the projection of x onto the subspace spanned by the M orthonormal eigenvectors.
- This projection is optimal in the sense that it minimises the mean square error (MSE)
$$ E\{ \| \mathbf{x} - \hat{\mathbf{x}} \|^2 \} $$
  for any approximation with M components
Practical use of PCA
- PCA is useful for preprocessing features before classification
  - Note: PCA does not take different classes into account; it only considers the properties of different features
- If two features x_i and x_j are redundant, then one eigenvalue in Λ is very small and one dimension can be dropped
  - we do not need to choose between two correlating features: it is better to do the linear transform and then drop the least significant dimension
  - both of the correlating features are utilized
- You need to scale the feature variances before eigenanalysis, since eigenvalues are proportional to the numerical range of the features
  - procedure: 1. normalize, 2. PCA, 3. are there unnecessary dimensions?
- Example of correlating features: ZCR and spectral centroid
3.3 Class separability measures
- PCA does not take different classes into account
  - features remaining after PCA are efficient in characterising sounds but do not necessarily discriminate between different classes
- Now we will look at a set of simple criteria that measure the discriminating properties of feature vectors
- Within-class scatter matrix (K different classes)
$$ S_w = \sum_{k=1}^{K} p(\omega_k) \, C_k $$
  where C_k = E{(x - m_k)(x - m_k)^T} is the covariance matrix for class k and p(ω_k) is the prior probability of class k, i.e., p(ω_k) = N_k / N, where N_k is the number of samples from class k (of total N samples).
- trace{S_w} is the sum of the diagonal elements of S_w and here quantifies the average within-class variance of the features over all classes
Class separability measures
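- Between-class scatter matrix
$$ S_b = \sum_{k=1}^{K} p(\omega_k) (\mathbf{m}_k - \mathbf{m}_0)(\mathbf{m}_k - \mathbf{m}_0)^T $$
  where m_k is the mean of class k and m_0 is the global mean vector
$$ \mathbf{m}_0 = \sum_{k=1}^{K} p(\omega_k) \, \mathbf{m}_k $$
- trace{S_b} is a measure of the average distance of the mean of each class from the global mean value over all classes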
Class separability measures
- Combining the two scatter measures gives the criterion
$$ J_1 = \frac{\mathrm{trace}\{S_b\}}{\mathrm{trace}\{S_w\}} $$
- It obtains large values when
  - samples in the D-dimensional space are well clustered around their mean within each class (small trace{S_w})
  - the clusters of the different classes are well separated (large trace{S_b})
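A sketch of the criterion, assuming it is the ratio trace{S_b} / trace{S_w} with class priors estimated as N_k / N. Only the traces are needed, so the full scatter matrices are never formed: the trace of a covariance matrix is the sum of the per-feature variances, and the trace of (m_k - m_0)(m_k - m_0)^T is the squared distance between the class mean and the global mean. The function name and the input layout (a dict from class label to a list of feature vectors) are choices of this sketch.

```python
def separability_j1(samples_by_class):
    """J1 = trace(S_b) / trace(S_w) for a dict: label -> list of feature vectors."""
    total = sum(len(vectors) for vectors in samples_by_class.values())
    dim = len(next(iter(samples_by_class.values()))[0])
    priors, means = {}, {}
    for label, vectors in samples_by_class.items():
        priors[label] = len(vectors) / total
        means[label] = [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]
    global_mean = [sum(priors[c] * means[c][d] for c in means) for d in range(dim)]

    trace_sw = 0.0
    trace_sb = 0.0
    for label, vectors in samples_by_class.items():
        # Trace of the class covariance (1/N_k normalisation assumed).
        class_variance = sum(
            (v[d] - means[label][d]) ** 2 for v in vectors for d in range(dim)
        ) / len(vectors)
        trace_sw += priors[label] * class_variance
        trace_sb += priors[label] * sum(
            (means[label][d] - global_mean[d]) ** 2 for d in range(dim)
        )
    return trace_sb / trace_sw
```

Moving the class means further apart while keeping the within-class spread fixed increases the returned value, which matches the two conditions listed above.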
3.4 Feature subset selection
- Problem: how to select a subset of M features from the D originally available so that
  1. we reduce the dimensionality of the feature vector (M < D)
  2. we optimize the desired class separability criterion
- To find the optimal subset of features, we should form all possible combinations of M features out of the D originally available
  - the best combination is then selected according to any desired class separability measure J
- In practice, it is not possible to evaluate all the possible feature combinations: there are D! / (M! (D - M)!) of them, e.g. about 8.5 * 10^8 for M = 10 and D = 40
Sequential backward selection (SBS)
Particularly suitable for discarding a few worst features:
1. Choose a class separability criterion J, and calculate its value for the feature vector which consists of all available features (length D)
2. Eliminate one feature, and for each of the possible resulting combinations (of length D - 1) compute J. Select the best.
3. Continue this for the remaining features, and stop when you have obtained the desired dimension M.
This is a suboptimal search procedure, since we cannot guarantee that the optimal (r - 1)-dimensional vector has to originate from the optimal r-dimensional one!
Sequential forward selection (SFS)
Particularly suitable for finding a few "golden" features:
1. Compute the criterion J value for all individual features. Select the best.
2. Form all possible two-dimensional vectors that contain the winner from the previous step. Calculate the criterion for each vector and select the best.
3. Continue adding features one at a time, always taking the one that results in the largest value of the criterion J.
4. Stop when the desired vector dimension M is reached.
Both SBS and SFS suffer from the nesting effect: once a feature is discarded in SBS (selected in SFS), it cannot be reconsidered again (discarded in SFS).
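The forward selection steps can be sketched as a greedy loop. The criterion is passed in as a function so that any class separability measure J can be used; the function and parameter names are illustrative, and ties are resolved in favour of the lowest feature index.

```python
def sequential_forward_selection(num_features, target_dim, criterion):
    """Greedy SFS over feature indices 0..num_features-1.

    criterion(subset) takes a tuple of feature indices and returns a score
    where larger means better class separability.
    """
    selected = []
    remaining = list(range(num_features))
    while len(selected) < target_dim and remaining:
        # Try adding each remaining feature to the current set; keep the best one.
        best = max(remaining, key=lambda feature: criterion(tuple(selected + [feature])))
        selected.append(best)
        remaining.remove(best)
    return selected
```

The nesting effect is visible in the code: once a feature has been appended to `selected`, it is never removed, even if a later combination would do better without it. The search evaluates on the order of M * D subsets instead of all combinations.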
3.5 Linear transforms
- Feature generation by combining all the D features to obtain M (with M < D) features
- Find a linear transform A (an M by D matrix) that transforms the original feature vectors x to M-dimensional feature vectors y such that
$$ \mathbf{y} = A \mathbf{x} $$
- Objectives:
  1. reduce the dimensionality of the feature vector
  2. optimize the desired class separability criterion
- Note the similarity with PCA; the difference is that here we consider the class separability
Linear discriminant analysis (LDA)
- Choose A to optimize
$$ J_2 = \mathrm{trace}\{ S_w^{-1} S_b \} $$
- Very similar to PCA. The difference is that here the eigenanalysis is performed for the matrix S_w^{-1} S_b instead of the global covariance.
- The rows of the transform matrix are obtained by choosing the M eigenvectors of S_w^{-1} S_b that correspond to the largest eigenvalues
- Leads to a feature vector dimension M ≤ K - 1, where K is the number of classes
  - this is because the rank of S_w^{-1} S_b is at most K - 1
4 Classification methods
- Goal is to classify previously unseen instances after the classifier has learned the training data
- Supervised vs. unsupervised classification
  - supervised: the classes of training instances are told during learning
  - unsupervised: clustering into hitherto unknown classes
- Supervised classification is the focus here
- Example on the next page
  - data is represented as M-dimensional feature vectors (here M = 2)
  - 18 different classes
  - several training samples from each class
Example
[Figure: training samples in the two-dimensional feature space, with a new instance to classify marked]
4.1 Classification by distance functions
- Minimum distance classification
  - calculate the distance D(x, y_i) between the unknown sample x and all the training samples y_i from all classes
  - for example the Euclidean distance
$$ D_E(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^T (\mathbf{x} - \mathbf{y})} $$
  - choose the class according to the closest training sample
- k-nearest neighbour (k-NN) classifier
  - pick the k nearest neighbours to x and then choose the class which was most often picked
- These are lazy classifiers
  - training is trivial: just store the training samples y_i
  - classification gets complex with a lot of training data
    - must measure the distance to all training samples
- Computational efficiency can be improved by storing only a sufficient number of class prototypes for each class
  - or by using efficient indexing techniques (locality sensitive hashing)
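A brute-force k-NN classifier with the Euclidean distance can be sketched in a few lines. The squared distance is used because it gives the same neighbour ordering as the distance itself. Names and the data layout (a list of `(feature_vector, label)` pairs) are choices of this sketch; with tied votes, the class of the nearer neighbour wins here, which is an arbitrary tie-breaking rule.

```python
from collections import Counter


def squared_euclidean(x, y):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def knn_classify(x, training_samples, k=1):
    """Classify x by majority vote among its k nearest training samples.

    training_samples is a list of (feature_vector, label) pairs.
    """
    nearest = sorted(training_samples, key=lambda sample: squared_euclidean(x, sample[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

With k = 1 this reduces to minimum distance classification. Every call measures the distance to all stored samples, which is the cost that class prototypes or indexing techniques aim to reduce.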
Distance metrics
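- Choice of the distance metric is very important
- Euclidean distance metric:
$$ D_E(\mathbf{x}, \mathbf{y}) = \sqrt{(\mathbf{x} - \mathbf{y})^T (\mathbf{x} - \mathbf{y})} = \sqrt{\sum_d (x_d - y_d)^2} $$
  - sqrt is order-preserving, thus it is equivalent to minimize
$$ D_E^2(\mathbf{x}, \mathbf{y}) = (\mathbf{x} - \mathbf{y})^T (\mathbf{x} - \mathbf{y}) $$
- Mahalanobis distance between x and y
$$ D_M(\mathbf{x}, \mathbf{y}) = (\mathbf{x} - \mathbf{y})^T C^{-1} (\mathbf{x} - \mathbf{y}) $$
  where C is the covariance matrix of the training data.
- Mahalanobis distance D_M is generally a good choice
- Comment: k-NN is a decent classifier, yet very easy to implement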
Classification by distance functions
- Example: decision boundaries for the 1-NN classifier
  - Left: Euclidean distance, axes not scaled → undesired weighting (elongated horizontal contours)
  - Right: Euclidean distance, axes scaled
  [Peltonen, MSc thesis, 2001]
Classification by distance functions
- Example: decision boundaries for the 1-NN classifier
  - Left: Euclidean distance, axes scaled
  - Right: Mahalanobis distance, which scales the axes and rotates the feature space to decorrelate the features
  [Peltonen, MSc thesis, 2001]
4.2 Statistical classification
- Idea: interpret the feature vector x as a random variable whose distribution depends on the class
- Bayes formula is a very important tool:
$$ P(\omega_i \mid \mathbf{x}) = \frac{P(\mathbf{x} \mid \omega_i) \, P(\omega_i)}{P(\mathbf{x})} $$
  where P(ω_i|x) denotes the probability of class ω_i given an observed feature vector x
- Note: we do not know P(ω_i|x), but P(x|ω_i) can be estimated from the training data
- Moreover, since P(x) is the same for all classes, we can perform classification based on (MAP classifier):
$$ \hat{\imath} = \arg\max_i P(\mathbf{x} \mid \omega_i) \, P(\omega_i) $$
- A remaining task is to parametrize and learn P(x|ω_i) and to define (or learn) the prior probabilities P(ω_i)
Gaussian mixture model (GMM)
- GMMs are very convenient for representing P(x|ω_i), i.e., the multidimensional distribution of features for the class ω_i
- Weighted sum of multidimensional Gaussian distributions
$$ P(\mathbf{x} \mid \omega_i) = \sum_{q=1}^{Q} w_{i,q} \, N(\mathbf{x}; \boldsymbol{\mu}_{i,q}, \Sigma_{i,q}) $$
  where w_{i,q} are weights, N(x; μ, Σ) is a Gaussian distribution, and Σ_q w_{i,q} = 1
- Figure: GMM in one dimension [Heittola, MSc thesis, 2004]
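Gaussian mixture model (GMM)
- GMMs can fit any distribution given a sufficient number of Gaussians
  - successful generalization requires limiting the number of components
$$ P(\mathbf{x} \mid \omega_i) = \sum_{q=1}^{Q} w_{i,q} \, N(\mathbf{x}; \boldsymbol{\mu}_{i,q}, \Sigma_{i,q}) $$
$$ N(\mathbf{x}; \boldsymbol{\mu}, \Sigma) = \frac{1}{(2\pi)^{D/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{x} - \boldsymbol{\mu})^T \Sigma^{-1} (\mathbf{x} - \boldsymbol{\mu}) \right) $$
- GMM is a parametric model
  - parameters: weights w_{i,q}, means μ_{i,q}, covariance matrices Σ_{i,q}
  - diagonal covariance matrices can be used if the features are decorrelated → much fewer parameters
- Parameters of the GMM of each class are estimated to maximize P(x|ω_i), i.e., the probability of the training data for each class ω_i
  - iterative EM algorithm
- Toolboxes: insert training data → get the model → use it to classify new instances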
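Evaluating a trained GMM is straightforward when the covariance matrices are diagonal, because each Gaussian then factorises over the feature dimensions. The sketch below only evaluates the density for given parameters; it does not implement EM training. Function names and the parameter layout (lists of weights, mean vectors and variance vectors) are assumptions of this sketch.

```python
import math


def diag_gaussian_pdf(x, mean, variances):
    """Multivariate Gaussian density with a diagonal covariance matrix."""
    log_density = 0.0
    for x_d, mean_d, var_d in zip(x, mean, variances):
        log_density += -0.5 * (math.log(2.0 * math.pi * var_d) + (x_d - mean_d) ** 2 / var_d)
    return math.exp(log_density)


def gmm_pdf(x, weights, means, variances):
    """Weighted sum of diagonal-covariance Gaussians; weights should sum to 1."""
    return sum(
        weight * diag_gaussian_pdf(x, mean, var)
        for weight, mean, var in zip(weights, means, variances)
    )
```

For example, `gmm_pdf([0.0], [0.3, 0.7], [[0.0], [0.0]], [[1.0], [4.0]])` mixes a narrow and a wide Gaussian centred at the same point. One such model per class gives the class-conditional densities needed by the MAP classifier.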
MAP classification
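- After having learned P(x|ω_i) and P(ω_i) for each class, maximum a posteriori classification of a new instance x:
$$ \hat{\imath} = \arg\max_i P(\mathbf{x} \mid \omega_i) \, P(\omega_i) $$
- Usually we have a sequence of feature vectors x_{1:T}, and then
$$ \hat{\imath} = \arg\max_i P(\omega_i) \prod_{t=1}^{T} P(\mathbf{x}_t \mid \omega_i) $$
  (assuming that the feature vectors are i.i.d. given the class)
- Since the logarithm is order-preserving, we can equivalently maximize
$$ \hat{\imath} = \arg\max_i \log \Big[ P(\omega_i) \prod_{t=1}^{T} P(\mathbf{x}_t \mid \omega_i) \Big] = \arg\max_i \Big[ \log P(\omega_i) + \sum_{t=1}^{T} \log P(\mathbf{x}_t \mid \omega_i) \Big] $$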
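The log-domain MAP rule for a sequence can be sketched as below. To keep the example short, each class is modelled with a single one-dimensional Gaussian instead of a GMM; that simplification, the function names and the dictionary layout are assumptions of this sketch. Summing logarithms avoids the numerical underflow that a product of many small likelihoods would cause.

```python
import math


def gaussian_log_pdf(x, mean, variance):
    """Log density of a one-dimensional Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * variance) + (x - mean) ** 2 / variance)


def map_classify(sequence, class_models, priors):
    """Return the class maximising log P(class) + sum_t log P(x_t | class).

    class_models: label -> (mean, variance); priors: label -> P(class).
    """
    def log_posterior_score(label):
        mean, variance = class_models[label]
        log_likelihood = sum(gaussian_log_pdf(x, mean, variance) for x in sequence)
        return math.log(priors[label]) + log_likelihood

    return max(class_models, key=log_posterior_score)
```

When the likelihoods of two classes are equal, the prior decides; with long sequences the likelihood term usually dominates the prior.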
Hidden Markov model (HMM)
- HMMs account for the time-varying nature of sounds
  - probability of observation x in state j: typically a GMM, b_j(x)
  - transition probabilities between states: matrix A with a_ij = P(q_t = i | q_{t-1} = j)
  - the state variable is hidden, and usually not of interest in classification
- Train a separate HMM for each class
- HMMs allow computing
$$ P(\omega_i \mid \mathbf{x}_{1:T}) = \frac{P(\mathbf{x}_{1:T} \mid \omega_i) \, P(\omega_i)}{P(\mathbf{x}_{1:T})} $$
  where x_{1:T} is an observation sequence
- Figure: [Heittola, MSc thesis, 2004]
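The likelihood of an observation sequence under one class's HMM is commonly computed with the forward algorithm, which the slides do not spell out; the sketch below is one standard way to do it. It keeps the index convention a_ij = P(q_t = i | q_{t-1} = j), so `transition[i][j]` is the probability of moving from state j to state i. A single one-dimensional Gaussian per state replaces the GMM for brevity, and all names are illustrative. Practical implementations scale the forward variables or work in the log domain to avoid underflow on long sequences.

```python
import math


def gaussian_pdf(x, mean, variance):
    """Density of a one-dimensional Gaussian."""
    return math.exp(-0.5 * (x - mean) ** 2 / variance) / math.sqrt(2.0 * math.pi * variance)


def hmm_likelihood(observations, initial, transition, state_models):
    """P(x_1:T | model) computed with the forward algorithm.

    initial[i]       = P(q_1 = i)
    transition[i][j] = P(q_t = i | q_{t-1} = j)
    state_models[i]  = (mean, variance) of the observation density of state i
    """
    num_states = len(initial)
    # Initialisation: alpha_1(i) = P(q_1 = i) * b_i(x_1)
    alpha = [
        initial[i] * gaussian_pdf(observations[0], *state_models[i])
        for i in range(num_states)
    ]
    # Recursion: alpha_t(i) = b_i(x_t) * sum_j a_ij * alpha_{t-1}(j)
    for x in observations[1:]:
        alpha = [
            gaussian_pdf(x, *state_models[i])
            * sum(transition[i][j] * alpha[j] for j in range(num_states))
            for i in range(num_states)
        ]
    return sum(alpha)
```

With one state and `transition = [[1.0]]`, the result is simply the product of the per-frame densities, i.e. the i.i.d. model used with GMMs.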
Hidden Markov model (HMM)
- Hidden Markov models are a generalization of GMMs
  - GMM = HMM with only one state (when using GMMs as state-conditional observation distributions)
- HMMs do not necessarily perform better than GMMs
  - HMMs are useful only if some temporal information is truly learned into the state transition probability matrix, e.g. speech
  - Example: genre classification, general audio classification
- In speech recognition, HMMs are a key technology (the sequential order of phonemes is crucial)
- More about HMMs in the speech recognition lectures