2000/05/03 1
Speaker Identification using Gaussian Mixture Model
Presented by CWJ
Reference
D. A. Reynolds and R. C. Rose, “Robust Text-Independent Speaker
Identification Using Gaussian Mixture Speaker Models”, IEEE Trans.
on Speech and Audio Processing, vol. 3, no. 1, pp. 72-83, January 1995.
Outline
1. Introduction to Speaker Recognition
2. Gaussian Mixture Speaker Model (GMM)
3. Experimental Evaluation
Introduction to Speaker Recognition
A. Some definitions of speaker recognition (S.R.)
1. Two tasks of speaker recognition
-- Speaker identification (this paper)
e.g. voice mail labeling
-- Speaker verification
e.g. financial transactions
2. Two forms of spoken input
-- Text-dependent
-- Text-independent (this paper)
3. System Range
-- Closed Set (this paper)
-- Open Set
B. Several Methods used in Speaker Recognition
[Figure: timeline from 1985 to 1995 of the methods used in speaker recognition: VQ, NN, HMM, and GMM]
1. Long-term averages of acoustic features (spectrum, pitch, …): the earliest approach
Idea : average out the factors that cause intra-speaker variation, leaving only
the speaker-dependent component.
Drawback : requires long speech utterances (> 20 s)
2. Training a speaker-dependent (SD) model for each speaker
-- Explicit segmentation : HMM
-- Implicit segmentation : VQ, GMM
HMM:
Advantage : Text-independent
Drawback : a significant increase in
computational complexity
VQ:
Advantage : unsupervised clustering
Drawback : Text-dependent
3. Discriminative neural networks (NN)
※ Model the decision function that best discriminates between speakers
Advantage : fewer parameters and higher performance compared to the VQ model
Drawback : the network must be retrained when a new speaker is added to the system.
GMM :
Advantage : Text-Independent
probabilistic framework (robust)
computationally efficient
easy to implement
The Gaussian mixture model (GMM)
A. Model Interpretations
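[Figure: speech recognition, where the GMM is used at the state level]
[Figure: speaker recognition, where the GMM for speaker k has one component per acoustic class, with parameters $(\mu_1, \Sigma_1, p_1), (\mu_2, \Sigma_2, p_2), \ldots, (\mu_i, \Sigma_i, p_i), \ldots$]
1. Each Gaussian component models an acoustic class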
2. A GMM can closely approximate arbitrarily shaped densities.
B. Signal Analysis
C. Model Description
Gaussian Mixture Density
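$$p(x \mid \lambda) = \sum_{i=1}^{M} p_i \, b_i(x)$$
where x is a D-dimensional random vector and
$$b_i(x) = \frac{1}{(2\pi)^{D/2} \, |\Sigma_i|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_i)' \, \Sigma_i^{-1} (x - \mu_i) \right\}$$
$$\lambda = \{ p_i, \mu_i, \Sigma_i \}, \quad i = 1, \ldots, M$$
Covariance matrix types : nodal, grand, global
This paper : nodal, diagonal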
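The mixture density above can be evaluated numerically as in the following minimal sketch. It assumes the nodal, diagonal covariance case, so each component stores a list of variances rather than a full matrix. The function names and the two-component example model are hypothetical, chosen only for illustration, and the log-sum-exp step is a common numerical safeguard rather than something stated on the slides.

```python
import math

def log_gaussian_diag(x, mean, var):
    """Log of b_i(x) for one Gaussian component with diagonal covariance."""
    d = len(x)
    log_det = sum(math.log(v) for v in var)  # log |Sigma_i| for a diagonal matrix
    mahalanobis = sum((xj - mj) ** 2 / vj for xj, mj, vj in zip(x, mean, var))
    return -0.5 * (d * math.log(2.0 * math.pi) + log_det + mahalanobis)

def gmm_log_density(x, weights, means, variances):
    """log p(x | lambda) = log sum_i p_i b_i(x), computed with log-sum-exp."""
    terms = [math.log(p) + log_gaussian_diag(x, m, v)
             for p, m, v in zip(weights, means, variances)]
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

# Hypothetical two-component model in D = 2 dimensions (illustrative values only).
weights = [0.4, 0.6]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]
print(gmm_log_density([0.5, -0.2], weights, means, variances))
```

Working in the log domain avoids underflow when D is large or when a vector lies far from every component mean.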
D. ML Parameter Estimation
Steps:
1. Begin with an initial model $\lambda$.
2. Estimate a new model $\bar{\lambda}$ such that $p(X \mid \bar{\lambda}) \ge p(X \mid \lambda)$.
3. Repeat step 2 until convergence is reached.
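Mixture Weights
$$\bar{p}_i = \frac{1}{T} \sum_{t=1}^{T} p(i \mid x_t, \lambda)$$
Means
$$\bar{\mu}_i = \frac{\sum_{t=1}^{T} p(i \mid x_t, \lambda) \, x_t}{\sum_{t=1}^{T} p(i \mid x_t, \lambda)}$$
Variances
$$\bar{\sigma}_i^2 = \frac{\sum_{t=1}^{T} p(i \mid x_t, \lambda) \, x_t^2}{\sum_{t=1}^{T} p(i \mid x_t, \lambda)} - \bar{\mu}_i^2$$
A posteriori probability of acoustic class i
$$p(i \mid x_t, \lambda) = \frac{p_i \, b_i(x_t)}{\sum_{k=1}^{M} p_k \, b_k(x_t)}$$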
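One pass of the re-estimation formulas can be sketched as follows. This is a minimal illustration for diagonal covariances using plain Python lists. The function name `em_step` and the toy one-dimensional data are hypothetical, and the sketch assumes every component keeps a nonzero total posterior (it does no variance limiting or empty-component handling).

```python
import math

def em_step(data, weights, means, variances):
    """One re-estimation pass for a diagonal-covariance GMM."""
    M, D, T = len(weights), len(data[0]), len(data)
    # A posteriori probabilities p(i | x_t, lambda) for every frame t and component i.
    post = []
    for x in data:
        logs = []
        for p, m, v in zip(weights, means, variances):
            ll = math.log(p)
            for xj, mj, vj in zip(x, m, v):
                ll -= 0.5 * (math.log(2 * math.pi * vj) + (xj - mj) ** 2 / vj)
            logs.append(ll)
        peak = max(logs)
        unnorm = [math.exp(l - peak) for l in logs]
        total = sum(unnorm)
        post.append([u / total for u in unnorm])
    # New weights, means, and variances from the posterior-weighted sums.
    new_w, new_m, new_v = [], [], []
    for i in range(M):
        occ = sum(post[t][i] for t in range(T))  # assumed to be nonzero
        mean_i = [sum(post[t][i] * data[t][j] for t in range(T)) / occ
                  for j in range(D)]
        var_i = [sum(post[t][i] * data[t][j] ** 2 for t in range(T)) / occ
                 - mean_i[j] ** 2 for j in range(D)]
        new_w.append(occ / T)
        new_m.append(mean_i)
        new_v.append(var_i)
    return new_w, new_m, new_v

# Hypothetical 1-D data with two clusters, and a rough initial model.
data = [[0.1], [-0.2], [0.0], [4.9], [5.2], [5.0]]
w, m, v = [0.5, 0.5], [[1.0], [4.0]], [[1.0], [1.0]]
for _ in range(10):
    w, m, v = em_step(data, w, m, v)
print(w, m, v)
```

In practice the loop would stop on a convergence test of the likelihood instead of a fixed iteration count.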
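E. Speaker Identification
A group of speakers S = {1, 2, …, S} is represented by
GMMs $\lambda_1, \lambda_2, \ldots, \lambda_S$.
$$\hat{S} = \arg\max_{1 \le k \le S} \Pr(\lambda_k \mid X) = \arg\max_{1 \le k \le S} \frac{p(X \mid \lambda_k) \Pr(\lambda_k)}{p(X)}$$
Assuming equally likely speakers, and noting that p(X) is the same for every speaker model:
$$\hat{S} = \arg\max_{1 \le k \le S} p(X \mid \lambda_k)$$
Taking the log, with independent observations:
$$\hat{S} = \arg\max_{1 \le k \le S} \sum_{t=1}^{T} \log p(x_t \mid \lambda_k)$$
in which
$$p(x_t \mid \lambda_k) = \sum_{i=1}^{M} p_i \, b_i(x_t)$$
is evaluated with the parameters of $\lambda_k$.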
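The decision rule can be illustrated with a short sketch that sums frame log-likelihoods for each speaker model and picks the largest. The two speaker models, their names, and the test frames below are hypothetical values made up for the example. Each model is assumed to be a tuple of (weights, means, variances) with diagonal covariances.

```python
import math

def frame_log_likelihood(x, model):
    """log p(x_t | lambda_k) for a diagonal-covariance GMM."""
    weights, means, variances = model
    terms = []
    for p, m, v in zip(weights, means, variances):
        ll = math.log(p)
        for xj, mj, vj in zip(x, m, v):
            ll -= 0.5 * (math.log(2 * math.pi * vj) + (xj - mj) ** 2 / vj)
        terms.append(ll)
    peak = max(terms)
    return peak + math.log(sum(math.exp(t - peak) for t in terms))

def identify_speaker(frames, models):
    """Return the model key maximizing sum_t log p(x_t | lambda_k), plus all scores."""
    scores = {k: sum(frame_log_likelihood(x, lam) for x in frames)
              for k, lam in models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical 1-D speaker models (illustrative values only).
models = {
    "speaker_a": ([0.5, 0.5], [[0.0], [2.0]], [[1.0], [1.0]]),
    "speaker_b": ([0.5, 0.5], [[5.0], [7.0]], [[1.0], [1.0]]),
}
best, scores = identify_speaker([[0.3], [1.8], [0.9]], models)
print(best, scores)
```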
Experimental Evaluation
A. Performance Evaluation
$$\underbrace{x_1, x_2, \ldots, x_T}_{\text{Segment 1}}, x_{T+1}, x_{T+2}, \ldots$$
e.g. frame rate = 10 ms, T = 500
the length of a test utterance = 5 seconds
$$x_1, \underbrace{x_2, \ldots, x_T, x_{T+1}}_{\text{Segment 2}}, x_{T+2}, \ldots$$
$$\%\ \text{correct identification} = \frac{\#\ \text{of correctly identified segments}}{\text{total}\ \#\ \text{of segments}} \times 100$$
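The segment-based scoring can be sketched as below. The sketch assumes successive segments are shifted by one feature vector, as in the Segment 1 and Segment 2 illustration, and that each segment yields one identification decision. The function names and the toy inputs are hypothetical.

```python
def overlapping_segments(frames, T):
    """Segment 1 = frames[0:T], segment 2 = frames[1:T+1], and so on."""
    return [frames[start:start + T] for start in range(len(frames) - T + 1)]

def percent_correct(decisions, true_speaker):
    """Percentage of segments whose decision matches the true speaker."""
    correct = sum(1 for d in decisions if d == true_speaker)
    return 100.0 * correct / len(decisions)

# Toy example: 5 "frames" and segments of length T = 3.
print(overlapping_segments([0, 1, 2, 3, 4], 3))
# Hypothetical per-segment decisions for an utterance spoken by speaker "a".
print(percent_correct(["a", "a", "b", "a"], "a"))
```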
C. Algorithmic Issues
1. Model Initialization :
-- Use the means of speaker-independent (SI), context-dependent subword HMMs
together with their global variance.
-- Randomly choose 50 vectors as the initial model means, and use an
identity matrix as the starting covariance matrix.
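The second initialization option might look like the following sketch. It assumes the 50 vectors are drawn from the speaker's training frames and that the initial mixture weights are uniform. Neither detail is stated on the slide, so both are labeled as assumptions in the code. The function name `init_gmm` and the synthetic frames are hypothetical.

```python
import random

def init_gmm(frames, order=50, seed=0):
    """Random initialization: sampled means, identity covariance, uniform weights."""
    rng = random.Random(seed)
    # Assumption: the initial means are sampled from the training frames.
    means = [list(x) for x in rng.sample(frames, order)]
    dim = len(frames[0])
    # Identity covariance matrix, stored as a diagonal of ones per component.
    variances = [[1.0] * dim for _ in range(order)]
    # Assumption: uniform initial mixture weights.
    weights = [1.0 / order] * order
    return weights, means, variances

# Synthetic 2-D "training frames" for illustration only.
frames = [[float(i), float(-i)] for i in range(200)]
weights, means, variances = init_gmm(frames)
print(len(means), variances[0], weights[0])
```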
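2. Variance Limiting :
When training a nodal variance GMM, the variance elements can become
very small in magnitude, so the following constraint is applied:
$$\bar{\sigma}_i^2 = \begin{cases} \sigma_i^2 & \text{if } \sigma_i^2 > \sigma_{\min}^2 \\ \sigma_{\min}^2 & \text{if } \sigma_i^2 \le \sigma_{\min}^2 \end{cases}$$
The minimum variance, $\sigma_{\min}^2$, is determined empirically.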
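The constraint amounts to flooring each variance element after re-estimation, as in this minimal sketch. The function name and the value 0.01 for the minimum variance are hypothetical; the slides only say the minimum is chosen empirically.

```python
def limit_variances(variances, var_min):
    """Replace every variance element at or below var_min with var_min."""
    return [[v if v > var_min else var_min for v in component]
            for component in variances]

# Hypothetical re-estimated variances for two components in D = 2 dimensions.
estimated = [[0.5, 1e-6], [0.02, 0.3]]
print(limit_variances(estimated, 0.01))
```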
3. Model Order :
I. Performance versus model order.
Model orders : M = 1, 2, 4, 8, 16, 32, 64
II. Performance for different amounts of training data and model orders
III. Performance versus model order for models trained with 30, 60, and 90 s of
speech.
4. Spectral Variability Compensation :
1) Frequency Warping :
$$f' = \frac{f_{\max} - f_{\min}}{f_N} \, f + f_{\min}$$
$f_N$ : original Nyquist frequency
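The warping is a linear map from the original range [0, f_N] onto [f_min, f_max], which a few lines of code make concrete. The band edges of 300 Hz and 3300 Hz and the 4000 Hz Nyquist frequency are hypothetical values chosen for the example, not figures taken from the slides.

```python
def warp_frequency(f, f_min, f_max, f_nyquist):
    """Linear warp f' = (f_max - f_min) / f_N * f + f_min."""
    return (f_max - f_min) / f_nyquist * f + f_min

# Hypothetical band edges in Hz (illustrative values only).
F_MIN, F_MAX, F_NYQUIST = 300.0, 3300.0, 4000.0
for f in (0.0, 2000.0, 4000.0):
    print(f, warp_frequency(f, F_MIN, F_MAX, F_NYQUIST))
```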
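2) Spectral Shape Compensation :
Assumption : Speaker → Channel → Signal Processing
[Figure: channel frequency response versus frequency f]
Mel-cepstral feature vector (z : observed vector, x : speech vector, h : channel filter vector) :
$$z = x + h$$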
‧Mean normalization for a time-invariant (T.I.) channel filter (CMS)
$$m = \frac{1}{T} \sum_{t=1}^{T} z_t$$
$$z_t^{comp} = z_t - m$$
‧Use a “channel invariant” feature (delta-cepstral)
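Both compensation ideas can be checked on toy data: under the additive model z = x + h with a fixed h, subtracting the mean removes h, and so does differencing adjacent frames. The sketch below assumes a simple first difference as the delta feature, which is a simplification chosen for illustration and not necessarily the delta-cepstral computation used in the paper. All names and numbers are hypothetical.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-dimension mean m from every feature vector z_t."""
    T, D = len(frames), len(frames[0])
    m = [sum(z[j] for z in frames) / T for j in range(D)]
    return [[z[j] - m[j] for j in range(D)] for z in frames]

def first_difference(frames):
    """Simplified delta features z_{t+1} - z_t (an assumption for illustration)."""
    return [[b - a for a, b in zip(prev, nxt)]
            for prev, nxt in zip(frames, frames[1:])]

# Toy "clean" vectors x_t and a fixed channel vector h (illustrative values only).
clean = [[1.0, 2.0], [3.0, 4.0], [5.0, 9.0]]
h = [0.5, -1.0]
observed = [[xj + hj for xj, hj in zip(x, h)] for x in clean]

print(cepstral_mean_subtraction(observed))
print(first_difference(observed))
```

Mean subtraction also removes the mean of the speech itself, so the compensated vectors match those of the clean data after its own mean is removed, not the clean data as given.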
5. Large Population Performance :