CHAPTER-5
SUBSEGMENTAL, SEGMENTAL AND SUPRASEGMENTAL FEATURES FOR SPEAKER RECOGNITION USING ERGODIC HIDDEN MARKOV MODEL
In the previous chapter, we discussed source features for speaker recognition using the GMM. In this chapter, we illustrate the effectiveness of the HMM in capturing the complete source features extracted at the subsegmental, segmental and suprasegmental levels from the LP residual, the HE of the LP residual, the RP of the LP residual, and the fusion of the HE and RP of the LP residual, and their use in speaker recognition. The main objective of this chapter is to implement a speaker recognition system using the ergodic HMM. We begin with a brief discussion of the HMM. This chapter is organized as follows: the analysis of the HMM is presented in Section 5.1. Section 5.2 introduces the extraction of features from the subsegmental, segmental and suprasegmental levels of processing of the LP residual signal. Section 5.3 explains the Viterbi algorithm and its application to speaker recognition. Section 5.4 describes the database used for the experimental study. Section 5.5 deals with residual-features-based speaker recognition using the continuous ergodic HMM. Section 5.6 deals with HE-features-based and RP-features-based speaker recognition using the continuous ergodic HMM. Section 5.7 demonstrates the improved speaker recognition system obtained by combining the HE and RP features from each level using the ergodic HMM. Section 5.8 gives a comparative study of speaker recognition for the residual features, the HE of the LP residual, the RP of the LP residual and the fusion of the HE and RP features using the ergodic HMM. Section 5.9 summarizes the chapter.
5.1 SPEAKER RECOGNITION USING HIDDEN MARKOV MODEL (HMM)
The hidden Markov model (HMM) is a widely used statistical method for characterizing the temporal properties of the time-varying frames of a pattern. Using an HMM, the parameters of a stochastic process can be estimated in a precise and well-defined manner. HMMs are suitable for modeling speech patterns, since speech can be characterized as a parametric random process. The HMM can absorb durational variations and capture the temporal sequencing among sounds. Hence, HMM-based systems are well suited to speech recognition applications.
5.1.1. Hidden Markov Model (HMM)
HMMs are similar to finite state diagrams, except that the states in an HMM are hidden. Each transition in the state diagram of an HMM has a transition probability associated with it. These transition probabilities are denoted by the matrix A, defined as A = {a_ij}, where a_ij = P(i_{t+1} = j | i_t = i) is the probability of being in state j at time t+1, given that we were in state i at time t. It is assumed that a_ij is independent of time.
Each state is associated with a set of continuous observations, where each set has a continuous observation probability density. These observation probabilities are denoted by the parameter B, defined as B = {b_j(k)}, where b_j(k) = P(v_k at t | i_t = j) is the probability of observing the symbol v_k given that we are in state j. The initial state probabilities are denoted by the vector π, defined as π = {π_i}, where π_i = P(i_1 = i) is the probability of being in state i at t = 1. Using the three parameters A, B and π, an HMM can be compactly denoted as λ = (A, B, π).
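As a concrete illustration of this compact notation, the following Python sketch builds the three parameter containers of λ = (A, B, π) for a small continuous HMM with Gaussian-mixture emissions. The state count, feature dimension and mixture sizes are assumptions chosen for the example, not values from the thesis.

    import numpy as np

    N = 3                                  # number of hidden states (assumed)
    A = np.full((N, N), 1.0 / N)           # transition probabilities a_ij
    pi = np.full(N, 1.0 / N)               # initial-state probabilities pi_i

    # For a continuous HMM, B holds a per-state emission density; here each
    # state stores the weights, means and diagonal variances of a small GMM.
    D, M = 12, 2                           # feature dimension, mixtures per state
    B = [{"w": np.full(M, 1.0 / M),
          "mu": np.zeros((M, D)),
          "var": np.ones((M, D))}
         for _ in range(N)]

    model = {"A": A, "B": B, "pi": pi}     # lambda = (A, B, pi)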
There are three fundamental problems associated with HMMs [105]: (i) computing the likelihood of a given observation sequence for a particular HMM, (ii) determining the best HMM state sequence associated with a given observation sequence, and (iii) estimating the parameters of the HMM that maximize the likelihood of a given observation sequence. Problem (i) is useful in the speaker recognition phase: for a given parameter sequence (observation sequence) derived from the test speech utterance, the likelihood value of each HMM is computed using the forward procedure [88]. Here, one HMM corresponds to one speaker. The speaker associated with the HMM for which the likelihood is maximum is identified as the recognized speaker for the input speech utterance. Problem (iii) is associated with training the HMM for a given speech unit. The parameters of the HMM, λ, are iteratively refined for maximum-likelihood estimation using the Baum-Welch algorithm [106]. Parameter estimation for HMMs in which each state is associated with a mixture of multivariate densities is demonstrated in [103]. The Viterbi algorithm [107, 108] is used to solve problem (ii), as it is computationally efficient. The objective of an HMM-based speaker recognition system is to accurately estimate the parameters of the HMM from a training dataset.
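To make problem (i) concrete, here is a minimal sketch of the forward procedure for computing log P(O | λ). For brevity it assumes discrete observation symbols (the thesis uses continuous Gaussian-mixture emissions), and the standard scaling trick is used to avoid numerical underflow.

    import numpy as np

    def forward_loglik(A, B, pi, obs):
        """A: (N, N) transitions, B: (N, K) symbol probabilities,
        pi: (N,) initial probabilities, obs: list of symbol indices."""
        alpha = pi * B[:, obs[0]]              # initialization, t = 1
        c = alpha.sum(); alpha /= c            # scale factor c_1
        log_lik = np.log(c)
        for o_t in obs[1:]:
            alpha = (alpha @ A) * B[:, o_t]    # induction step
            c = alpha.sum(); alpha /= c
            log_lik += np.log(c)               # log P(O | lambda) = sum_t log c_t
        return log_lik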
5.1.2. Left-Right HMM
HMMs are characterized by their transition matrix A = {a_ij}. The defining property of a left-right model is a_ij = 0 for j < i; i.e., no transition is allowed to a state whose index is less than that of the current state, as shown in Fig. 5.1. Further, the initial state probabilities exhibit the following property:

π_i = 1 for i = 1, and π_i = 0 for i ≠ 1
For a three-state left-right model, the state transition matrix is given as

A = {a_ij} =  | a11  a12  a13 |
              |  0   a22  a23 |
              |  0    0   a33 |
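A small numeric illustration (with assumed values) of such a left-right matrix, showing the zero structure below the diagonal:

    import numpy as np

    # 3-state left-right transition matrix: a_ij = 0 for j < i
    A_left_right = np.array([
        [0.6, 0.4, 0.0],   # state 1 -> {1, 2}
        [0.0, 0.7, 0.3],   # state 2 -> {2, 3}
        [0.0, 0.0, 1.0],   # state 3 is absorbing
    ])
    assert np.allclose(A_left_right.sum(axis=1), 1.0)  # each row is a distribution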
Continuous HMMs can capture speaker-specific features more effectively than discrete HMMs [109]. Continuous left-right HMM based speaker recognition systems can capture only the underlying pattern in the temporal sequence of sounds [110], which gives good recognition performance for text-dependent speaker recognition systems. For text-independent speaker recognition systems, however, where the time-varying text information is completely absent, the continuous left-right HMM may not give good speaker recognition performance.
Fig. 5.1: A Three State Continuous Left-Right HMM.
5.1.3. Continuous Ergodic HMM: The Desirable Statistical Model for Speaker Recognition
The structure of an ergodic model is defined by its transition matrix A = {a_ij}. An ergodic model is also called a "fully connected HMM", as shown in Fig. 5.2, because each state can be reached from every other state of the model. The defining property of an ergodic HMM is 0 < a_ij < 1 for all i, j. The state transition matrix of a three-state ergodic model is given by

A = {a_ij} =  | a11  a12  a13 |
              | a21  a22  a23 |
              | a31  a32  a33 |
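For contrast with the left-right case, a numeric example (again with assumed values) of a fully connected ergodic matrix:

    import numpy as np

    # 3-state ergodic transition matrix: 0 < a_ij < 1 for every (i, j)
    A_ergodic = np.array([
        [0.4, 0.3, 0.3],
        [0.2, 0.5, 0.3],
        [0.3, 0.3, 0.4],
    ])
    assert np.allclose(A_ergodic.sum(axis=1), 1.0)  # each row is a distribution
    assert (A_ergodic > 0).all()   # every state reachable from every other state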
The structural property of the continuous ergodic HMM is such that it captures not only the underlying patterns in the temporal sequencing of sound units but also patterns that are non-temporal in nature. Hence, to capture both categories of underlying patterns, the continuous ergodic HMM is used in this thesis for text-independent speaker recognition.
Fig. 5.2: A Three State Continuous Ergodic HMM.
5.2 EXTRACTION OF FEATURES AT THREE LEVELS
The speech signal of each speaker is collected at a sampling rate of 16 kHz and resampled to 8 kHz. A frame size of 5 ms and a frame shift of 2.5 ms are used for subsegmental-level processing of the LP residual. The segmental and suprasegmental processing of the LP residual, the HE of the LP residual and the RP of the LP residual follow the procedures explicated in Chapter 4.
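A hedged sketch of this pre-processing step: resampling from 16 kHz to 8 kHz and blocking a signal into 5 ms frames with a 2.5 ms shift. The function and variable names are assumptions for illustration, not the thesis implementation (which was in MATLAB).

    import numpy as np
    from scipy.signal import resample_poly

    def frame_signal(x, fs=8000, frame_ms=5.0, shift_ms=2.5):
        """Block x into overlapping frames (subsegmental level)."""
        frame_len = int(fs * frame_ms / 1000)   # 40 samples at 8 kHz
        shift = int(fs * shift_ms / 1000)       # 20 samples at 8 kHz
        n_frames = 1 + (len(x) - frame_len) // shift
        return np.stack([x[i * shift : i * shift + frame_len]
                         for i in range(n_frames)])

    speech_16k = np.random.randn(16000)                  # placeholder 1 s utterance
    speech_8k = resample_poly(speech_16k, up=1, down=2)  # 16 kHz -> 8 kHz
    frames = frame_signal(speech_8k)                     # shape (n_frames, 40)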
5.3 VITERBI ALGORITHM AND ITS APPLICATION TO SPEAKER RECOGNITION

The Viterbi algorithm is used in the speaker recognition task, where one HMM is trained for each speaker. For an observation sequence derived from a speech utterance, the Viterbi algorithm finds the most likely state sequence in a given HMM, along with the likelihood value associated with that sequence [30, 111, 105, 109]. A set of HMMs is trained for a predetermined set of speakers.
During the recognition phase, the Viterbi algorithm is used to determine which HMM from this set best matches a given input observation sequence. This application of the Viterbi algorithm is demonstrated in Fig. 5.3, which shows a recognition system with three ergodic HMMs. The optimal state sequence for each HMM is denoted by a thick line. The likelihood value associated with each optimal state sequence is computed, and the HMM corresponding to the maximum likelihood is identified.
The key observation in the Viterbi algorithm is that, for any state at time t, there is only one most likely path to that state. Therefore, if several paths converge to a particular state at time t, the less likely paths can be discarded when computing the transitions from this state to the states at time t+1, and only the most likely path needs to be carried forward. Applied at every time step, this reduces the number of calculations to about T·N², which is far fewer than the roughly N^T computations of the brute-force method. The steps involved in the Viterbi algorithm are presented in Section 5.3.1.
Fig. 5.3: Finding the Optimal State Sequence in Ergodic HMM based Speaker Recognition System.
5.3.1. Viterbi Algorithm
In the state sequence estimation problem, a set of T observations, O = {o_1, o_2, ..., o_T}, and an N-state HMM λ are given. The goal is to estimate the state sequence S = {s(1), s(2), ..., s(T)} that maximizes the likelihood L(O | S, λ). The most likely state sequence can be determined using dynamic programming [104 and 105]. Let φ_j(t) represent the probability of the most likely state sequence accounting for observation vectors o_1 through o_t and ending in state j at time t, and let B_j(t) represent the predecessor state that gives this probability. Then φ_j(t) and B_j(t) can be expressed as

φ_j(t) = max_i {φ_i(t-1) a_ij} b_j(o_t)    (5.1)
B_j(t) = arg max_i {φ_i(t-1) a_ij}    (5.2)

using the initial conditions

φ_1(1) = 1    (5.3)
B_1(1) = 0    (5.4)
φ_j(1) = a_1j b_j(o_1) for 1 < j ≤ N    (5.5)
B_j(1) = 1    (5.6)

In Eq. 5.1, the probability φ_j(t) is computed using a recursive relation. Using B_j(t), and assuming that the model must end in the final state at time T (s(T) = N), the sequence of states on the maximum-likelihood path can be recovered recursively using the equation

s(t-1) = B_s(t)(t).    (5.7)

In other words, starting with s(T) known, Eq. 5.7 gives the maximum-likelihood state at time T-1 (i.e., s(T-1) = B_s(T)(T) = B_N(T)).
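A minimal log-domain sketch of Eqs. 5.1-5.7, written in Python for illustration. The emission log-probabilities log b_j(o_t) are assumed to be precomputed in logB (shape T x N); the general initialization with log π reduces to Eqs. 5.3-5.6 when π places all its mass on state 1.

    import numpy as np

    def viterbi(logA, logpi, logB):
        """logA: (N, N), logpi: (N,), logB: (T, N). Returns best path and score."""
        T, N = logB.shape
        phi = np.zeros((T, N))                 # log phi_j(t), Eq. 5.1
        back = np.zeros((T, N), dtype=int)     # B_j(t), Eq. 5.2
        phi[0] = logpi + logB[0]               # initialization
        for t in range(1, T):
            scores = phi[t - 1][:, None] + logA   # log phi_i(t-1) + log a_ij
            back[t] = scores.argmax(axis=0)
            phi[t] = scores.max(axis=0) + logB[t]
        states = np.empty(T, dtype=int)        # backtrack, Eq. 5.7
        states[-1] = phi[-1].argmax()
        for t in range(T - 1, 0, -1):
            states[t - 1] = back[t, states[t]]
        return states, phi[-1].max()           # path and its log-likelihood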
5.4 DATABASE USED FOR EXPERIMENTAL STUDY
As mentioned in Section 4.3.1, the TIMIT corpus of read speech is used to evaluate the speaker recognition system. We consider 38 speakers, with different utterances for training and testing. Throughout this study, closed-set identification experiments are conducted to demonstrate the feasibility of capturing speaker-specific information from the LP residual and from the HE and RP of the LP residual at the subsegmental, segmental and suprasegmental levels. The following sections present the speaker recognition performance using the ergodic HMM.
5.5 PROCESSING OF LP RESIDUAL FOR SPEAKER RECOGNITION USING CONTINUOUS ERGODIC HMM AT THREE LEVELS

The system is implemented in Matlab 7 on the Windows XP platform. An LP order of 12 is used for all experiments. The HMMs are trained with 2, 4, 8, 16 and 32 Gaussian components per state for the training and testing speech utterances. The steps involved in the proposed algorithm for the text-independent speaker recognition system using subsegmental, segmental and suprasegmental features from the LP residual are as follows:
Training phase for the subsegmental level:
for each speaker Pj in the speaker list of N speakers do
    for each training speech signal Si of speaker Pj do
        Preprocess the speech Si
        Compute Ŝi using the LP approximation
        Compute the LP residual ei = Si − Ŝi
        for each of the K frames of ei do
            Extract subsegmental features fk from ei
        end
    end
    Initialize the HMM parameters λj = (A, B, π)
    Train λj to an optimal solution using the EM algorithm
end
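The "compute LP residual" step above can be sketched as follows: a 12th-order LP fit per frame, inverse-filtered so that e_i = s_i − ŝ_i. This is an illustrative Python reconstruction using librosa and scipy (assumed libraries; the thesis implementation was in MATLAB).

    import numpy as np
    import librosa
    from scipy.signal import lfilter

    def lp_residual(frame, order=12):
        """Return the LP residual e = s - s_hat for one analysis frame."""
        a = librosa.lpc(frame, order=order)       # coefficients [1, a_1, ..., a_p]
        s_hat = lfilter(np.hstack([[0.0], -a[1:]]), [1.0], frame)  # LP approximation
        return frame - s_hat                      # residual e_i = s_i - s_hat_i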
Testing phase for the subsegmental level:
for each speaker Pj in the speaker list of N speakers do
    for each test speech signal Si of speaker Pj do
        Preprocess the speech Si
        Compute Ŝi using the LP approximation
        Compute the LP residual ei = Si − Ŝi
        for each of the K frames of ei do
            Extract subsegmental features fk from ei
        end
    end
    for each model λ1, λ2, ..., λN do
        Using the Viterbi decoding process, calculate P(O | λj), the
        probability of the observation sequence O = (o1 o2 ... oT)
    end
    Calculate the 1-best result for the given test speech signal using
    arg max_j P(O | λj)
end
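The 1-best decision at the end of the testing phase amounts to arg max_j P(O | λ_j) over the N speaker models. A minimal sketch, reusing the viterbi() function from the earlier sketch in Section 5.3.1 (names are assumptions):

    import numpy as np

    def identify_speaker(logB_per_model, models):
        """Return the index of the 1-best speaker model for one utterance."""
        scores = []
        for logB, m in zip(logB_per_model, models):
            _, loglik = viterbi(m["logA"], m["logpi"], logB)  # earlier sketch
            scores.append(loglik)
        return int(np.argmax(scores))    # recognized speaker index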
Training phase for the segmental level:
for each speaker Pj in the speaker list of N speakers do
    for each training speech signal Si of speaker Pj do
        Preprocess the speech Si
        Compute Ŝi using the LP approximation
        Compute the LP residual ei = Si − Ŝi
        for each of the K frames of ei do
            Extract segmental features fk from ei
        end
    end
    Initialize the HMM parameters λj = (A, B, π)
    Train λj to an optimal solution using the EM algorithm
end
Testing phase for the segmental level:
for each speaker Pj in the speaker list of N speakers do
    for each test speech signal Si of speaker Pj do
        Preprocess the speech Si
        Compute Ŝi using the LP approximation
        Compute the LP residual ei = Si − Ŝi
        for each of the K frames of ei do
            Extract segmental features fk from ei
        end
    end
    for each model λ1, λ2, ..., λN do
        Using the Viterbi decoding process, calculate P(O | λj), the
        probability of the observation sequence O = (o1 o2 ... oT)
    end
    Calculate the 1-best result for the given test speech signal using
    arg max_j P(O | λj)
end
Training phase for the suprasegmental level:
for each speaker Pj in the speaker list of N speakers do
    for each training speech signal Si of speaker Pj do
        Preprocess the speech Si
        Compute Ŝi using the LP approximation
        Compute the LP residual ei = Si − Ŝi
        for each of the K frames of ei do
            Extract suprasegmental features fk from ei
        end
    end
    Initialize the HMM parameters λj = (A, B, π)
    Train λj to an optimal solution using the EM algorithm
end
Testing phase for the suprasegmental level:
for each speaker Pj in the speaker list of N speakers do
    for each test speech signal Si of speaker Pj do
        Preprocess the speech Si
        Compute Ŝi using the LP approximation
        Compute the LP residual ei = Si − Ŝi
        for each of the K frames of ei do
            Extract suprasegmental features fk from ei
        end
    end
    for each model λ1, λ2, ..., λN do
        Using the Viterbi decoding process, calculate P(O | λj), the
        probability of the observation sequence O = (o1 o2 ... oT)
    end
    Calculate the 1-best result for the given test speech signal using
    arg max_j P(O | λj)
end
The speaker recognition rate is defined as the ratio of the number of speakers correctly recognized to the total number of speakers tested. We calculate the speaker recognition rate for various model parameters: different numbers of Gaussian mixture components and different numbers of HMM states. The recognition performance at the subsegmental, segmental and suprasegmental levels of the LP residual, tabulated for different model parameters together with the corresponding charts, is shown in Figs. 5.4 to 5.10 and Tables 5.1 to 5.5.
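As a trivial worked example of this definition (with made-up counts for illustration, not thesis results):

    def recognition_rate(n_correct, n_tested):
        """Speaker recognition rate as a percentage."""
        return 100.0 * n_correct / n_tested

    print(f"{recognition_rate(36, 38):.2f}%")   # 36 of 38 correct -> 94.74%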
Fig. 5.4 and Table 5.1 show the speaker recognition performance of a single-state ergodic HMM for different numbers of Gaussian mixture components. The system is trained with 8 speech utterances and tested with 2 speech utterances per speaker. As shown in Fig. 5.4 and Table 5.1, for a model with 32 Gaussian components, the recognition performance at the subsegmental, segmental and suprasegmental levels of the LP residual, and of the combined feature scores of all three levels (the complete source features, SRC) fused with MFCCs, was found to be 85.33%, 100%, 100% and 95.33% respectively. From these results, it is observed that the speaker recognition rate increases as the number of mixture components increases. The speaker recognition performance is reported as a percentage (%) in all the tables of this chapter.
Table 5.1: Recognition performance (%) of subsegmental (Sub), segmental (Seg) and suprasegmental (Supra) levels of the LP residual and their combination (SRC = Sub+Seg+Supra), along with MFCCs, for a single-state ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 6.67    23.33   100     55.18   80      65.5
4                 63.33   36.67   100     75.55   95      83
8                 70      83.33   100     84.68   95      90
16                80      100     96.67   92.11   90      92
32                85.33   100     100     95.67   95      95.33
Fig. 5.4: Single-State Ergodic HMM Recognition Performance for Varying Number of Mixture Components of a) Sub, Seg and Supra Levels of LP Residual and b) SRC=Sub+Seg+Supra along with MFCCs.
Table 5.2: Recognition performance (%) of Sub, Seg and Supra levels of the LP residual and their combination (SRC), along with MFCCs, for a two-states ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 42.22   23.33   100     55.18   80      65.5
4                 63.33   63.33   100     75.55   95      83
8                 70      83.33   100     84.67   95      89.88
16                80      96.33   96.67   92.11   90      91.11
32                85.33   100     100     95.67   95      95.33
Fig. 5.5: Two-States Ergodic HMM Recognition Performance for Varying Number of Mixture Components of a) Sub, Seg and Supra Levels of LP Residual and b) SRC=Sub+Seg+Supra along with MFCCs.
Table 5.3: Recognition performance (%) of Sub, Seg and Supra levels of the LP residual and their combination (SRC), along with MFCCs, for a three-states ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 16.67   36.67   100     51.11   50      50.5
4                 3.33    63.33   100     55.56   70      62.77
8                 20      93.33   100     71.11   80      75.56
16                26.33   96.33   96.67   64.11   86.67   70.39
32                56.67   100     100     85.22   76.67   83.33
Fig. 5.6: Three-States Ergodic HMM Recognition Performance for Varying Number of Mixture Components of a) Sub, Seg and Supra Levels of LP Residual and b) SRC=Sub+Seg+Supra along with MFCCs.
Table 5.4: Recognition performance (%) of Sub, Seg and Supra levels of the LP residual and their combination (SRC), along with MFCCs, for a four-states ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 3.33    26.67   93.33   41.11   20      30.55
4                 6.67    66.67   100     57.77   53.33   55.55
8                 3.33    80      100     61.11   53.33   57.22
16                3.33    86.67   96.67   62.22   73.33   67.77
32                3.33    96.67   93.33   64.44   56.67   60.55
Fig. 5.7: Four States Ergodic HMM Recognition Performance for Varying Number of Mixture Components of a) Sub, Seg and Supra Levels of LP Residual and b) SRC=Sub+Seg+Supra along with MFCCs.
Table 5.5: Recognition performance (%) of Sub, Seg and Supra levels of the LP residual and their combination (SRC), along with MFCCs, for a five-states ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 3.33    36.67   100     41.11   10      26.67
4                 6.67    66.67   100     57.78   30      45.55
8                 3.33    80      100     61.11   43.33   57.22
16                3.33    83.33   100     62.22   73.33   67.77
32                3.33    96.67   93.33   64.44   60      62.56
Fig. 5.8: Five-States Ergodic HMM Recognition Performance for Varying Number of Mixture Components of a) Sub, Seg and Supra Levels of LP Residual and b) SRC=Sub+Seg+Supra along with MFCCs.
Table 5.6: Average recognition performance (%) of Sub, Seg and Supra levels of the LP residual and their combination (SRC), along with MFCCs, for ergodic HMMs with different numbers of states.

No. of states   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
1               60.47   68.66   99.33   80      91      86.67
2               67.57   73.33   99.33   85      91      90
3               41      77.33   99.33   72.33   72.67   75
4               4       70.67   96.67   57      50      50
5               4       70.67   93.33   56.68   42.33   50
Fig. 5.9: Average Recognition Performance of Ergodic HMM with Different Number of States of a) Sub, Seg and Supra Levels of LP Residual and b) SRC=Sub+Seg+Supra along with MFCCs
Fig. 5.9 and Table 5.6 show the ergodic HMM speaker recognition rate for HMMs with different numbers of states. The system is trained with 8 speech utterances and tested with 2 speech utterances per speaker. As illustrated in Fig. 5.9 and Table 5.6, the average recognition performance was found to be 86.67% for an HMM with a single state, 90% for an HMM with two states and 50% for an HMM with five states. From these results, it is observed that the average speaker recognition performance is highest for the two-state model: the performance increases from one state to two states and then decreases from two states to three and four states, with the four-state and five-state models performing equally. Since the two-state HMM gives the highest average performance, and the three-, four- and five-state models show decreasing average performance (Table 5.6), we stopped increasing the number of states beyond this point, which also reduces the computational complexity of the speaker recognition system. Further, the recognition performance also increases with the number of mixture components. The following section describes the speaker recognition performance of the HE and RP of the LP residual at the subsegmental, segmental and suprasegmental levels.
5.6 SUBSEGMENTAL, SEGMENTAL AND SUPRASEGMENTAL PROCESSING OF HE AND RP OF LP RESIDUAL FOR SPEAKER RECOGNITION USING CONTINUOUS ERGODIC HMM
In the previous section, speaker information was derived by processing the LP residual directly at the subsegmental, segmental and suprasegmental levels. The dominant speaker information present in these three levels of processing mostly represents the amplitude and sequence information of the source. When the LP residual is processed directly, the effect of the amplitude values dominates the sequence information around the instants of glottal closure [114]. Therefore, the amplitude and phase information are separated using the analytic signal representation of the LP residual; the resulting features are called the HE features and the RP features, as explained in Chapter 4.

The performance of these features is given in Tables 5.7-5.12. We observe that the RP features perform better than the HE features.
Table 5.7: Recognition performance (%) of Sub, Seg and Supra levels of the HE of the LP residual and their combination (SRC), along with MFCCs, for a single-state ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 3.33    23.33   100     42.33   80      61
4                 6.67    53.33   100     53.33   95      74.33
8                 10      66.67   100     58.88   95      76.99
16                13.33   93.33   100     68.89   90      80
32                20      100     100     73.33   95      84
Fig. 5.10: Recognition Performance of HE of LP Residual for Single-State Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs.
Table 5.8: Recognition performance (%) of Sub, Seg and Supra levels of the HE of the LP residual and their combination (SRC), along with MFCCs, for a two-states ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 3.33    20      83.33   35.67   80      55
4                 6.67    53.33   96.67   52.33   95      75
8                 3.33    76.67   96.67   58.67   95      77.5
16                16.67   96.67   100     70      90      80
32                40      100     100     80      95      87.5
Fig. 5.11: Recognition Performance of HE of LP Residual for Two-States Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs.
Table 5.9: Recognition performance (%) of Sub, Seg and Supra levels of the HE of the LP residual and their combination (SRC), along with MFCCs, for a three-states ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 3.33    16.67   83.33   33.33   50      47
4                 3.33    50      96.67   50      70      60
8                 10      60      96.67   60      80      70
16                13.33   83.33   100     70      86.67   78
32                26.67   100     100     80      76.67   78
Fig. 5.12: Recognition Performance of HE of LP Residual for three States Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs
Table 5.10: Recognition performance (%) of Sub, Seg and Supra levels of the RP of the LP residual and their combination (SRC), along with MFCCs, for a single-state ergodic HMM.

No. of mixtures   Sub     Seg   Supra   SRC     MFCCs   SRC+MFCCs
2                 100     100   100     100     80      90
4                 100     100   100     100     95      97.5
8                 93.33   100   100     97.67   95      96.67
16                100     100   100     100     90      95
32                86.67   100   100     100     95      97.5
Fig. 5.13: Recognition Performance of RP of LP Residual for Single-State Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs
Table 5.11: Recognition performance (%) of Sub, Seg and Supra levels of the RP of the LP residual and their combination (SRC), along with MFCCs, for a two-states ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 96.67   100     100     99      80      89.5
4                 86.67   100     100     95.56   95      95.33
8                 96.67   100     100     98.89   95      97
16                96.67   96.67   100     96.67   90      93.33
32                100     100     100     100     95      97.5
Fig. 5.14: Recognition Performance of RP of LP Residual for Two-States Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs.
Table 5.12: Recognition performance (%) of Sub, Seg and Supra levels of the RP of the LP residual and their combination (SRC), along with MFCCs, for a three-states ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 100     100     100     100     50      75
4                 86.67   96.67   100     95.56   70      82.77
8                 93.33   93.33   100     95.44   80      87.5
16                83.33   93.33   100     93.17   86.67   90
32                100     96.67   100     98.33   76.67   90
Fig. 5.15: Recognition Performance of RP of LP Residual for three States Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs
5.7 COMBINING EVIDENCE FROM EACH LEVEL OF HE AND RP OF LP RESIDUAL
The performance of the individual HE and RP features is poor compared to that of the corresponding residual features because, as mentioned earlier, the HE and RP features independently represent two different aspects of the information present in the residual features. The integration of the HE and RP features proves to give better performance than the residual features. Tables 5.13-5.15 and Figs. 5.16-5.18 indicate the robustness of the speaker recognition system using the combination of the HE and RP features.
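At the implementation level, this combination can be done at the score level. Below is a minimal sketch under the assumption of a simple weighted sum of per-model log-likelihood scores from the two streams; the thesis does not specify the fusion weights, so equal weights are assumed here.

    import numpy as np

    def fuse_scores(scores_he, scores_rp, w=0.5):
        """Weighted score-level fusion of HE and RP evidence per speaker model."""
        return w * np.asarray(scores_he) + (1.0 - w) * np.asarray(scores_rp)

    # Toy values: log-likelihoods of one test utterance under two speaker models
    fused = fuse_scores([-120.5, -98.2], [-110.1, -95.7])
    best = int(np.argmax(fused))    # recognized speaker index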
Table 5.13: Recognition performance (%) of Sub, Seg and Supra levels of the combined HE and RP of the LP residual and their combination (SRC), along with MFCCs, for a single-state ergodic HMM.

No. of mixtures   Sub(HE+RP)   Seg(HE+RP)   Supra(HE+RP)   SRC     MFCCs   SRC+MFCCs
2                 51.33        61.33        100            73.33   86.67   76.67
4                 54           78           100            80      95      87.5
8                 55           83.33        100            84      95      90
16                56.67        96.67        100            86.67   90      90
32                54           100          100            90      95      93
Fig. 5.16: Recognition Performance of Combination of HE and RP of LP Residual for Single-State Ergodic HMM of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs.
Table 5.14: Recognition performance (%) of Sub, Seg and Supra levels of the combined HE and RP of the LP residual and their combination (SRC), along with MFCCs, for a two-states ergodic HMM.

No. of mixtures   Sub(HE+RP)   Seg(HE+RP)   Supra(HE+RP)   SRC     MFCCs   SRC+MFCCs
2                 56.67        60           91.33          73.33   80      76.67
4                 50           76.67        98.33          80      95      86.67
8                 50           88.33        99             88.33   95      96.67
16                56.67        95.67        100            95      90      98.67
32                75           100          100            100     95      100
Fig. 5.17: Recognition Performance of Combination of HE and RP of LP Residual for Two-States Ergodic HMM of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs
Table 5.15: Recognition performance (%) of Sub, Seg and Supra levels of the combined HE and RP of the LP residual and their combination (SRC), along with MFCCs, for a three-states ergodic HMM.

No. of mixtures   Sub(HE+RP)   Seg(HE+RP)   Supra(HE+RP)   SRC     MFCCs   SRC+MFCCs
2                 56.67        60           91.33          70      50      66.67
4                 50           76.67        98.33          76.67   70      76.67
8                 50           88.33        99             86.67   80      86.67
16                56.67        95.67        100            95      86.67   98.67
32                75           100          100            100     76.67   100
Fig. 5.18: Recognition Performance of Combination of HE and RP of LP Residual for Three-States Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs.
5.8 COMPARISON STUDY ON SPEAKER RECOGNITION USING ERGODIC HMM FOR LP RESIDUAL, HE, RP AND FUSION OF HE AND RP FEATURES
Table 5.16: Comparison of recognition performance (%) of the proposed HMM-based model on the TIMIT database for the different types of features.

Type of features   Sub     Seg   Supra   SRC=Sub+Seg+Supra   MFCCs   SRC+MFCCs
LP residual        85.33   100   100     95.68               95      95.33
HE                 40      100   100     80                  95      87.5
RP                 100     100   100     100                 95      97.5
HE+RP              75      100   100     100                 95      100

It is observed that with the HMM the RP features (i.e., the sequence information) give 100% recognition, whereas the HE features (i.e., the amplitude information) and the LP residual (i.e., both amplitude and sequence information) give 80% and 95.68% respectively. This is because, in the LP residual, the amplitude information dominates the sequence information. We also find that the HMM exploits the integration of the HE and RP features more efficiently. By comparing Tables 4.12 and 4.13 of Chapter 4 with Table 5.16 of this chapter, we demonstrate that the HMM extracts the sequence information and models intra-speaker variability more effectively than the GMM.
5.9 SUMMARY
Source features of different natures are derived from the LP residual, the HE of the LP residual and the RP of the LP residual at the subsegmental, segmental and suprasegmental levels, and are used to develop speaker recognition systems based on hidden Markov models. The hidden Markov model exploits the sequence information and is powerful in modeling intra-speaker variability. Hence, the performance at the suprasegmental level of the LP residual, the HE of the LP residual and the RP of the LP residual is better than that of the other two levels of features. The scores from each level of the LP residual, the HE of the LP residual and the RP of the LP residual are combined individually to improve the speaker recognition performance. The fusion of the HE of the LP residual and the RP of the LP residual enhances the performance of the speaker recognition system compared with systems using the individual features alone.