CHAPTER-5
SUBSEGMENTAL, SEGMENTAL AND SUPRASEGMENTAL FEATURES FOR SPEAKER RECOGNITION USING ERGODIC HIDDEN MARKOV MODEL
In the previous chapter, we discussed source features for speaker recognition using the GMM. In this chapter, we illustrate the effectiveness of the HMM in capturing the complete source features extracted at the subsegmental, segmental and suprasegmental levels from the LP residual, the HE of the LP residual, the RP of the LP residual, and the fusion of the HE and RP of the LP residual, and their use in speaker recognition. The main objective of this chapter is to implement a speaker recognition system using the ergodic HMM. We begin with a brief discussion of the HMM. This chapter is organized as follows: the analysis of the HMM is presented in Section 5.1. Section 5.2 introduces the extraction of features from the subsegmental, segmental and suprasegmental levels of processing of the LP residual signal. Section 5.3 explains the Viterbi algorithm and its application to speaker recognition. Section 5.4 describes the database used for the experimental study. Section 5.5 deals with residual-features-based speaker recognition using the continuous ergodic HMM. Section 5.6 deals with HE-features-based and RP-features-based speaker recognition using the continuous ergodic HMM. Section 5.7 demonstrates the improved speaker recognition system obtained by combining the HE and RP features from each level using the ergodic HMM. Section 5.8 gives a comparative study of speaker recognition for the residual features, the HE of the LP residual, the RP of the LP residual and the fusion of the HE and RP features using the ergodic HMM. Section 5.9 summarizes the chapter.
5.1 SPEAKER RECOGNITION USING HIDDEN MARKOV MODEL (HMM)
The hidden Markov model (HMM) is a widely used statistical method for characterizing the temporal properties of the time-varying frames of a pattern. Using an HMM, the parameters of a stochastic process can be estimated in a precise and well-defined manner. HMMs are suitable for modeling speech patterns, since speech can be characterized as a parametric random process. The HMM can absorb durational variations and capture the temporal sequencing among sounds. Hence, HMM-based systems are well suited to speech recognition applications.
5.1.1. Hidden Markov Model (HMM)
HMMs are similar to finite state diagrams, except that the states in an HMM are hidden. Each transition in the state diagram of an HMM has a transition probability associated with it. These transition probabilities are denoted by the matrix A, defined as A = {a_ij}, where a_ij = P(i_{t+1} = j | i_t = i) is the probability of being in state j at time t+1, given that we were in state i at time t. It is assumed that a_ij is independent of time.
Each state is associated with a set of continuous observations, where each set has a continuous observation probability density. These observation probabilities are denoted by the parameter B, defined as B = {b_j(k)}, where b_j(k) = P(v_k at t | i_t = j) is the probability of observing the symbol v_k given that we are in state j. The initial state probabilities are denoted by the vector π, defined as π = {π_i}, where π_i = P(i_1 = i) is the probability of being in state i at t = 1. Using the three parameters A, B and π, an HMM can be compactly denoted as λ = (A, B, π).
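As a concrete illustration of this compact notation, the following Python sketch builds the three parameter containers of λ = (A, B, π) for a small continuous HMM with Gaussian-mixture emissions. The state count, feature dimension and mixture sizes are assumptions chosen for the example, not values from the thesis.

    import numpy as np

    N = 3                                  # number of hidden states (assumed)
    A = np.full((N, N), 1.0 / N)           # transition probabilities a_ij
    pi = np.full(N, 1.0 / N)               # initial-state probabilities pi_i

    # For a continuous HMM, B holds a per-state emission density; here each
    # state stores the weights, means and diagonal variances of a small GMM.
    D, M = 12, 2                           # feature dimension, mixtures per state
    B = [{"w": np.full(M, 1.0 / M),
          "mu": np.zeros((M, D)),
          "var": np.ones((M, D))}
         for _ in range(N)]

    model = {"A": A, "B": B, "pi": pi}     # lambda = (A, B, pi)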
There are three fundamental problems associated with HMMs [105]: (i) computing the likelihood of a given observation sequence for a particular HMM, (ii) determining the best HMM state sequence associated with a given observation sequence, and (iii) estimating the parameters of the HMM that maximize the likelihood of a given observation sequence. Problem (i) is useful in the speaker recognition phase: for a given parameter sequence (observation sequence) derived from the test speech utterance, the likelihood value of each HMM is computed using the forward procedure [88]. Here, one HMM corresponds to one speaker. The speaker associated with the HMM for which the likelihood is maximum is identified as the recognized speaker for the input speech utterance. Problem (iii) is associated with training the HMM for a given speech unit. The parameters of the HMM, λ, are iteratively refined for maximum-likelihood estimation using the Baum-Welch algorithm [106]. Parameter estimation for HMMs in which each state is associated with a mixture of multivariate densities is demonstrated in [103]. The Viterbi algorithm [107, 108] is used to solve problem (ii), as it is computationally efficient. The objective of an HMM-based speaker recognition system is to accurately estimate the parameters of the HMM from a training dataset.
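To make problem (i) concrete, here is a minimal sketch of the forward procedure for computing log P(O | λ). For brevity it assumes discrete observation symbols (the thesis uses continuous Gaussian-mixture emissions), and the standard scaling trick is used to avoid numerical underflow.

    import numpy as np

    def forward_loglik(A, B, pi, obs):
        """A: (N, N) transitions, B: (N, K) symbol probabilities,
        pi: (N,) initial probabilities, obs: list of symbol indices."""
        alpha = pi * B[:, obs[0]]              # initialization, t = 1
        c = alpha.sum(); alpha /= c            # scale factor c_1
        log_lik = np.log(c)
        for o_t in obs[1:]:
            alpha = (alpha @ A) * B[:, o_t]    # induction step
            c = alpha.sum(); alpha /= c
            log_lik += np.log(c)               # log P(O | lambda) = sum_t log c_t
        return log_lik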
5.1.2. Left-Right HMM
HMMs are characterized by their transition matrix A = {a_ij}. The defining property of a left-right model is a_ij = 0 for j < i; i.e., no transition is allowed to a state whose index is less than that of the current state, as shown in Fig. 5.1. Further, the initial state probabilities exhibit the following property:

π_i = 1 for i = 1, and π_i = 0 for i ≠ 1
For a three-state left-right model, the state transition matrix is given as

A = {a_ij} =  | a11  a12  a13 |
              |  0   a22  a23 |
              |  0    0   a33 |
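A small numeric illustration (with assumed values) of such a left-right matrix, showing the zero structure below the diagonal:

    import numpy as np

    # 3-state left-right transition matrix: a_ij = 0 for j < i
    A_left_right = np.array([
        [0.6, 0.4, 0.0],   # state 1 -> {1, 2}
        [0.0, 0.7, 0.3],   # state 2 -> {2, 3}
        [0.0, 0.0, 1.0],   # state 3 is absorbing
    ])
    assert np.allclose(A_left_right.sum(axis=1), 1.0)  # each row is a distribution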
Continuous HMMs can capture speaker-specific features more effectively than discrete HMMs [109]. Continuous left-right HMM based speaker recognition systems can capture only the underlying pattern in the temporal sequence of sounds [110], which gives good recognition performance for text-dependent speaker recognition systems. For text-independent speaker recognition systems, however, where the time-varying text information is completely absent, the continuous left-right HMM may not give good speaker recognition performance.
Fig. 5.1: A Three State Continuous Left-Right HMM.
5.1.3. Continuous Ergodic HMM: The Desirable Statistical Model for Speaker Recognition
The structure of an ergodic model is defined by its transition matrix A = {a_ij}. An ergodic model is also called a "fully connected HMM", as shown in Fig. 5.2, because each state can be reached from every other state of the model. The defining property of an ergodic HMM is 0 < a_ij < 1 for all i, j. The state transition matrix of a three-state ergodic model is given by

A = {a_ij} =  | a11  a12  a13 |
              | a21  a22  a23 |
              | a31  a32  a33 |
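For contrast with the left-right case, a numeric example (again with assumed values) of a fully connected ergodic matrix:

    import numpy as np

    # 3-state ergodic transition matrix: 0 < a_ij < 1 for every (i, j)
    A_ergodic = np.array([
        [0.4, 0.3, 0.3],
        [0.2, 0.5, 0.3],
        [0.3, 0.3, 0.4],
    ])
    assert np.allclose(A_ergodic.sum(axis=1), 1.0)  # each row is a distribution
    assert (A_ergodic > 0).all()   # every state reachable from every other state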
The structural property of the continuous ergodic HMM is such that it captures not only the underlying patterns in the temporal sequencing of sound units but also patterns that are non-temporal in nature. Hence, to capture both categories of underlying patterns, the continuous ergodic HMM is used in this thesis for text-independent speaker recognition.
Fig. 5.2: A Three State Continuous Ergodic HMM.
5.2 EXTRACTION OF FEATURES AT THREE LEVELS
The speech signal of each speaker is collected at a sampling rate of 16 kHz and resampled to 8 kHz. A frame size of 5 ms and a frame shift of 2.5 ms are used for subsegmental-level processing of the LP residual. The segmental and suprasegmental processing of the LP residual, the HE of the LP residual and the RP of the LP residual follow the procedures explicated in Chapter 4.
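A hedged sketch of this pre-processing step: resampling from 16 kHz to 8 kHz and blocking a signal into 5 ms frames with a 2.5 ms shift. The function and variable names are assumptions for illustration, not the thesis implementation (which was in MATLAB).

    import numpy as np
    from scipy.signal import resample_poly

    def frame_signal(x, fs=8000, frame_ms=5.0, shift_ms=2.5):
        """Block x into overlapping frames (subsegmental level)."""
        frame_len = int(fs * frame_ms / 1000)   # 40 samples at 8 kHz
        shift = int(fs * shift_ms / 1000)       # 20 samples at 8 kHz
        n_frames = 1 + (len(x) - frame_len) // shift
        return np.stack([x[i * shift : i * shift + frame_len]
                         for i in range(n_frames)])

    speech_16k = np.random.randn(16000)                  # placeholder 1 s utterance
    speech_8k = resample_poly(speech_16k, up=1, down=2)  # 16 kHz -> 8 kHz
    frames = frame_signal(speech_8k)                     # shape (n_frames, 40)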
5.3 VITERBI ALGORITHM AND ITS APPLICATION TO SPEAKER RECOGNITION

The Viterbi algorithm is used in the speaker recognition task, where one HMM is trained for each speaker. For an observation sequence derived from a speech utterance, the Viterbi algorithm finds the most likely state sequence in a given HMM, along with the likelihood value associated with that sequence [30, 111, 105, 109]. A set of HMMs is trained for a predetermined set of speakers.
During the recognition phase, the Viterbi algorithm is used to determine which HMM from this set best matches a given input observation sequence. This application of the Viterbi algorithm is demonstrated in Fig. 5.3, which shows a recognition system with three ergodic HMMs. The optimal state sequence for each HMM is denoted by a thick line. The likelihood value associated with each optimal state sequence is computed, and the HMM corresponding to the maximum likelihood is identified.
The key observation in the Viterbi algorithm is that, for any state at time t, there is only one most likely path to that state. Therefore, if several paths converge to a particular state at time t, the less likely paths can be discarded when computing the transitions from this state to the states at time t+1, and only the most likely path needs to be carried forward. Applied at every time step, this reduces the number of calculations to about T·N², which is far fewer than the roughly N^T computations of the brute-force method. The steps involved in the Viterbi algorithm are presented in Section 5.3.1.
Fig. 5.3: Finding the Optimal State Sequence in Ergodic HMM based Speaker Recognition System.
5.3.1. Viterbi Algorithm
In the state sequence estimation problem, a set of T observations, O = {o_1, o_2, ..., o_T}, and an N-state HMM λ are given. The goal is to estimate the state sequence S = {s(1), s(2), ..., s(T)} that maximizes the likelihood L(O | S, λ). The most likely state sequence can be determined using dynamic programming [104 and 105]. Let φ_j(t) represent the probability of the most likely state sequence accounting for observation vectors o_1 through o_t and ending in state j at time t, and let B_j(t) represent the predecessor state that gives this probability. Then φ_j(t) and B_j(t) can be expressed as

φ_j(t) = max_i {φ_i(t-1) a_ij} b_j(o_t)    (5.1)
B_j(t) = arg max_i {φ_i(t-1) a_ij}    (5.2)

using the initial conditions

φ_1(1) = 1    (5.3)
B_1(1) = 0    (5.4)
φ_j(1) = a_1j b_j(o_1) for 1 < j ≤ N    (5.5)
B_j(1) = 1    (5.6)

In Eq. 5.1, the probability φ_j(t) is computed using a recursive relation. Using B_j(t), and assuming that the model must end in the final state at time T (s(T) = N), the sequence of states on the maximum-likelihood path can be recovered recursively using the equation

s(t-1) = B_s(t)(t).    (5.7)

In other words, starting with s(T) known, Eq. 5.7 gives the maximum-likelihood state at time T-1 (i.e., s(T-1) = B_s(T)(T) = B_N(T)).
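A minimal log-domain sketch of Eqs. 5.1-5.7, written in Python for illustration. The emission log-probabilities log b_j(o_t) are assumed to be precomputed in logB (shape T x N); the general initialization with log π reduces to Eqs. 5.3-5.6 when π places all its mass on state 1.

    import numpy as np

    def viterbi(logA, logpi, logB):
        """logA: (N, N), logpi: (N,), logB: (T, N). Returns best path and score."""
        T, N = logB.shape
        phi = np.zeros((T, N))                 # log phi_j(t), Eq. 5.1
        back = np.zeros((T, N), dtype=int)     # B_j(t), Eq. 5.2
        phi[0] = logpi + logB[0]               # initialization
        for t in range(1, T):
            scores = phi[t - 1][:, None] + logA   # log phi_i(t-1) + log a_ij
            back[t] = scores.argmax(axis=0)
            phi[t] = scores.max(axis=0) + logB[t]
        states = np.empty(T, dtype=int)        # backtrack, Eq. 5.7
        states[-1] = phi[-1].argmax()
        for t in range(T - 1, 0, -1):
            states[t - 1] = back[t, states[t]]
        return states, phi[-1].max()           # path and its log-likelihood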
5.4 DATABASE USED FOR EXPERIMENTAL STUDY
As mentioned in Section 4.3.1, the TIMIT corpus of read speech is used to evaluate the speaker recognition system. We consider 38 speakers, with different utterances for training and testing. Throughout this study, closed-set identification experiments are conducted to demonstrate the feasibility of capturing speaker-specific information from the LP residual and from the HE and RP of the LP residual at the subsegmental, segmental and suprasegmental levels. The following sections present the speaker recognition performance using the ergodic HMM.
5.5 PROCESSING OF LP RESIDUAL FOR SPEAKER RECOGNITION USING CONTINUOUS ERGODIC HMM AT THREE LEVELS

The system is implemented in Matlab 7 on the Windows XP platform. An LP order of 12 is used for all experiments. The HMMs are trained with 2, 4, 8, 16 and 32 Gaussian components per state for the training and testing speech utterances. The steps involved in the proposed algorithm for the text-independent speaker recognition system using subsegmental, segmental and suprasegmental features from the LP residual are as follows:
Training phase for the subsegmental level:
for each speaker Pj in the speaker list of N speakers do
    for each training speech signal Si of speaker Pj do
        Preprocess the speech Si
        Compute Ŝi using the LP approximation
        Compute the LP residual ei = Si − Ŝi
        for each of the K frames of ei do
            Extract subsegmental features fk from ei
        end
    end
    Initialize the HMM parameters λj = (A, B, π)
    Train λj to an optimal solution using the EM algorithm
end
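The "compute LP residual" step above can be sketched as follows: a 12th-order LP fit per frame, inverse-filtered so that e_i = s_i − ŝ_i. This is an illustrative Python reconstruction using librosa and scipy (assumed libraries; the thesis implementation was in MATLAB).

    import numpy as np
    import librosa
    from scipy.signal import lfilter

    def lp_residual(frame, order=12):
        """Return the LP residual e = s - s_hat for one analysis frame."""
        a = librosa.lpc(frame, order=order)       # coefficients [1, a_1, ..., a_p]
        s_hat = lfilter(np.hstack([[0.0], -a[1:]]), [1.0], frame)  # LP approximation
        return frame - s_hat                      # residual e_i = s_i - s_hat_i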
Testing phase for the subsegmental level:
for each speaker Pj in the speaker list of N speakers do
    for each test speech signal Si of speaker Pj do
        Preprocess the speech Si
        Compute Ŝi using the LP approximation
        Compute the LP residual ei = Si − Ŝi
        for each of the K frames of ei do
            Extract subsegmental features fk from ei
        end
    end
    for each model λ1, λ2, ..., λN do
        Using the Viterbi decoding process, calculate P(O | λj), the
        probability of the observation sequence O = (o1 o2 ... oT)
    end
    Calculate the 1-best result for the given test speech signal using
    arg max_j P(O | λj)
end
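The 1-best decision at the end of the testing phase amounts to arg max_j P(O | λ_j) over the N speaker models. A minimal sketch, reusing the viterbi() function from the earlier sketch in Section 5.3.1 (names are assumptions):

    import numpy as np

    def identify_speaker(logB_per_model, models):
        """Return the index of the 1-best speaker model for one utterance."""
        scores = []
        for logB, m in zip(logB_per_model, models):
            _, loglik = viterbi(m["logA"], m["logpi"], logB)  # earlier sketch
            scores.append(loglik)
        return int(np.argmax(scores))    # recognized speaker index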
Training phase for the segmental level:
for each speaker Pj in the speaker list of N speakers do
    for each training speech signal Si of speaker Pj do
        Preprocess the speech Si
        Compute Ŝi using the LP approximation
        Compute the LP residual ei = Si − Ŝi
        for each of the K frames of ei do
            Extract segmental features fk from ei
        end
    end
    Initialize the HMM parameters λj = (A, B, π)
    Train λj to an optimal solution using the EM algorithm
end
Testing phase for the segmental level:
for each speaker Pj in the speaker list of N speakers do
    for each test speech signal Si of speaker Pj do
        Preprocess the speech Si
        Compute Ŝi using the LP approximation
        Compute the LP residual ei = Si − Ŝi
        for each of the K frames of ei do
            Extract segmental features fk from ei
        end
    end
    for each model λ1, λ2, ..., λN do
        Using the Viterbi decoding process, calculate P(O | λj), the
        probability of the observation sequence O = (o1 o2 ... oT)
    end
    Calculate the 1-best result for the given test speech signal using
    arg max_j P(O | λj)
end
Training phase for the suprasegmental level:
for each speaker Pj in the speaker list of N speakers do
    for each training speech signal Si of speaker Pj do
        Preprocess the speech Si
        Compute Ŝi using the LP approximation
        Compute the LP residual ei = Si − Ŝi
        for each of the K frames of ei do
            Extract suprasegmental features fk from ei
        end
    end
    Initialize the HMM parameters λj = (A, B, π)
    Train λj to an optimal solution using the EM algorithm
end
Testing phase for the suprasegmental level:
for each speaker Pj in the speaker list of N speakers do
    for each test speech signal Si of speaker Pj do
        Preprocess the speech Si
        Compute Ŝi using the LP approximation
        Compute the LP residual ei = Si − Ŝi
        for each of the K frames of ei do
            Extract suprasegmental features fk from ei
        end
    end
    for each model λ1, λ2, ..., λN do
        Using the Viterbi decoding process, calculate P(O | λj), the
        probability of the observation sequence O = (o1 o2 ... oT)
    end
    Calculate the 1-best result for the given test speech signal using
    arg max_j P(O | λj)
end
The speaker recognition rate is defined as the ratio of the number of speakers correctly recognized to the total number of speakers tested. We calculate the speaker recognition rate for various model parameters: different numbers of Gaussian mixture components and different numbers of HMM states. The recognition performance at the subsegmental, segmental and suprasegmental levels of the LP residual, tabulated for different model parameters together with the corresponding charts, is shown in Figs. 5.4 to 5.10 and Tables 5.1 to 5.5.
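As a trivial worked example of this definition (with made-up counts for illustration, not thesis results):

    def recognition_rate(n_correct, n_tested):
        """Speaker recognition rate as a percentage."""
        return 100.0 * n_correct / n_tested

    print(f"{recognition_rate(36, 38):.2f}%")   # 36 of 38 correct -> 94.74%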
Fig. 5.4 and Table 5.1 show the speaker recognition performance of a single-state ergodic HMM for different numbers of Gaussian mixture components. The system is trained with 8 speech utterances and tested with 2 speech utterances per speaker. As shown in Fig. 5.4 and Table 5.1, for a model with 32 Gaussian components, the recognition performance at the subsegmental, segmental and suprasegmental levels of the LP residual, and of the combined feature scores of all three levels (the complete source features, SRC) fused with MFCCs, was found to be 85.33%, 100%, 100% and 95.33% respectively. From these results, it is observed that the speaker recognition rate increases as the number of mixture components increases. The speaker recognition performance is reported as a percentage (%) in all the tables of this chapter.
Table 5.1: Recognition performance (%) of subsegmental (Sub), segmental (Seg) and suprasegmental (Supra) levels of the LP residual and their combination (SRC = Sub+Seg+Supra), along with MFCCs, for a single-state ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 6.67    23.33   100     55.18   80      65.5
4                 63.33   36.67   100     75.55   95      83
8                 70      83.33   100     84.68   95      90
16                80      100     96.67   92.11   90      92
32                85.33   100     100     95.67   95      95.33
Fig. 5.4: Single-State Ergodic HMM Recognition Performance for Varying Number of Mixture Components of a) Sub, Seg and Supra Levels of LP Residual and b) SRC=Sub+Seg+Supra along with MFCCs.
Table 5.2: Recognition performance (%) of Sub, Seg and Supra levels of the LP residual and their combination (SRC), along with MFCCs, for a two-states ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 42.22   23.33   100     55.18   80      65.5
4                 63.33   63.33   100     75.55   95      83
8                 70      83.33   100     84.67   95      89.88
16                80      96.33   96.67   92.11   90      91.11
32                85.33   100     100     95.67   95      95.33
Fig. 5.5: Two-States Ergodic HMM Recognition Performance for Varying Number of Mixture Components of a) Sub, Seg and Supra Levels of LP Residual and b) SRC=Sub+Seg+Supra along with MFCCs.
Table 5.3: Recognition performance (%) of Sub, Seg and Supra levels of the LP residual and their combination (SRC), along with MFCCs, for a three-states ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 16.67   36.67   100     51.11   50      50.5
4                 3.33    63.33   100     55.56   70      62.77
8                 20      93.33   100     71.11   80      75.56
16                26.33   96.33   96.67   64.11   86.67   70.39
32                56.67   100     100     85.22   76.67   83.33
Fig. 5.6: Three-States Ergodic HMM Recognition Performance for Varying Number of Mixture Components of a) Sub, Seg and Supra Levels of LP Residual and b) SRC=Sub+Seg+Supra along with MFCCs.
Table 5.4: Recognition performance (%) of Sub, Seg and Supra levels of the LP residual and their combination (SRC), along with MFCCs, for a four-states ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 3.33    26.67   93.33   41.11   20      30.55
4                 6.67    66.67   100     57.77   53.33   55.55
8                 3.33    80      100     61.11   53.33   57.22
16                3.33    86.67   96.67   62.22   73.33   67.77
32                3.33    96.67   93.33   64.44   56.67   60.55
Fig. 5.7: Four States Ergodic HMM Recognition Performance for Varying Number of Mixture Components of a) Sub, Seg and Supra Levels of LP Residual and b) SRC=Sub+Seg+Supra along with MFCCs.
Table 5.5: Recognition performance (%) of Sub, Seg and Supra levels of the LP residual and their combination (SRC), along with MFCCs, for a five-states ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 3.33    36.67   100     41.11   10      26.67
4                 6.67    66.67   100     57.78   30      45.55
8                 3.33    80      100     61.11   43.33   57.22
16                3.33    83.33   100     62.22   73.33   67.77
32                3.33    96.67   93.33   64.44   60      62.56
Fig. 5.8: Five-States Ergodic HMM Recognition Performance for Varying Number of Mixture Components of a) Sub, Seg and Supra Levels of LP Residual and b) SRC=Sub+Seg+Supra along with MFCCs.
Table 5.6: Average recognition performance (%) of Sub, Seg and Supra levels of the LP residual and their combination (SRC), along with MFCCs, for ergodic HMMs with different numbers of states.

No. of states   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
1               60.47   68.66   99.33   80      91      86.67
2               67.57   73.33   99.33   85      91      90
3               41      77.33   99.33   72.33   72.67   75
4               4       70.67   96.67   57      50      50
5               4       70.67   93.33   56.68   42.33   50
Fig. 5.9: Average Recognition Performance of Ergodic HMM with Different Number of States of a) Sub, Seg and Supra Levels of LP Residual and b) SRC=Sub+Seg+Supra along with MFCCs
Fig. 5.9 and Table 5.6 show the ergodic HMM speaker recognition rate for HMMs with different numbers of states. The system is trained with 8 speech utterances and tested with 2 speech utterances per speaker. As illustrated in Fig. 5.9 and Table 5.6, the average recognition performance was found to be 86.67% for an HMM with a single state, 90% for an HMM with two states and 50% for an HMM with five states. From these results, it is observed that the average speaker recognition performance is highest for the two-state model: the performance increases from one state to two states and then decreases from two states to three and four states, with the four-state and five-state models performing equally. Since the two-state HMM gives the highest average performance, and the three-, four- and five-state models show decreasing average performance (Table 5.6), we stopped increasing the number of states beyond this point, which also reduces the computational complexity of the speaker recognition system. Further, the recognition performance also increases with the number of mixture components. The following section describes the speaker recognition performance of the HE and RP of the LP residual at the subsegmental, segmental and suprasegmental levels.
5.6 SUBSEGMENTAL, SEGMENTAL AND SUPRASEGMENTAL PROCESSING OF HE AND RP OF LP RESIDUAL FOR SPEAKER RECOGNITION USING CONTINUOUS ERGODIC HMM
In the previous section, speaker information was derived by processing the LP residual directly at the subsegmental, segmental and suprasegmental levels. The dominant speaker information present in these three levels of processing mostly represents the amplitude and sequence information of the source. When the LP residual is processed directly, the effect of the amplitude values dominates the sequence information around the instants of glottal closure [114]. Therefore, the amplitude and phase information are separated using the analytic signal representation of the LP residual; the resulting features are called the HE features and the RP features, as explained in Chapter 4.

The performance of these features is given in Tables 5.7-5.12. We observe that the RP features perform better than the HE features.
Table 5.7: Recognition performance (%) of Sub, Seg and Supra levels of the HE of the LP residual and their combination (SRC), along with MFCCs, for a single-state ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 3.33    23.33   100     42.33   80      61
4                 6.67    53.33   100     53.33   95      74.33
8                 10      66.67   100     58.88   95      76.99
16                13.33   93.33   100     68.89   90      80
32                20      100     100     73.33   95      84
Fig. 5.10: Recognition Performance of HE of LP Residual for Single-State Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs.
Table 5.8: Recognition performance (%) of Sub, Seg and Supra levels of the HE of the LP residual and their combination (SRC), along with MFCCs, for a two-states ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 3.33    20      83.33   35.67   80      55
4                 6.67    53.33   96.67   52.33   95      75
8                 3.33    76.67   96.67   58.67   95      77.5
16                16.67   96.67   100     70      90      80
32                40      100     100     80      95      87.5
Fig. 5.11: Recognition Performance of HE of LP Residual for Two-States Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs.
Table 5.9: Recognition performance (%) of Sub, Seg and Supra levels of the HE of the LP residual and their combination (SRC), along with MFCCs, for a three-states ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 3.33    16.67   83.33   33.33   50      47
4                 3.33    50      96.67   50      70      60
8                 10      60      96.67   60      80      70
16                13.33   83.33   100     70      86.67   78
32                26.67   100     100     80      76.67   78
Fig. 5.12: Recognition Performance of HE of LP Residual for three States Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs
Table 5.10: Recognition performance (%) of Sub, Seg and Supra levels of the RP of the LP residual and their combination (SRC), along with MFCCs, for a single-state ergodic HMM.

No. of mixtures   Sub     Seg   Supra   SRC     MFCCs   SRC+MFCCs
2                 100     100   100     100     80      90
4                 100     100   100     100     95      97.5
8                 93.33   100   100     97.67   95      96.67
16                100     100   100     100     90      95
32                86.67   100   100     100     95      97.5
Fig. 5.13: Recognition Performance of RP of LP Residual for Single-State Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs
Table 5.11: Recognition performance (%) of Sub, Seg and Supra levels of the RP of the LP residual and their combination (SRC), along with MFCCs, for a two-states ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 96.67   100     100     99      80      89.5
4                 86.67   100     100     95.56   95      95.33
8                 96.67   100     100     98.89   95      97
16                96.67   96.67   100     96.67   90      93.33
32                100     100     100     100     95      97.5
Fig. 5.14: Recognition Performance of RP of LP Residual for Two-States Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs.
Table 5.12: Recognition performance (%) of Sub, Seg and Supra levels of the RP of the LP residual and their combination (SRC), along with MFCCs, for a three-states ergodic HMM.

No. of mixtures   Sub     Seg     Supra   SRC     MFCCs   SRC+MFCCs
2                 100     100     100     100     50      75
4                 86.67   96.67   100     95.56   70      82.77
8                 93.33   93.33   100     95.44   80      87.5
16                83.33   93.33   100     93.17   86.67   90
32                100     96.67   100     98.33   76.67   90
Fig. 5.15: Recognition Performance of RP of LP Residual for three States Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs
5.7 COMBINING EVIDENCE FROM EACH LEVEL OF HE AND RP OF LP RESIDUAL
The performance of the individual HE and RP features is poor compared to that of the corresponding residual features because, as mentioned earlier, the HE and RP features independently represent two different aspects of the information present in the residual features. The integration of the HE and RP features proves to give better performance than the residual features. Tables 5.13-5.15 and Figs. 5.16-5.18 indicate the robustness of the speaker recognition system using the combination of the HE and RP features.
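At the implementation level, this combination can be done at the score level. Below is a minimal sketch under the assumption of a simple weighted sum of per-model log-likelihood scores from the two streams; the thesis does not specify the fusion weights, so equal weights are assumed here.

    import numpy as np

    def fuse_scores(scores_he, scores_rp, w=0.5):
        """Weighted score-level fusion of HE and RP evidence per speaker model."""
        return w * np.asarray(scores_he) + (1.0 - w) * np.asarray(scores_rp)

    # Toy values: log-likelihoods of one test utterance under two speaker models
    fused = fuse_scores([-120.5, -98.2], [-110.1, -95.7])
    best = int(np.argmax(fused))    # recognized speaker index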
Table 5.13: Recognition performance (%) of Sub, Seg and Supra levels of the combined HE and RP of the LP residual and their combination (SRC), along with MFCCs, for a single-state ergodic HMM.

No. of mixtures   Sub(HE+RP)   Seg(HE+RP)   Supra(HE+RP)   SRC     MFCCs   SRC+MFCCs
2                 51.33        61.33        100            73.33   86.67   76.67
4                 54           78           100            80      95      87.5
8                 55           83.33        100            84      95      90
16                56.67        96.67        100            86.67   90      90
32                54           100          100            90      95      93
Fig. 5.16: Recognition Performance of Combination of HE and RP of LP Residual for Single-State Ergodic HMM of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs.
Table 5.14: Recognition performance (%) of Sub, Seg and Supra levels of the combined HE and RP of the LP residual and their combination (SRC), along with MFCCs, for a two-states ergodic HMM.

No. of mixtures   Sub(HE+RP)   Seg(HE+RP)   Supra(HE+RP)   SRC     MFCCs   SRC+MFCCs
2                 56.67        60           91.33          73.33   80      76.67
4                 50           76.67        98.33          80      95      86.67
8                 50           88.33        99             88.33   95      96.67
16                56.67        95.67        100            95      90      98.67
32                75           100          100            100     95      100
Fig. 5.17: Recognition Performance of Combination of HE and RP of LP Residual for Two-States Ergodic HMM of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs
Table 5.15: Recognition performance (%) of Sub, Seg and Supra levels of the combined HE and RP of the LP residual and their combination (SRC), along with MFCCs, for a three-states ergodic HMM.

No. of mixtures   Sub(HE+RP)   Seg(HE+RP)   Supra(HE+RP)   SRC     MFCCs   SRC+MFCCs
2                 56.67        60           91.33          70      50      66.67
4                 50           76.67        98.33          76.67   70      76.67
8                 50           88.33        99             86.67   80      86.67
16                56.67        95.67        100            95      86.67   98.67
32                75           100          100            100     76.67   100
Fig. 5.18: Recognition Performance of Combination of HE and RP of LP Residual for Three-States Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs.
5.8 COMPARISON STUDY ON SPEAKER RECOGNITION USING ERGODIC HMM FOR LP RESIDUAL, HE, RP AND FUSION OF HE AND RP FEATURES
Table 5.16: Comparison of recognition performance (%) of the proposed HMM-based model on the TIMIT database for the different types of features.

Type of features   Sub     Seg   Supra   SRC=Sub+Seg+Supra   MFCCs   SRC+MFCCs
LP residual        85.33   100   100     95.68               95      95.33
HE                 40      100   100     80                  95      87.5
RP                 100     100   100     100                 95      97.5
HE+RP              75      100   100     100                 95      100

It is observed that with the HMM the RP features (i.e., the sequence information) give 100% recognition, whereas the HE features (i.e., the amplitude information) and the LP residual (i.e., both amplitude and sequence information) give 80% and 95.68% respectively. This is because, in the LP residual, the amplitude information dominates the sequence information. We also find that the HMM exploits the integration of the HE and RP features more efficiently. By comparing Tables 4.12 and 4.13 of Chapter 4 with Table 5.16 of this chapter, we demonstrate that the HMM extracts the sequence information and models intra-speaker variability more effectively than the GMM.
5.9 SUMMARY
Source features of different natures are derived from the LP residual, the HE of the LP residual and the RP of the LP residual at the subsegmental, segmental and suprasegmental levels, and are used to develop speaker recognition systems based on hidden Markov models. The hidden Markov model exploits the sequence information and is powerful in modeling intra-speaker variability. Hence, the performance at the suprasegmental level of the LP residual, the HE of the LP residual and the RP of the LP residual is better than that of the other two levels of features. The scores from each level of the LP residual, the HE of the LP residual and the RP of the LP residual are combined individually to improve the speaker recognition performance. The fusion of the HE of the LP residual and the RP of the LP residual enhances the performance of the speaker recognition system compared with systems using the individual features alone.