Page 1: CHAPTER-5 SUBSEGMENTAL, SEGMENTAL AND ...shodhganga.inflibnet.ac.in/bitstream/10603/30961/14/14...107 CHAPTER-5 SUBSEGMENTAL, SEGMENTAL AND SUPRASEGMENTAL FEATURES FOR SPEAKER RECOGNITION

CHAPTER-5

SUBSEGMENTAL, SEGMENTAL AND SUPRASEGMENTAL FEATURES FOR SPEAKER RECOGNITION USING ERGODIC HIDDEN MARKOV MODEL

In the previous chapter, we discussed source features for speaker recognition using GMM. In this chapter, we illustrate the effectiveness of the HMM in capturing the complete source features extracted at the subsegmental, segmental and suprasegmental levels of the LP residual, the HE of the LP residual, the RP of the LP residual, and the fusion of the HE and RP of the LP residual. The main objective of this chapter is to implement a speaker recognition system using the ergodic HMM. We start with a brief discussion of the HMM. This chapter is organized as follows: Section 5.1 presents the analysis of the HMM. Section 5.2 introduces the extraction of features from subsegmental, segmental and suprasegmental processing of the LP residual signal. Section 5.3 explains the Viterbi algorithm and its application to speaker recognition. Section 5.4 describes the database used for the experimental study. Section 5.5 deals with residual-feature-based speaker recognition using the continuous ergodic HMM. Section 5.6 deals with HE-feature- and RP-feature-based speaker recognition using the continuous ergodic HMM. Section 5.7 demonstrates the improved speaker recognition system obtained by combining the HE features and RP features from each level using the ergodic HMM. Section 5.8 gives a comparative study of speaker recognition for the residual features, the HE of the LP residual, the RP of the LP residual, and their fusion using the ergodic HMM. Section 5.9 summarizes the chapter.


5.1 SPEAKER RECOGNITION USING HIDDEN MARKOV MODEL (HMM)

The hidden Markov model (HMM) is a widely used statistical method for characterizing the temporal properties of the time-varying frames of a pattern. Using an HMM, the parameters of a stochastic process can be estimated in a precise and well-defined manner. HMMs are suitable for modeling speech patterns, as speech can be characterized as a parametric random process. An HMM can absorb durational variations and capture the temporal sequencing among sounds. Hence, HMM-based systems are well suited for speech recognition applications.

5.1.1. Hidden Markov Model (HMM)

HMMs are similar to finite state diagrams, except that the states in an HMM are hidden. Each transition in the state diagram of an HMM has a transition probability associated with it. These transition probabilities are denoted by the matrix A, defined as A = {aij}, where aij = P(it+1 = j | it = i) is the probability of being in state j at time t+1, given that we were in state i at time t. It is assumed that aij is independent of time.

Each state is associated with a set of continuous observations, where each set has a continuous observation probability density. These observation symbol probabilities are denoted by the parameters B, defined as B = {bj(k)}, where bj(k) = P(vk at t | it = j) is the probability of observing the symbol vk given that we are in state j. The initial state probabilities are denoted by the vector π, defined as π = {πi}, where πi = P(i1 = i) is the probability of


being in state i at t = 1. Using the three parameters A, B and π, an HMM can be compactly denoted as λ = (A, B, π).
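The parameter triple λ = (A, B, π) can be written down concretely. The following is a minimal sketch (not from the thesis) of a 3-state HMM with a discrete 2-symbol observation alphabet; all probability values are assumed for illustration:

```python
# Minimal sketch of the HMM parameter triple lambda = (A, B, pi).
# All numeric values are illustrative assumptions, not thesis parameters.

N = 3  # number of hidden states

# A[i][j] = P(state j at t+1 | state i at t); each row sums to 1.
A = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
]

# B[j][k] = P(observing symbol v_k | state j), for a 2-symbol alphabet.
B = [
    [0.7, 0.3],
    [0.4, 0.6],
    [0.1, 0.9],
]

# pi[i] = P(state i at t = 1).
pi = [0.5, 0.3, 0.2]

# Sanity check: every distribution in lambda must sum to 1.
for row in A + B + [pi]:
    assert abs(sum(row) - 1.0) < 1e-9
print("lambda = (A, B, pi) is a valid HMM")
```

In the continuous-density case used in this chapter, each row of B would instead be a Gaussian mixture density over feature vectors.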

There are three fundamental problems associated with HMMs [105]: i) computing the likelihood of an observation sequence given a particular HMM, ii) determining the best HMM state sequence associated with a given observation sequence, and iii) estimating the parameters of an HMM so as to maximize the likelihood of a given observation sequence. Problem (i) is useful in the speaker recognition phase. Here, for the given parameter sequence (observation sequence) derived from the test speech utterance, the likelihood value of each HMM is computed using the forward procedure [88], with one HMM corresponding to one speaker. The speaker associated with the HMM for which the likelihood is maximum is identified as the recognized speaker for the input speech utterance. Problem (iii) is associated with training the HMM for a given speech unit. The parameters of the HMM, λ, are iteratively refined for maximum likelihood estimation using the Baum-Welch algorithm [106]. Parameter estimation for HMMs in which each state is associated with a mixture of multivariate densities has been demonstrated in [103]. The Viterbi algorithm [107, 108] is used for solving problem (ii), as it is computationally efficient. The objective of an HMM-based speaker recognition system is to accurately estimate the parameters of the HMM from a training dataset.
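The forward procedure cited above for problem (i) can be sketched in a few lines. This is a hedged illustration for a discrete-observation toy model (the model values are assumptions, not the thesis's continuous-density HMMs):

```python
# Sketch of the forward procedure for computing P(O | lambda),
# used in problem (i). Toy discrete-observation model; values assumed.

def forward(A, B, pi, obs):
    """Return P(obs | lambda) for a discrete-observation HMM."""
    N = len(pi)
    # Initialization: alpha_i(1) = pi_i * b_i(o_1)
    alpha = [pi[i] * B[i][obs[0]] for i in range(N)]
    # Induction: alpha_j(t+1) = (sum_i alpha_i(t) * a_ij) * b_j(o_{t+1})
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(N)) * B[j][o]
                 for j in range(N)]
    # Termination: P(O | lambda) = sum_i alpha_i(T)
    return sum(alpha)

A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
likelihood = forward(A, B, pi, [0, 1, 0])
print(likelihood)
```

In a recognizer, this likelihood would be computed once per speaker model, and the speaker with the maximum value would be selected.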


5.1.2. Left-Right HMM

HMMs are characterized by their transition matrix A = {aij}. The defining property of a left-right model is aij = 0 for j < i, i.e., no transition is allowed to a state whose index is less than that of the current state, as shown in Fig. 5.1. Further, the initial state probabilities exhibit the property

πi = 1 for i = 1, and πi = 0 for i ≠ 1.

For a three-state left-right model the state transition matrix is given as

A = {aij} =
| a11  a12  a13 |
|  0   a22  a23 |
|  0    0   a33 |

Continuous HMMs can capture speaker-specific features more effectively than discrete HMMs [109]. Continuous left-right HMM based speaker recognition systems can capture only the underlying pattern in the temporal sequence of sounds [110], which gives good recognition performance for text-dependent speaker recognition systems. For text-independent speaker recognition systems, however, where the time-varying text information is completely absent, the continuous left-right HMM may not give good speaker recognition performance.


Fig. 5.1: A Three State Continuous Left-Right HMM.

5.1.3. Continuous Ergodic HMM: The Desirable Statistical Model for Speaker Recognition

The structure of an ergodic model is defined by its transition matrix A = {aij}. An ergodic model is also called a "fully connected HMM", as shown in Fig. 5.2, since each state can be reached from every other state of the model. The defining property of an ergodic HMM is 0 < aij < 1 for all i, j. The state transition matrix of a three-state ergodic model is given by


A = {aij} =
| a11  a12  a13 |
| a21  a22  a23 |
| a31  a32  a33 |

The structural property of the continuous ergodic HMM is such that it captures not only the underlying patterns in the temporal sequencing of sound units but also the patterns that are non-temporal in nature. Hence, to capture both categories of underlying patterns, the continuous ergodic HMM is used in this thesis for text-independent speaker recognition.
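The two transition-matrix structures can be contrasted directly. The sketch below builds one matrix of each kind (all numeric values are assumptions) and checks the defining properties aij = 0 for j < i (left-right) and 0 < aij < 1 for all i, j (ergodic):

```python
# Illustrative sketch contrasting left-right and ergodic (fully
# connected) transition matrices for 3-state HMMs; values are assumed.

# Left-right: a_ij = 0 for j < i (no backward transitions).
A_left_right = [
    [0.5, 0.3, 0.2],
    [0.0, 0.6, 0.4],
    [0.0, 0.0, 1.0],
]

# Ergodic: 0 < a_ij < 1 for all i, j, so every state is reachable
# from every other state in a single step.
A_ergodic = [
    [0.4, 0.3, 0.3],
    [0.2, 0.5, 0.3],
    [0.3, 0.3, 0.4],
]

def is_left_right(A):
    # All entries below the main diagonal must be zero.
    return all(A[i][j] == 0.0 for i in range(len(A)) for j in range(i))

def is_ergodic(A):
    # Every transition probability strictly between 0 and 1.
    return all(0.0 < a < 1.0 for row in A for a in row)

print(is_left_right(A_left_right), is_ergodic(A_left_right))  # True False
print(is_left_right(A_ergodic), is_ergodic(A_ergodic))        # False True
```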

Fig. 5.2: A Three State Continuous Ergodic HMM.

5.2 EXTRACTION OF FEATURES AT THREE LEVELS

The speech signal of a given speaker is collected, sampled at 16 kHz, and resampled to 8 kHz. A frame size of 5 ms and a frame shift of 2.5 ms are used to compute the subsegmental level of the LP residual. Segmental and suprasegmental processing of the LP residual, the HE of the LP residual, and the RP of the LP residual are carried out similarly, as explained in Chapter 4.
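The frame bookkeeping implied above can be made concrete. A hedged sketch for the subsegmental level (8 kHz signal, 5 ms frames, 2.5 ms shift; the signal and variable names are ours, for illustration only):

```python
# Frame bookkeeping sketch for subsegmental-level processing:
# 8 kHz signal, 5 ms frame size, 2.5 ms frame shift. Synthetic signal.

fs = 8000                       # sampling rate after resampling (Hz)
frame_size = int(0.005 * fs)    # 5 ms  -> 40 samples per frame
frame_shift = int(0.0025 * fs)  # 2.5 ms -> 20 samples between frames

signal = [0.0] * 8000           # one second of dummy samples

# Overlapping frames: each starts frame_shift samples after the last.
frames = [signal[start:start + frame_size]
          for start in range(0, len(signal) - frame_size + 1, frame_shift)]

print(frame_size, frame_shift, len(frames))
```

With a 50% overlap (shift equal to half the frame size), one second of speech yields 399 full frames here.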


5.3 VITERBI ALGORITHM AND ITS APPLICATION TO SPEAKER RECOGNITION

The Viterbi algorithm is used in the speaker recognition task, where one HMM is trained for each speaker. An observation sequence is derived from a speech utterance, and the Viterbi algorithm finds the most likely state sequence in a given HMM, together with the likelihood value associated with that state sequence [30, 111, 105, 109]. A set of HMMs is trained for a predetermined set of speakers. During the recognition phase, the Viterbi algorithm can be used to determine, from this set of HMMs, the one that best matches a given input observation sequence. This application of the Viterbi algorithm is demonstrated in Fig. 5.3, which shows a recognition system with three ergodic HMMs. The optimal state sequence for each HMM is denoted by a thick line. The likelihood value associated with each optimal state sequence is computed, and the HMM corresponding to the maximum likelihood is identified.

The key observation in the Viterbi algorithm is that, for any state at time t, there is only one most likely path to that state. Therefore, if several paths converge to a particular state at time t, then instead of recalculating all of them when computing the transitions from this state to states at time t+1, one can discard the less likely paths and use only the most likely one in the calculations. When this is applied at each time step, the number of calculations is reduced to T·N², which is far less than the roughly N^T computations of the brute-force method. The steps involved in the Viterbi algorithm are presented in Section 5.3.1.


Fig. 5.3: Finding the Optimal State Sequence in Ergodic HMM based Speaker Recognition System.

5.3.1. Viterbi Algorithm

In the state sequence estimation problem, a set of T observations, O = {o1 o2 … oT}, and an N-state HMM, λ, are given. The goal is to estimate the state sequence S = {s(1), s(2), …, s(T)} which maximizes the likelihood L(O | S, λ). The most likely state sequence can be determined using dynamic programming [104, 105]. Let φj(t) represent the probability of the most likely state sequence for observing vectors o1 through ot while being in state j at time t, and let Bj(t) represent the state which gives this probability. Then φj(t) and Bj(t) can be expressed as


φj(t) = maxi {φi(t-1) aij} bj(ot)        (5.1)

Bj(t) = arg maxi {φi(t-1) aij}           (5.2)

with the initial conditions

φ1(1) = 1                                (5.3)

B1(1) = 0                                (5.4)

φj(1) = a1j bj(o1) for 1 < j ≤ N         (5.5)

Bj(1) = 1                                (5.6)

In Eq. 5.1, the probability φj(t) is computed using a recursive relation. Using Bj(t), and assuming that the model must end in the final state at time T (s(T) = N), the sequence of states along the maximum likelihood path can be recovered recursively using the equation

s(t-1) = Bs(t)(t).                       (5.7)

In other words, starting with s(T) known, Eq. 5.7 gives the maximum likelihood state at time T-1 (i.e., s(T-1) = Bs(T)(T) = BN(T)).
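The recursion of Eqs. 5.1-5.7 can be sketched in code. The version below is a hedged simplification: it works in log probabilities for numerical stability, takes the maximum over all final states rather than forcing s(T) = N, and uses a toy discrete-observation model whose values are assumptions:

```python
# Sketch of the Viterbi recursion (Eqs. 5.1-5.2) with backtracking
# (Eq. 5.7), in log space. Toy model; final state is not constrained.

import math

def viterbi(A, B, pi, obs):
    """Return (most likely state sequence, its log-likelihood)."""
    N, T = len(pi), len(obs)
    # Initialization: phi_j(1) = pi_j * b_j(o_1), in log domain.
    phi = [math.log(pi[j]) + math.log(B[j][obs[0]]) for j in range(N)]
    back = []  # back[t-1][j] plays the role of B_j(t) in Eq. 5.2
    for t in range(1, T):
        prev = phi
        back.append([max(range(N), key=lambda i: prev[i] + math.log(A[i][j]))
                     for j in range(N)])
        # Eq. 5.1: phi_j(t) = max_i {phi_i(t-1) a_ij} * b_j(o_t)
        phi = [prev[back[-1][j]] + math.log(A[back[-1][j]][j])
               + math.log(B[j][obs[t]]) for j in range(N)]
    # Backtracking, Eq. 5.7: s(t-1) = B_{s(t)}(t).
    last = max(range(N), key=lambda j: phi[j])
    path = [last]
    for bt in reversed(back):
        path.append(bt[path[-1]])
    path.reverse()
    return path, phi[last]

A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
path, score = viterbi(A, B, pi, [0, 1, 1, 0])
print(path)
```

Each time step costs N² comparisons, giving the T·N² total noted in Section 5.3.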

5.4 DATABASE USED FOR EXPERIMENTAL STUDY

As mentioned in Section 4.3.1, the TIMIT corpus of read speech has been used to evaluate the speaker recognition system. We have considered 38 speakers with different training and testing utterances. Throughout this study, closed-set identification experiments are carried out to demonstrate the feasibility of capturing speaker-specific information from the LP residual and the HE and RP of the LP residual at the subsegmental, segmental and suprasegmental levels. The following sections present the speaker recognition performance using the ergodic HMM.


5.5 PROCESSING OF LP RESIDUAL FOR SPEAKER RECOGNITION USING CONTINUOUS ERGODIC HMM AT THREE LEVELS

The system has been implemented in MATLAB 7 on a Windows XP platform. An LP order of 12 is used for all experiments. The HMMs are trained with 2, 4, 8, 16 and 32 Gaussian components per state on the training speech utterances and evaluated on the testing speech utterances. The steps involved in the proposed algorithm for a text-independent speaker recognition system using subsegmental, segmental and suprasegmental features of the LP residual are as follows:

Training Phase for subsegmental level:

For each speaker Pj from speaker list N do

For each speech signal Si of speaker Pj

Preprocess of speech Si

Compute Ŝi using LP approximation

Compute LP residual

ei = Si – Ŝi

for each sample of ei from K samples do

Extract subsegmental features fk from ei at subsegmental level

end

end

Initialize HMM model parameters λj = (A,B,π)


Train λj for optimal solution using EM algorithm

end

Testing Phase for Subsegmental level

For each speaker Pj from speaker list N do

For each speech signal Si of speaker Pj

Preprocess of speech Si

Compute Ŝi using LP approximation

Compute LP residual

ei = Si – Ŝi

for each sample of ei from K samples do

Extract subsegmental features fk from ei at subsegmental level

end

end

for each model λ1 λ2….. λN do

Using the Viterbi decoding process calculate P (O| λj), where P(O|λj)

is the probability of the observation sequence O(o1o2……oT)

end

Calculate the 1-best result for the given test speech signal as arg maxj P(O | λj)


end

Training Phase for segmental level:

For each speaker Pj from speaker list N do

For each speech signal Si of speaker Pj

Preprocess of speech Si

Compute Ŝi using LP approximation

Compute LP residual

ei = Si – Ŝi

for each sample of ei from K samples do

Extract segmental features fk from ei at segmental level

end

end

Initialize HMM model parameters λj = (A, B, π)

Train λj for optimal solution using EM algorithm

end

Testing Phase for Segmental level:

For each speaker Pj from speaker list N do

For each speech signal Si of speaker Pj

Preprocess of speech Si


Compute Ŝi using LP approximation

Compute LP residual

ei = Si – Ŝi

for each sample of ei from K samples do

Extract segmental features fk from ei at segmental level

end

end

for each model λ1 λ2….. λN do

Using the Viterbi decoding process calculate P (O| λj), where P(O| λj)

is the probability of the observation sequence O (o1o2……oT)

end

Calculate the 1-best result for the given test speech signal as arg maxj P(O | λj)

end

Training Phase for Suprasegmental level:

For each speaker Pj from speaker list N do

For each speech signal Si of speaker Pj

Preprocess of speech Si

Compute Ŝi using LP approximation


Compute LP residual

ei = Si – Ŝi

for each sample of ei from K samples do

Extract suprasegmental features fk from ei at suprasegmental level

end

end

Initialize HMM model parameters λj = (A,B,π)

Train λj for optimal solution using EM algorithm

end

Testing Phase for Suprasegmental level:

For each speaker Pj from speaker list N do

For each speech signal Si of speaker Pj

Preprocess of speech Si

Compute Ŝi using LP approximation

Compute LP residual

ei = Si – Ŝi

for each sample of ei from K samples do

Extract suprasegmental features fk from ei at suprasegmental level

end


end

for each model λ1 λ2….. λN do

Using the Viterbi decoding process calculate P(O| λj), where P(O| λj)

is the probability of the observation sequence O(o1o2……oT)

end

Calculate the 1-best result for the given test speech signal as arg maxj P(O | λj)

end
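The testing-phase decision rule above — score the test utterance against every speaker's model and take the 1-best — can be sketched as follows. The per-speaker scoring functions here are toy stand-ins with assumed names; in the thesis these scores are the Viterbi log-likelihoods P(O | λj) of the per-speaker ergodic HMMs:

```python
# Hedged sketch of the 1-best decision rule: pick the speaker whose
# model gives the highest score for the test features.

def recognize(test_features, models):
    """Return the speaker label whose model scores highest."""
    return max(models, key=lambda spk: models[spk](test_features))

# Toy stand-in scorers (assumed, for illustration only): each "model"
# just measures closeness of the mean feature value to a centroid.
def make_scorer(centroid):
    return lambda feats: -abs(sum(feats) / len(feats) - centroid)

models = {"spk1": make_scorer(0.2), "spk2": make_scorer(0.8)}
print(recognize([0.75, 0.9, 0.8], models))  # closest to spk2's centroid
```

Swapping the stand-in scorers for true HMM log-likelihood functions leaves the arg-max decision rule unchanged.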

The speaker recognition rate is defined as the ratio of the number of speakers recognized correctly to the total number of speakers tested. We have calculated the speaker recognition rate for various model parameters, i.e., for different numbers of Gaussian mixtures and different numbers of HMM states. The recognition performance at the subsegmental, segmental and suprasegmental levels of the LP residual is tabulated, and the corresponding charts for the different model parameters are shown in Figs. 5.4 to 5.10 and Tables 5.1 to 5.5.
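As a minimal illustration of the recognition-rate definition above (the counts here are made up, not results from the tables):

```python
# Recognition rate = correctly recognized / total tested, as a percentage.
# Counts below are illustrative assumptions only.

recognized = 34       # test trials identified correctly
total_tested = 38     # total test trials

recognition_rate = 100.0 * recognized / total_tested
print(round(recognition_rate, 2))
```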

Fig. 5.4 and Table 5.1 show the single-state ergodic HMM speaker recognition performance for different numbers of Gaussian mixture components. The system is trained with 8 speech utterances and tested with 2 speech utterances per speaker. As shown in Fig. 5.4 and Table 5.1, for a model with 32 Gaussian components, the recognition performance at the subsegmental, segmental and suprasegmental levels of the LP residual was found to be 85.33%, 100% and 100%, respectively, while the combined feature scores of the three levels (the complete source or residual features) together with MFCCs gave 95.33%. From this illustration, it is observed that as the number of mixture components increases, the speaker recognition rate also increases. The performance of the speaker recognition system is given as a percentage (%) in all the tables of this chapter.

Table 5.1: Recognition performance of subsegmental (Sub), segmental (Seg) and suprasegmental (Supra) levels and their combination of LP residual along with MFCCs for single-state ergodic HMM.

No. of mixtures | Sub (%) | Seg (%) | Supra (%) | SRC=Sub+Seg+Supra (%) | MFCCs (%) | SRC+MFCCs (%)
 2 |  6.67 |  23.33 | 100    | 55.18 | 80 | 65.5
 4 | 63.33 |  36.67 | 100    | 75.55 | 95 | 83
 8 | 70    |  83.33 | 100    | 84.68 | 95 | 90
16 | 80    | 100    |  96.67 | 92.11 | 90 | 92
32 | 85.33 | 100    | 100    | 95.67 | 95 | 95.33


Fig. 5.4: Single State Ergodic HMM Recognition Performance for Varying Number of Mixture Components of a) Sub, Seg and Supra Levels of LP Residual and b) SRC=Sub+Seg+supra along with MFCCs.


Table 5.2: Recognition performance of Sub, Seg and Supra levels of LP residual and their combination along with MFCCs for two-states ergodic HMM.

No. of mixtures | Sub (%) | Seg (%) | Supra (%) | SRC=Sub+Seg+Supra (%) | MFCCs (%) | SRC+MFCCs (%)
 2 | 42.22 |  23.33 | 100    | 55.18 | 80 | 65.5
 4 | 63.33 |  63.33 | 100    | 75.55 | 95 | 83
 8 | 70    |  83.33 | 100    | 84.67 | 95 | 89.88
16 | 80    |  96.33 |  96.67 | 92.11 | 90 | 91.11
32 | 85.33 | 100    | 100    | 95.67 | 95 | 95.33


Fig. 5.5: Two-States Ergodic HMM Recognition Performance for Varying Number of Mixture Components of a) Sub, Seg and Supra Levels of LP Residual and b) SRC=Sub+Seg+Supra along with MFCCs.


Table 5.3: Recognition performance of Sub, Seg and Supra levels of LP residual and their combination along with MFCCs for three-states ergodic HMM.

No. of mixtures | Sub (%) | Seg (%) | Supra (%) | SRC=Sub+Seg+Supra (%) | MFCCs (%) | SRC+MFCCs (%)
 2 | 16.67 |  36.67 | 100    | 51.11 | 50    | 50.5
 4 |  3.33 |  63.33 | 100    | 55.56 | 70    | 62.77
 8 | 20    |  93.33 | 100    | 71.11 | 80    | 75.56
16 | 26.33 |  96.33 |  96.67 | 64.11 | 86.67 | 70.39
32 | 56.67 | 100    | 100    | 85.22 | 76.67 | 83.33


Fig. 5.6: Three States Ergodic HMM Recognition Performance for Varying Number of Mixture Components of a) Sub, Seg and Supra Levels of LP Residual and b) SRC=Sub+Seg+Supra along with MFCCs.


Table 5.4: Recognition performance of Sub, Seg and Supra levels of LP residual and their combination along with MFCCs for four-states ergodic HMM.

No. of mixtures | Sub (%) | Seg (%) | Supra (%) | SRC=Sub+Seg+Supra (%) | MFCCs (%) | SRC+MFCCs (%)
 2 | 3.33 | 26.67 |  93.33 | 41.11 | 20    | 30.55
 4 | 6.67 | 66.67 | 100    | 57.77 | 53.33 | 55.55
 8 | 3.33 | 80    | 100    | 61.11 | 53.33 | 57.22
16 | 3.33 | 86.67 |  96.67 | 62.22 | 73.33 | 67.77
32 | 3.33 | 96.67 |  93.33 | 64.44 | 56.67 | 60.55


Fig. 5.7: Four States Ergodic HMM Recognition Performance for Varying Number of Mixture Components of a) Sub, Seg and Supra Levels of LP Residual and b) SRC=Sub+Seg+Supra along with MFCCs.


Table 5.5: Recognition performance of Sub, Seg and Supra levels of LP residual and their combination along with MFCCs for five-states ergodic HMM.

No. of mixtures | Sub (%) | Seg (%) | Supra (%) | SRC=Sub+Seg+Supra (%) | MFCCs (%) | SRC+MFCCs (%)
 2 | 3.33 | 36.67 | 100    | 41.11 | 10    | 26.67
 4 | 6.67 | 66.67 | 100    | 57.78 | 30    | 45.55
 8 | 3.33 | 80    | 100    | 61.11 | 43.33 | 57.22
16 | 3.33 | 83.33 | 100    | 62.22 | 73.33 | 67.77
32 | 3.33 | 96.67 |  93.33 | 64.44 | 60    | 62.56


Fig. 5.8: Five States Ergodic HMM Recognition Performance for Varying Number of Mixture Components of a) Sub, Seg and Supra Levels of LP Residual and b) SRC=Sub+Seg+Supra along with MFCCs.


Table 5.6: Average recognition performance of Sub, Seg and Supra levels of LP residual for ergodic HMM with different number of states.

No. of states | Sub (%) | Seg (%) | Supra (%) | SRC=Sub+Seg+Supra (%) | MFCCs (%) | SRC+MFCCs (%)
1 | 60.47 | 68.66 | 99.33 | 80    | 91    | 86.67
2 | 67.57 | 73.33 | 99.33 | 85    | 91    | 90
3 | 41    | 77.33 | 99.33 | 72.33 | 72.67 | 75
4 |  4    | 70.67 | 96.67 | 57    | 50    | 50
5 |  4    | 70.67 | 93.33 | 56.68 | 42.33 | 50


Fig. 5.9: Average Recognition Performance of Ergodic HMM with Different Number of States of a) Sub, Seg and Supra Levels of LP Residual and b) SRC=Sub+Seg+Supra along with MFCCs


Fig. 5.9 and Table 5.6 show the ergodic HMM speaker recognition rate for HMMs with different numbers of states. The system is trained with 8 speech utterances and tested with 2 speech utterances per speaker. As illustrated in Fig. 5.9 and Table 5.6, the average recognition performance was found to be 86.67% for an HMM with a single state, 90% for an HMM with two states, and 50% for an HMM with five states. From these results, it is observed that the average speaker recognition rate is highest for two states: it increases from one state to two states, decreases from two states to three and then four states, and is equal at four and five states. Since the two-state HMM gives the highest average speaker recognition rate, as shown in Table 5.6, we stopped computing the speaker recognition performance beyond two states, which also reduces the computational complexity of the speaker recognition system. Further, the recognition performance increases with the number of mixture components. The following section describes the speaker recognition performance of the HE and RP of the LP residual at the subsegmental, segmental and suprasegmental levels.


5.6 SUBSEGMENTAL, SEGMENTAL AND SUPRASEGMENTAL PROCESSING OF HE AND RP OF LP RESIDUAL FOR SPEAKER RECOGNITION USING CONTINUOUS ERGODIC HMM

In the previous section, speaker information was derived by direct processing of the LP residual at the subsegmental, segmental and suprasegmental levels. The dominant speaker information present in these three levels of processing mostly represents the amplitude and sequence information of the source. When the LP residual is processed directly, the effect of the amplitude values dominates the sequence information around the instants of glottal closure [114]. Therefore, the amplitude and phase information are separated using the analytic signal representation of the LP residual; the resulting features are called HE features and RP features, as explained in Chapter 4.

The performance of these features is given in Tables 5.7 to 5.12. We observe that the RP features perform better than the HE features.


Table 5.7: Recognition performance of Sub, Seg and Supra levels of HE of LP residual for single-state ergodic HMM.

No. of mixtures | Sub (%) | Seg (%) | Supra (%) | SRC=Sub+Seg+Supra (%) | MFCCs (%) | SRC+MFCCs (%)
 2 |  3.33 |  23.33 | 100 | 42.33 | 80 | 61
 4 |  6.67 |  53.33 | 100 | 53.33 | 95 | 74.33
 8 | 10    |  66.67 | 100 | 58.88 | 95 | 76.99
16 | 13.33 |  93.33 | 100 | 68.89 | 90 | 80
32 | 20    | 100    | 100 | 73.33 | 95 | 84


Fig. 5.10: Recognition Performance of HE of LP Residual for Single-State Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs.


Table 5.8: Recognition performance of Sub, Seg and Supra levels of HE of LP residual for two-states ergodic HMM.

No. of mixtures | Sub (%) | Seg (%) | Supra (%) | SRC=Sub+Seg+Supra (%) | MFCCs (%) | SRC+MFCCs (%)
 2 |  3.33 |  20    |  83.33 | 35.67 | 80 | 55
 4 |  6.67 |  53.33 |  96.67 | 52.33 | 95 | 75
 8 |  3.33 |  76.67 |  96.67 | 58.67 | 95 | 77.5
16 | 16.67 |  96.67 | 100    | 70    | 90 | 80
32 | 40    | 100    | 100    | 80    | 95 | 87.5


Fig. 5.11: Recognition Performance of HE of LP Residual for Two-States Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs.


Table 5.9: Recognition performance of Sub, Seg and Supra levels of HE of LP residual for three-states ergodic HMM.

No. of mixtures | Sub (%) | Seg (%) | Supra (%) | SRC=Sub+Seg+Supra (%) | MFCCs (%) | SRC+MFCCs (%)
 2 |  3.33 |  16.67 |  83.33 | 33.33 | 50    | 47
 4 |  3.33 |  50    |  96.67 | 50    | 70    | 60
 8 | 10    |  60    |  96.67 | 60    | 80    | 70
16 | 13.33 |  83.33 | 100    | 70    | 86.67 | 78
32 | 26.67 | 100    | 100    | 80    | 76.67 | 78


Fig. 5.12: Recognition Performance of HE of LP Residual for three States Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs


Table 5.10: Recognition performance of Sub, Seg and Supra levels of RP of LP residual for single-state ergodic HMM.

No. of mixtures | Sub (%) | Seg (%) | Supra (%) | SRC=Sub+Seg+Supra (%) | MFCCs (%) | SRC+MFCCs (%)
2               | 100     | 100     | 100       | 100                   | 80        | 90
4               | 100     | 100     | 100       | 100                   | 95        | 97.5
8               | 93.33   | 100     | 100       | 97.67                 | 95        | 96.67
16              | 100     | 100     | 100       | 100                   | 90        | 95
32              | 86.67   | 100     | 100       | 100                   | 95        | 97.5


Fig. 5.13: Recognition Performance of RP of LP Residual for Single-State Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs


Table 5.11: Recognition performance of Sub, Seg and Supra levels of RP of LP residual for two-state ergodic HMM.

No. of mixtures | Sub (%) | Seg (%) | Supra (%) | SRC=Sub+Seg+Supra (%) | MFCCs (%) | SRC+MFCCs (%)
2               | 96.67   | 100     | 100       | 99                    | 80        | 89.5
4               | 86.67   | 100     | 100       | 95.56                 | 95        | 95.33
8               | 96.67   | 100     | 100       | 98.89                 | 95        | 97
16              | 96.67   | 96.67   | 100       | 96.67                 | 90        | 93.33
32              | 100     | 100     | 100       | 100                   | 95        | 97.5


Fig. 5.14: Recognition Performance of RP of LP Residual for Two-State Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs


Table 5.12: Recognition performance of Sub, Seg and Supra levels of RP of LP residual for three-state ergodic HMM.

No. of mixtures | Sub (%) | Seg (%) | Supra (%) | SRC=Sub+Seg+Supra (%) | MFCCs (%) | SRC+MFCCs (%)
2               | 100     | 100     | 100       | 100                   | 50        | 75
4               | 86.67   | 96.67   | 100       | 95.56                 | 70        | 82.77
8               | 93.33   | 93.33   | 100       | 95.44                 | 80        | 87.5
16              | 83.33   | 93.33   | 100       | 93.17                 | 86.67     | 90
32              | 100     | 96.67   | 100       | 98.33                 | 76.67     | 90


Fig. 5.15: Recognition Performance of RP of LP Residual for three States Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs


5.7 COMBINING EVIDENCE FROM EACH LEVEL OF HE AND RP OF LP RESIDUAL

The performance of the individual HE and RP features is poor compared to that of the corresponding residual features because, as mentioned earlier, HE and RP features independently represent two different aspects of the information present in the residual. Combining the HE and RP features, however, yields better performance than the residual features alone. Tables 5.13-5.15 and Figs. 5.16-5.18 demonstrate the robustness of the speaker recognition system using the combination of HE and RP features.
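The combination used here is a score-level fusion: each test utterance is scored against a speaker's HE-based model and RP-based model separately, and the two sets of scores are merged before the identification decision. A minimal sketch of that step is given below; the min-max normalisation, equal weighting and function names are illustrative assumptions, not the exact procedure of this chapter.

```python
import numpy as np

def fuse_scores(he_scores, rp_scores, w=0.5):
    """Score-level fusion of HE- and RP-based model scores.

    he_scores, rp_scores: per-speaker log-likelihoods for one test
    utterance. w weights the HE evidence. Scores are min-max
    normalised before combining so that neither feature stream
    dominates purely by scale. Returns the identified speaker index.
    """
    he = np.asarray(he_scores, dtype=float)
    rp = np.asarray(rp_scores, dtype=float)

    def norm(s):
        # Map scores to [0, 1]; a constant score vector maps to zeros.
        rng = s.max() - s.min()
        return (s - s.min()) / rng if rng > 0 else np.zeros_like(s)

    fused = w * norm(he) + (1.0 - w) * norm(rp)
    return int(np.argmax(fused))
```

With equal weights, a speaker who ranks highest under both streams is always selected, which mirrors the observation that the fused system outperforms either stream alone.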


Table 5.13: Recognition performance of Sub, Seg and Supra levels of HE and RP of LP residual for single-state ergodic HMM.

No. of mixtures | Sub of HE and RP (%) | Seg of HE and RP (%) | Supra of HE and RP (%) | SRC=Sub+Seg+Supra (%) | MFCCs (%) | SRC+MFCCs (%)
2               | 51.33                | 61.33                | 100                    | 73.33                 | 86.67     | 76.67
4               | 54                   | 78                   | 100                    | 80                    | 95        | 87.5
8               | 55                   | 83.33                | 100                    | 84                    | 95        | 90
16              | 56.67                | 96.67                | 100                    | 86.67                 | 90        | 90
32              | 54                   | 100                  | 100                    | 90                    | 95        | 93


Fig. 5.16: Recognition Performance of Combination of HE and RP of LP Residual for Single-State Ergodic HMM of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs.


Table 5.14: Recognition performance of Sub, Seg and Supra levels of HE and RP of LP residual for two-state ergodic HMM.

No. of mixtures | Sub of HE and RP (%) | Seg of HE and RP (%) | Supra of HE and RP (%) | SRC=Sub+Seg+Supra (%) | MFCCs (%) | SRC+MFCCs (%)
2               | 56.67                | 60                   | 91.33                  | 73.33                 | 80        | 76.67
4               | 50                   | 76.67                | 98.33                  | 80                    | 95        | 86.67
8               | 50                   | 88.33                | 99                     | 88.33                 | 95        | 96.67
16              | 56.67                | 95.67                | 100                    | 95                    | 90        | 98.67
32              | 75                   | 100                  | 100                    | 100                   | 95        | 100


Fig. 5.17: Recognition Performance of Combination of HE and RP of LP Residual for Two-States Ergodic HMM of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs


Table 5.15: Recognition performance of Sub, Seg and Supra levels of HE and RP of LP residual for three-state ergodic HMM.

No. of mixtures | Sub of HE and RP (%) | Seg of HE and RP (%) | Supra of HE and RP (%) | SRC=Sub+Seg+Supra (%) | MFCCs (%) | SRC+MFCCs (%)
2               | 56.67                | 60                   | 91.33                  | 70                    | 50        | 66.67
4               | 50                   | 76.67                | 98.33                  | 76.67                 | 70        | 76.67
8               | 50                   | 88.33                | 99                     | 86.67                 | 80        | 86.67
16              | 56.67                | 95.67                | 100                    | 95                    | 86.67     | 98.67
32              | 75                   | 100                  | 100                    | 100                   | 76.67     | 100


Fig. 5.18: Recognition Performance of Combination of HE and RP of LP Residual for Three-States Ergodic HMMs of a) Sub, Seg and Supra Levels and b) SRC=Sub+Seg+Supra along with MFCCs.


5.8 COMPARISON STUDY ON SPEAKER RECOGNITION USING ERGODIC HMM FOR LP RESIDUAL, HE, RP AND FUSION OF HE AND RP FEATURES

Table 5.16: Comparison to other recent speaker models (proposed model: ergodic HMM; database: TIMIT).

Type of features | Sub (%) | Seg (%) | Supra (%) | SRC=Sub+Seg+Supra (%) | MFCCs (%) | SRC+MFCCs (%)
LP residual      | 85.33   | 100     | 100       | 95.68                 | 95        | 95.33
HE               | 40      | 100     | 100       | 80                    | 95        | 87.5
RP               | 100     | 100     | 100       | 100                   | 95        | 97.5
HE+RP            | 75      | 100     | 100       | 100                   | 95        | 100

It has been observed that with the HMM, the RP features, i.e., the sequence information, give cent percent (100%) recognition, whereas the HE features, i.e., the amplitude information, and the LP residual, i.e., both amplitude and sequence information, give 80% and 95.68%, respectively. This is because, in the LP residual, the amplitude information dominates the sequence information. We also observe that the HMM extracts the combined HE and RP information more efficiently.

By comparing Tables 4.12 and 4.13 of Chapter 4 with Table 5.16 of this chapter, we demonstrate that the HMM captures the sequence information and intra-speaker variability more effectively than the GMM.


5.9 SUMMARY

Source features of different natures are derived from the LP residual, the HE of the LP residual and the RP of the LP residual at the subsegmental, segmental and suprasegmental levels, and are used to develop speaker recognition systems based on hidden Markov models. The HMM exploits the sequence information and is powerful in modeling intra-speaker variability; hence the performance at the suprasegmental level of the LP residual, HE of the LP residual and RP of the LP residual is better than that of the other two levels. The scores from each level of the LP residual, HE of the LP residual and RP of the LP residual are combined individually to improve the recognition performance. Finally, the fusion of the HE and RP of the LP residual enhances the performance of the speaker recognition system compared with systems using the individual features alone.

