A Study of Models and a Priori Threshold Updating in Speaker
Verification
Tomoko Matsui
NTT Human Interface Laboratories, Yokosuka, Japan 239-0847
Takashi Nishitani and Sadaoki Furui
Tokyo Institute of Technology, Tokyo, Japan 152-0033
SUMMARY
This article explores a method for speaker verification that appropriately sets the threshold value used to judge the identities of individual speakers, and also considers methods to update the speaker models so as to make them more robust with respect to utterance variations, using a small amount of recently uttered data for updating. Each speaker model is represented by a hidden Markov model (HMM). In model updating, the parameters of the speaker HMM are estimated from the data for updating and the current parameter values. For threshold setting, the initial threshold for each speaker is set to a value that gives an FA rate higher than the equal error rate (the rate at which the false rejection rate and the false acceptance rate are equal), calculated from the data for updating; the threshold then steadily approaches the value that gives the equal error rate as updating of the speaker HMM proceeds. The results of evaluating this method in text-independent and text-prompted speaker verification experiments with twenty speakers show that the average error rate fell to roughly 40% for the text-independent type and to roughly 80% for the text-prompted type, compared to when the model and threshold value are not updated. © 1999 Scripta Technica, Syst Comp Jpn, 30(13): 96–105, 1999
Key words: Speaker verification; model updating;
threshold value; hidden Markov model; text-independent
type; text-prompted type.
1. Introduction
In speaker verification, the degree of similarity (the
likelihood value) between an input speech and a speaker
model is calculated and compared with a particular thresh-
old to identify the speaker. In real systems, the initial model
for each speaker is often created from a small amount of
data uttered in one session, with consideration for the
burden on each speaker to produce speech data. In order to
obtain high speaker verification performance, past data
should be accumulated for each speaker, and a model of
the speaker should be re-created using data that includes
various utterance variations (primarily generated from dif-
ferences in the contents of utterances and the session-
to-session utterance variations).
Furui proposed a method to revise speaker templates
using templates generated from combining the most recent
multiple utterances in speaker verification using template
matching [1]. In this method, utterances from the distant
past are not used, and instead the most recent multiple
utterances are used because the voice changes over the
course of time. In speaker verification using the hidden
CCC0882-1666/99/130096-10
© 1999 Scripta Technica
Systems and Computers in Japan, Vol. 30, No. 13, 1999
Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J81-D-II, No. 2, February 1998, pp. 268–276
Markov model (HMM), although there are few research
examples related to revisions of a speaker HMM, recently
Setlur and Jacobs [2] reported on a method to re-create
speaker HMM while making the model structure more
complex so as to better capture utterance variations corre-
sponding to the accumulated data volume. Although meth-
ods to re-create speaker HMM such as this method are
effective, in order to use past accumulated data, the required
memory capacity and processing capacity increase. In this
article, the authors consider a method to consecutively
revise the speaker HMM so as to make it robust with respect
to utterance variations, using a small amount of data from
the most recent utterances. In this method, the parameter
values of the speaker HMM that would be re-created using
accumulated past data are approximately calculated from the
data for updating and the current parameter values.
In addition, in real speaker verification systems, the
threshold values used to determine the speaker must be set
beforehand. When setting the threshold values beforehand,
two kinds of errors may occur: false rejection (FR) and false
acceptance (FA). The FR rate can be obtained from the
speaker model and the speaker's own data; the FA rate can
be obtained from the speaker model and data from people
other than the speaker. In this article, the authors consider a method to set
the threshold value close to a value for which the FR rate
and the FA rate are equal as the updating of the speaker
HMM proceeds.
When the training data is limited, or when speech
data uttered over multiple sessions cannot be used, it is very
difficult to estimate the FR rate for data different from the
training data of the speaker (open data). Figure 1 shows the
relationship between the threshold value for FA and FR
calculated from the training data and open data. Note that
the likelihood value for the HMM was used in determining
the speaker. The θ in the figure is the target threshold value
in this method; that is, it represents the threshold value
that gives the equal error rate for the open data. When the
error rate is low, the gap between the error rate curves for
the training data and the open data is larger for the FR rate
than for the FA rate. The FR rate curve for the open data
tends to shift toward lower threshold values (to the left)
when the FR rate curve for the training data is taken as the
standard (see Appendix). Consequently, the target threshold
θ in this method can be taken as the value that gives a
high FA rate on the FA rate curve for the training data. This trend is
pronounced when the speaker model is not sufficiently
robust with respect to utterance variations. Furui proposed
a method to deal with this problem in which the threshold
value is set using the mean and the standard deviation of the
distribution of similarity values between the speaker model
and data from people other than the speaker; in other words,
the method uses only the FA rate, for which the estimates
are comparatively precise. Its effectiveness was shown for
speaker verification using template matching [1]. In the
current research, the authors consider a method in which
the threshold that gives an FA rate higher than the equal
error rate, calculated from the data for updating, is set as
the initial value, and the threshold then gradually approaches
the equal-error value as updating of the speaker HMM proceeds.
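The FA/FR trade-off and the equal-error threshold described above can be made concrete with a small sketch. The following is an illustrative computation (not the authors' code; the score distributions here are synthetic) of FA and FR rates over a threshold sweep, and of the threshold at which the two rates cross:

```python
import numpy as np

def fa_fr_curves(genuine, impostor, thresholds):
    """FA and FR rates at each threshold; a trial is accepted when score >= threshold."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    fr = np.array([(genuine < t).mean() for t in thresholds])    # false rejections
    fa = np.array([(impostor >= t).mean() for t in thresholds])  # false acceptances
    return fa, fr

def equal_error_threshold(genuine, impostor, num=1000):
    """Threshold where FA and FR are closest, i.e., the equal error rate point."""
    lo = min(np.min(genuine), np.min(impostor))
    hi = max(np.max(genuine), np.max(impostor))
    thresholds = np.linspace(lo, hi, num)
    fa, fr = fa_fr_curves(genuine, impostor, thresholds)
    i = int(np.argmin(np.abs(fa - fr)))
    return thresholds[i], (fa[i] + fr[i]) / 2.0

# synthetic scores: genuine trials score higher on average than impostor trials
rng = np.random.default_rng(0)
genuine = rng.normal(1.0, 0.5, 500)
impostor = rng.normal(0.0, 0.5, 500)
theta0, eer = equal_error_threshold(genuine, impostor)
```

With these synthetic distributions the equal-error threshold falls roughly midway between the two score means.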
In this article, the creation of the initial model for each
speaker is referred to as "training," and the updating of later
models as "updating."
2. Updating Models
In this method, the parameter values of the speaker
HMMs that are re-created using accumulated past data are
approximately calculated using data for updating and the
current parameter values in order to reduce the memory
capacity and processing capacity required to update the
speaker HMM.
When all the speech data uttered over sessions
1 to U is used to re-create the speaker HMM based on
maximum likelihood (ML) estimation (iterative estimation
using the Baum–Welch algorithm), the mean vector μ_sm
and the weighting factor w_sm for mixture m of state s in
the HMM (diagonal Gaussian distributions) take the form
shown in Eqs. (1) and (2):
Fig. 1. Relationship between threshold and FA and FR
rates calculated from training and open data.
$$\mu_{sm} = \frac{\sum_{u=1}^{U}\sum_{t=1}^{T_u} c_{smt}^{u}\, x_t^{u}}{\sum_{u=1}^{U}\sum_{t=1}^{T_u} c_{smt}^{u}} \qquad (1)$$

$$w_{sm} = \frac{\sum_{u=1}^{U}\sum_{t=1}^{T_u} c_{smt}^{u}}{\sum_{m=1}^{M}\sum_{u=1}^{U}\sum_{t=1}^{T_u} c_{smt}^{u}} \qquad (2)$$
Here, u (∈ {1, 2, . . . , U}) is the index of the session, and
T_u is the length of the speech data uttered in session u.
Furthermore, c_smt^u is the probability that the vector x_t^u
is observed in state s and mixture m at time t of session u
under the HMM.
In this method, the mean vector μ_sm of Eq. (1) and the
weighting factor w_sm of Eq. (2) are approximated by
μ̂_sm^U in Eq. (3) and ŵ_sm^U in Eq. (4), respectively, using
only the speech data (the data for updating) uttered in the
most recent session U:

$$\hat{\mu}_{sm}^{U} = \frac{\left(\sum_{u=1}^{U-1}\sum_{t=1}^{T_u}\hat{c}_{smt}^{u}\right)\hat{\mu}_{sm}^{U-1} + \left(\sum_{t=1}^{T_U}\hat{c}_{smt}^{U}\right) m_{sm}^{U}}{\sum_{u=1}^{U-1}\sum_{t=1}^{T_u}\hat{c}_{smt}^{u} + \sum_{t=1}^{T_U}\hat{c}_{smt}^{U}} \qquad (3)$$

$$\hat{w}_{sm}^{U} = \frac{\left(\sum_{m=1}^{M}\sum_{u=1}^{U-1}\sum_{t=1}^{T_u}\hat{c}_{smt}^{u}\right)\hat{w}_{sm}^{U-1} + \left(\sum_{m=1}^{M}\sum_{t=1}^{T_U}\hat{c}_{smt}^{U}\right)\bar{w}_{sm}^{U} + \delta}{\sum_{m=1}^{M}\sum_{u=1}^{U-1}\sum_{t=1}^{T_u}\hat{c}_{smt}^{u} + \sum_{m=1}^{M}\sum_{t=1}^{T_U}\hat{c}_{smt}^{U} + \delta} \qquad (4)$$

When the speaker HMM updated consecutively through
sessions 1 to U − 1 by this method is used as the initial
HMM, ĉ_smt^u denotes the probability of the vector x_t^u
under the HMM estimated by ML estimation from the data
for updating in session u. The quantities m_sm^U and
w̄_sm^U, defined by Eqs. (5) and (6), are the mean vector
and weighting factor of each mixture of the HMM estimated
by ML estimation using only the data uttered in session U:

$$m_{sm}^{U} = \frac{\sum_{t=1}^{T_U}\hat{c}_{smt}^{U}\, x_t^{U}}{\sum_{t=1}^{T_U}\hat{c}_{smt}^{U}} \qquad (5)$$

$$\bar{w}_{sm}^{U} = \frac{\sum_{t=1}^{T_U}\hat{c}_{smt}^{U}}{\sum_{m=1}^{M}\sum_{t=1}^{T_U}\hat{c}_{smt}^{U}} \qquad (6)$$

Note that the third term in the numerator and the denominator
of Eq. (4) is a supplementary term (denoted δ here), set
experimentally.

Here $\sum_{u=1}^{U-1}\sum_{t=1}^{T_u}\hat{c}_{smt}^{u}$ corresponds to the number of frames
in all the past data (sessions 1 to U − 1) used to estimate
the mean vector μ̂_sm^{U−1}, and $\sum_{m=1}^{M}\sum_{u=1}^{U-1}\sum_{t=1}^{T_u}\hat{c}_{smt}^{u}$
corresponds to the number of frames used to estimate the
weighting factor ŵ_sm^{U−1}. Therefore, μ̂_sm^U and
ŵ_sm^U can be interpreted essentially as weighted averages
of the previous μ̂_sm^{U−1} and ŵ_sm^{U−1} and of m_sm^U
and w̄_sm^U (estimated by ML estimation from the data for
updating), weighted by the number of frames used to
estimate each parameter.

If Eqs. (3) and (4) are applied recursively (ignoring the
supplementary term), they take a form in which the
c_smt^u of Eqs. (1) and (2) is replaced by ĉ_smt^u. The
probability ĉ_smt^u differs from the c_smt^u of the HMM
estimated by ML estimation using all the data uttered
through session u. This article assumes that the difference
between the two is small, and the approximation is
performed as in Eq. (7):

$$c_{smt}^{u} \approx \hat{c}_{smt}^{u} \qquad (7)$$
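The frame-count-weighted update of the mixture means can be sketched in code as follows. This is a minimal numpy illustration of the idea behind Eqs. (3) and (5) for a single mixture, assuming occupancy probabilities from a forward-backward pass are already available and ignoring the supplementary term; it is not the authors' implementation.

```python
import numpy as np

def update_mixture(mu_prev, n_prev, frames, gamma):
    """
    Weighted-average update of one mixture's mean, in the spirit of Eq. (3).

    mu_prev : mean estimated from sessions 1..U-1
    n_prev  : total occupancy count behind mu_prev (frames of past data)
    frames  : (T, D) feature vectors of the updating session U
    gamma   : (T,) occupancy probabilities of this mixture for each frame
    """
    n_new = gamma.sum()                                     # frames credited to this mixture
    m_new = (gamma[:, None] * frames).sum(axis=0) / n_new   # ML mean on session U only, as in Eq. (5)
    mu = (n_prev * mu_prev + n_new * m_new) / (n_prev + n_new)  # Eq. (3) without the supplementary term
    return mu, n_prev + n_new

# toy check: equal frame counts should give the midpoint of the two means
mu_prev = np.zeros(2)
frames = np.ones((10, 2))
gamma = np.ones(10)
mu, n = update_mixture(mu_prev, 10.0, frames, gamma)
```

Only the running count and the current mean need to be stored per mixture, which is the source of the memory savings discussed in Section 5.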
3. Setting the Threshold Value
As explained in Section 1, the shift of the FR rate curve
for the open data relative to that for the data for updating
in Fig. 1 can be considered to shrink as the amount of data
(in other words, the robustness of the speaker HMM with
respect to utterance variations) increases (see Appendix).
Therefore, in this method, as the amount of data increases
and the speaker HMM becomes more robust with respect to
utterance variations, the threshold θ calculated from the data
for updating is set so as to move steadily from the value that
gives the high FA rate toward the value that gives the equal
error rate, as in Eq. (8):

$$\theta = \theta_0 + w_k\,(\theta_1 - \theta_0) \qquad (8)$$

In order to obtain the equal error rate for the data for
updating, the FR rate is calculated as follows. First, for each
speaker, the speaker model is updated using all the remaining
data sets except the data set for updating. The likelihood
values for that data set are then calculated, and the FR rate
is computed from these likelihood values.

Here, θ0 represents the threshold that gives the equal error
rate for the data for updating, and θ1 is the threshold (set
experimentally) that gives the high FA rate (Fig. 2).

Fig. 2. Method of updating the a priori threshold.

The parameter w_k controls the rate at which the threshold
for the open data approaches the equal-error value for the
data for updating. It is expressed (Fig. 3) by a function that
attenuates slowly, for instance Eq. (9), which includes a free
parameter a:

$$w_k = e^{-ak} \qquad (9)$$

The value k represents the number of times the speaker
model has been updated (k = 0, 1, 2, . . .).
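The threshold schedule can be sketched as follows, assuming an exponentially attenuating w_k (the exact attenuating function is a free choice); the values θ0 = 0.40 and θ1 = 0.25 below are illustrative only:

```python
import math

def threshold(k, theta0, theta1, a=0.5):
    """
    A priori threshold after k model updates: start at theta1 (the high-FA side)
    and decay toward theta0 (the equal-error value for the updating data).
    w_k = exp(-a*k) is one plausible slowly attenuating choice with free parameter a.
    """
    w = math.exp(-a * k)
    return theta0 + w * (theta1 - theta0)

# illustrative values: theta0 = 0.40 (equal-error threshold), theta1 = 0.25 (high-FA threshold)
schedule = [threshold(k, theta0=0.40, theta1=0.25) for k in range(4)]
```

At k = 0 the threshold equals θ1 exactly, and it rises monotonically toward θ0 with each update.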
4. Experimental Conditions
The data used consisted of sentences uttered by twenty men
in five sessions (T1–T5) over a roughly twelve-month period.
The sentences were drawn from the 503 sentences of the ATR
continuous speech database [3]. Among the men, ten were used
Fig. 3. Parameter for controlling the convergence of the
threshold: w.
Table 1. Transcription of the training/updating and testing data
Sentence text
Training 1. Tobu jiyuu o eru koto wa jinrui no yume datta. (Free flight was a dream of humanity.)
2. Hajimete ruuburu bijutsukan e haitta no wa juuyon nen mae no koto da. (I went to the Louvre Museum for
the first time fourteen years ago.)
Updating 3. Jibun no jitsuryoku wa jibun ga ichiban yoku shitte iru hazu da. (You know your own strengths best.)
4. Kore made shonen yakyuu mamasan barei nado chiiki supootsu o sasae shimin ni mitchaku shite kita no wa
musuu no borantia datta. (An endless number of volunteers, including little league mothers, have supported
local sports and brought communities together.)
5. Ginzake no tamago o shuunyuu shite fuka sase kaichuu de sodateru youshoku hajimatte iru. (People have
started importing silver salmon eggs, hatching them, and raising the hatchlings in the sea.)
(6)–(10). The texts in these five sets were different for each speaker and session. They were drawn from the 503
sentences of the ATR continuous transcriptions.
Test 1. Yobou ya kenkou kanri rehabiriteishon no tame no seido o juujitsu shite iku hitsuyou ga arou. (A system for
prevention, health management, and rehabilitation must be implemented.)
2. Karada no takasa wa hyaku nanajuu senchi hodo de me ga ookiku yaya futotte iru. (The person was around
170 centimeters tall, was somewhat heavy, and had large eyes.)
3. Oogoe o dashisugite kasuregoe ni natte shimau. (Talking in a loud voice will make a person hoarse.)
4. Tashizan hikizan wa dekinakute mo e wa egakeru. (Although they cannot add or subtract, they can draw
pictures.)
5. Dono heya no ishou ni mo asobigokoro ga afurete ite tanoshii. (The design of every room was full of a
delightful playfulness of spirit.)
as customers and the other ten as impostors. The cepstral
coefficients were calculated by LPC analysis with an order
of 16, a frame length of 32 ms, and a frame period of 8 ms.
The sampling rate was 12 kHz. In training, a model of each
speaker was created using the ten sentences uttered in the
session T1, then the model was updated using the sentences
uttered over sessions T2 and T3. In testing, five sentences
uttered over the two sessions T4 and T5 were individually
used. Among the ten statements in each session used for
training and updating, five sentences were the same for all
speakers and all sessions, and the other five sentences
differed for each session and each speaker. The text used in
testing was different from the text used in training and
updating, though they were the same for all speakers and
all sessions (Table 1). The total number of test utterances
used in the experiment was 200 sentences [speakers
(20) × sessions (2) × sentences (5)]. The average duration of
each sentence was 4.2 seconds.
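The front-end configuration above (12 kHz sampling, 32-ms frames, 8-ms shift, 16th-order LPC cepstra) can be sketched as follows. This is an illustrative autocorrelation-method implementation, not the analysis code used in the paper, and random noise stands in for speech:

```python
import numpy as np

def lpc_cepstrum(frame, order=16, n_ceps=16):
    """LPC cepstral coefficients of one windowed frame (autocorrelation method)."""
    # autocorrelation up to lag `order`
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])
    # Levinson-Durbin recursion for prediction coefficients a[1..order]
    a = np.zeros(order + 1)
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        k = acc / e
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        e *= (1.0 - k * k)
    # LPC-to-cepstrum recursion: c[n] = a[n] + sum_{k=1}^{n-1} (k/n) c[k] a[n-k]
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        c[n] = a[n] + sum((k / n) * c[k] * a[n - k] for k in range(max(1, n - order), n))
    return c[1:]

sr = 12000
frame_len, frame_shift = int(0.032 * sr), int(0.008 * sr)  # 384 and 96 samples
rng = np.random.default_rng(1)
signal = rng.normal(size=sr)  # one second of noise as a stand-in for speech
frames = [signal[i:i + frame_len] * np.hamming(frame_len)
          for i in range(0, len(signal) - frame_len + 1, frame_shift)]
ceps = np.array([lpc_cepstrum(f) for f in frames])
```

One second of signal at these settings yields 122 frames of 16 cepstral coefficients each.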
Evaluation was performed using text-independent
and text-prompted [4] speaker verification experiments. In
this experiment, the HMM likelihood value was normalized
using the likelihood normalization method [5] based on the
a posteriori probability. In the text-independent experi-
ments, performance was evaluated using the average
speaker verification error rate. The number of verification
experiments for each customer was 110 [{customer (1) +
impostor (10)} × sessions (2) × sentences (5)]. A continu-
ous-type HMM (one state, sixty-four diagonal Gaussian
distributions) was used as a model for each speaker. In
addition, the Baum–Welch algorithm (estimation repeated
three times) was used when creating the speaker HMM for the
first time. In the text-prompted experiments, performance
was evaluated using the speaker and text verification
error rate, in which an utterance of a different text produced
by the speaker is included among the utterances to be rejected. The
number of verification experiments for each customer was
150 [{correct text from customer (1) + different text from
customer (4) + impostor (10)} × sessions (2) × sentences
(5)]. The voice of each speaker was modeled using the
semicontinuous HMM (3 states, 256 Gaussian distribu-
tions) for each phoneme. In addition, when creating the
phoneme HMM for each speaker for the first time, speaker-
independent phoneme HMMs were used as initial models,
and those initial models were adapted to the speaker using
the Baum–Welch algorithm (estimation repeated three
times).
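The a posteriori likelihood normalization of [5] can be sketched roughly as follows; the exact normalization in [5] may differ (the set of reference speakers and the equal weighting are assumptions here):

```python
import numpy as np

def normalized_score(log_l_claimed, log_l_others):
    """
    Posterior-style likelihood normalization (in the spirit of [5], details assumed):
    the claimed speaker's likelihood divided by the sum of likelihoods over a set of
    reference speakers, computed in the log domain for numerical stability.
    """
    all_logs = np.concatenate(([log_l_claimed], log_l_others))
    m = all_logs.max()
    log_denominator = m + np.log(np.exp(all_logs - m).sum())  # log-sum-exp
    return np.exp(log_l_claimed - log_denominator)

# hypothetical log-likelihoods: claimed speaker scores much higher than the references
score = normalized_score(-120.0, np.array([-140.0, -135.0, -150.0]))
```

The normalized score lies between 0 and 1, which makes a single a priori threshold meaningful across speakers and utterances.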
Note that in this experiment, θ1 from Eq. (8) and a
from Eq. (9) were set a posteriori to their optimal values.
In the text-independent experiment, θ1 was set to 0.25,
which gives an FA rate roughly 1% higher than the equal
error rate; in the text-prompted experiment, θ1 was set to
0.55, which gives an FA rate roughly 6% higher than the
equal error rate. The reason for choosing a θ1 that gives a
higher FA rate in the text-prompted experiment than in the
text-independent experiment is that in the text-prompted
experiment speaker HMMs are made for each phoneme, so
the number of HMM parameters to be estimated is larger.
When the same ten sentences are used to update the model,
each parameter is estimated with comparatively better
precision in the text-independent experiment, so the
difference between the FR rate curve for the data for
updating and the FR rate curve for the open data is
considered small. Note that the a priori setting of θ1 and a
remains a topic for future work.
5. Results
5.1. Effects of updating the models
Here, experiments were performed for the following
two types of updating.
Batch type: updating once using ten sentences.
Consecutive type: updating using one sentence at a
time.
The average error rate is shown in Fig. 4 when updat-
ing using the batch type and in Fig. 5 when updating using
the consecutive type for text-independent and text-
prompted speaker verification. Note that in order to see only
the effects of updating the speaker model, the threshold
value was evaluated using the equal error rate for when it
could be set to a value for which the FR rate and the FA rate
calculated from the recognition data were equal. "Recalculation"
refers to the method in which the speaker HMM is re-created:
the mean and variance vectors and the weighting factors of
each mixture of the speaker HMM are estimated via the
Baum–Welch algorithm (estimation repeated three times)
using all the data, including both the data for updating and
the training data. The authors' method performed almost as
well as the recalculation method. When the authors' method
was used to update the speaker HMMs with a total of twenty
sentences, the average error rate was roughly 20% of that
obtained without updating, in both the text-independent type
(0.4/2.5) and the text-prompted type (0.1/0.5).

Fig. 4. Verification error rates when updating the
reference models using ten consecutive sentences.
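The parenthesized ratios quoted above (and throughout Section 5) are the error rate with updating divided by the baseline error rate, which a quick computation confirms:

```python
# relative error rates quoted in Section 5.1: updated rate / baseline rate (in %)
pairs = {
    "text-independent": (0.4, 2.5),
    "text-prompted": (0.1, 0.5),
}
ratios = {name: new / base for name, (new, base) in pairs.items()}
```

Both ratios come out near 0.2, i.e., "roughly 20%" of the error rate without updating.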
Furthermore, Table 2 shows a rough estimate of the
memory capacity (roughly equal to the calculation capacity)
required by the two types of updating in this method,
normalized so that the memory required by the recalculation
method is 1. For the batch type, the number of updates
k = 1 or 2 corresponds to updating with the ten sentences
uttered in sessions T2 and T3, respectively. For the
consecutive type, k = 5 through 20 corresponds to using
5 to 20 of the sentences uttered in sessions T2 and T3 one
at a time. In the recalculation method, the required memory
grows with each update, by ten sentences per update in the
batch type and by one sentence per update in the consecutive
type, in addition to the ten sentences used to create the
initial model. In the authors' method, however, only the
data for updating must be stored. Thus, for either type of
updating, the required memory for this method is smaller
than for the recalculation method.
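The entries of Table 2 follow from counting sentences: the proposed method stores only the current batch of updating data, while recalculation stores the initial ten training sentences plus all accumulated updating data. A sketch, assuming one sentence corresponds to one unit of memory:

```python
def memory_ratio(init_sentences, update_batches):
    """
    Memory needed by the proposed method relative to recalculation.
    Proposed method: only the most recent batch of updating data.
    Recalculation: initial training data plus all accumulated updating data.
    """
    batch = update_batches[-1]
    total = init_sentences + sum(update_batches)
    return batch / total

# batch type: ten sentences per update (Table 2)
batch_k1 = memory_ratio(10, [10])        # 1/2
batch_k2 = memory_ratio(10, [10, 10])    # 1/3
# consecutive type: one sentence per update
consec_k5 = memory_ratio(10, [1] * 5)    # 1/15
consec_k20 = memory_ratio(10, [1] * 20)  # 1/30
```

These reproduce the ratios 1/2, 1/3, 1/15, and 1/30 listed in Table 2.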
These results show the effectiveness of updating
speaker models using the authors' method.
5.2. Effects of resetting the threshold value
The effects of resetting the threshold value were
investigated for when the speaker model was updated
(batch type) using this method and using ten sentences
once. Figure 6 shows the FA and FR rates when the
threshold was reset using this method. The numbers of
updates k = 0, 1, 2 correspond to training/updating
performed using the ten sentences uttered in each of
sessions T1, T2, and T3. In addition, Fig. 7 shows the case
of resetting the threshold to a value at which the FA rate is
roughly 1% higher than the equal error rate in the
text-independent type and roughly 6% higher in the
text-prompted type. Figure 8 shows the case of setting the
initial threshold to such a value (roughly 1% higher in the
text-independent type and roughly 6% higher in the
text-prompted type) and then not resetting it thereafter. Figure 9
shows the FA and FR rates (and their average error rates)
when resetting the threshold value to the value of the equal
error rate for the data for updating.
Fig. 5. Verification error rates when updating the
reference models successively using each sentence.

Table 2. Memory size required by the proposed method, normalized by that of the recalculation method
Batch type (number of times updated k):        k = 1: 1/2 (= 10/20);   k = 2: 1/3 (= 10/30)
Consecutive type (number of times updated k):  k = 5: 1/15;   k = 10: 1/20;   k = 15: 1/25;   k = 20: 1/30

When the threshold is reset using this method (Fig. 6,
k = 2), the average of the FA and FR rates is roughly 40%
(2.2/5.5) in the text-independent type and roughly 30%
(2.8/9.1) in the text-prompted type of that obtained when
the threshold is set to the value of the equal error rate for
the data for updating (Fig. 9, k = 2). Compared to when
neither the model nor the threshold is updated (Fig. 6,
k = 0), it is roughly 40% (2.2/5.1) for the text-independent
type and roughly 80% (2.8/3.6) for the text-prompted type.
These results show the effectiveness of the authors' method.
In addition, a comparison
of resetting the threshold using this method (Fig. 6) with
resetting it to the value, calculated from the data for
updating, that gives the high FA rate (Fig. 7) shows that
the FR rate is lower for this method. This is because, as
updating progresses, the threshold for the open data moves
from the value that gives the high FA rate, calculated from
the data for updating, to the value that gives the equal error
rate. Moreover, when the threshold is not reset (Fig. 8), the
equal error rate is higher. As a result, the threshold value must be
reset in concert with updating the model. In Fig. 8, the FA
rate is not particularly high, since the likelihood value of
the speaker model with respect to the data of other people
becomes only somewhat larger. This is because the feature
parameter space expressed by the speaker HMM expands,
and the likelihood value of the speaker model with respect
to the speaker's own data becomes larger, as the speaker
HMM becomes more robust against utterance variations.

Fig. 6. FA and FR rates when using proposed method of
resetting the a priori threshold.

Fig. 7. FA and FR rates when resetting the a priori
threshold to where the FA rate is higher than the error
rate at the equal error threshold.

Fig. 8. FA and FR rates without resetting the a priori
threshold.

Fig. 9. FA and FR rates when resetting the a priori
threshold to the equal error threshold.
6. Conclusions
This article reported on a method to update speaker
models using a small amount of recently uttered data for
updating, and a method to set the threshold value in concert
with the updating of the speaker models. For model
updating, compared with the conventional method of
re-creating the speaker model by ML estimation, this
method, which approximates the conventional method, not
only requires less memory (roughly equal to the processing
capacity) but also achieves nearly the same performance.
The speaker verification error rate when updating the
speaker model using this method (with the threshold set
a posteriori to the value that gives the equal error rate) is
around 20% of that when no updating is performed, in both
the text-independent type and the text-prompted type.
When the threshold is also reset in addition to updating the
model, the equal error rate drops to roughly 40% for the
text-independent type and roughly 30% for the text-prompted
type of that obtained when the threshold is set to the value
of the equal error rate for the data for updating. It drops to
roughly 40% for the text-independent type and roughly 80%
for the text-prompted type compared to when the model and
threshold value are not updated.
In the future, the authors will perform experiments
with an increased number of data sessions. They will inves-
tigate methods for making robust estimates of the variance
vector of each mixture distribution for speaker HMM, in
addition to confirming the effectiveness of this method.
Acknowledgments. The authors would like to thank the
members of the Furui Special Research Office of the NTT
Human Interface Laboratories for valuable resource material.
REFERENCES
1. Furui S. Cepstral analysis technique for automatic
speaker verification. IEEE Trans Acoustics, Speech, and
Signal Processing 1981;ASSP-29:254–272.
2. Setlur A, Jacobs T. Results of a speaker verification
service trial using HMM models. Proc Eurospeech
1995;I:53–56.
3. Kurihara T, Niosaka Y, Takeda K, Abe M. Construction
of the ATR Japanese speech database for research
(Volume I: continuous speech text). 1989;TR-I-0086.
4. Matsui T, Furui S. Text-prompted speaker recognition.
Trans IEICE 1996;J79-D-II:647–656.
5. Matsui T, Furui S. Likelihood normalization using a
phoneme- and speaker-independent model for speaker
verification. Speech Communication 1995;17:109–116.
APPENDIX
FA and FR Rate Curves for Training Data
and Open Data
Figure A-1 shows the FA and FR rate curves for every
speaker, using the ten sentences uttered in session T1 as
training data and the five sentences uttered by each speaker
in sessions T2 through T4 as open data, in a text-independent
speaker verification experiment (Section 4). The solid lines
represent the FA and FR rate curves calculated from the
training data; the training data of all customers other than
the speaker in question (nine speakers) was used to obtain
the FA rate. The dotted lines represent the FA and FR rate
curves calculated from the open data of each session. Ten
speakers different from the customers of Section 4 were
used as impostors, so that the data would also be open with
respect to speakers when obtaining the FA rate. When the error
rate was comparatively low (under 25%), the difference
in the error rate curves for the training data and the
open data was larger for the FR rate than for the FA rate. The
FR rate curve for the open data tended to shift toward a
smaller threshold (to the left) when the FR rate curve for
the training data was used as a standard. In addition, the
inconsistency in the error rate curves for the open data in
each session was larger for the FR rate than for the FA rate.
The FR rate was more affected by the session-to-session
utterance variations.
With respect to the parameter w of Section 3, which
controls the rate at which the threshold approaches the
equal-error value, Fig. A-2 shows the optimal value for
each updating session and each speaker in a text-independent
speaker verification experiment (Section 4), when updating
was performed batch-style using the ten sentences uttered
in each of sessions T2 and T3. The dotted line represents
the least-squares regression line. The optimal value of w
tended to become smaller with each update. This indicates
that the shift of the FR rate curve for the open data relative
to that for the data for updating tends to shrink as the
speaker HMMs are updated and their robustness to
utterance variations increases.
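The dotted regression line in Fig. A-2 is an ordinary least-squares fit of the optimal w against the number of updates; with hypothetical values (the paper reports only the decreasing trend) it can be reproduced as:

```python
import numpy as np

# hypothetical optimal w values per update; Fig. A-2 reports a decreasing trend
k = np.array([1, 2, 3, 4])
w_opt = np.array([0.8, 0.6, 0.45, 0.3])

# least-squares regression line, as used for the dotted line in Fig. A-2
slope, intercept = np.polyfit(k, w_opt, 1)
```

A negative slope corresponds to the observed tendency of the optimal w to shrink with each update.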
Fig. A-1. FA and FR rates calculated by using training and open data for each speaker.
Fig. A-2. Optimal values of parameter w for each speaker
(w controls the convergence of the threshold).
AUTHORS (from left to right)
Tomoko Matsui (member) completed her M.S. degree in 1986 at the Department of Engineering and Intelligence, Tokyo
Institute of Technology. She joined NTT in 1986 and has been conducting research on speaker recognition at the NTT Human
Interface Laboratories ever since. She also holds a D.Eng. degree. She is a member of IEEE and the Acoustical
Society of Japan. She received the Paper Award in 1993.
Takashi Nishitani graduated in 1996 from the Department of Electronics, Tokyo Institute of Technology. He is currently
working on an M.E. degree from the graduate school at the same university.
Sadaoki Furui (member) graduated in 1968 from the Department of Engineering and Statistics, the University of Tokyo. He
received his M.S. from the graduate school of the same university in 1970. He joined the NTT Electronics and Communications
Research Center in the same year and has been conducting research on speech recognition, speaker recognition, and speech
perception ever since. He was a member of Bell Laboratories from 1978 to 1979. He is currently a professor at the Tokyo Institute
of Technology. He also holds a D.Eng. degree. He received the Yonezawa Prize in 1975, the Paper Award in 1988 and 1993,
the Achievement Award in 1990, the Science and Technology Agency Prize in 1989, and the IEEE ASSP Society Senior Award. He wrote
Digital Voice Processing; Digital Speech Processing, Synthesis, and Recognition; and Acoustics and Speech Engineering, and
he edited Advances in Speech Signal Processing. He is a fellow of the IEEE and the Acoustical Society of America.