A Study of Models and a Priori Threshold Updating in Speaker
Verification
Tomoko Matsui
NTT Human Interface Laboratories, Yokosuka, Japan 239-0847
Takashi Nishitani and Sadaoki Furui
Tokyo Institute of Technology, Tokyo, Japan 152-0033
SUMMARY
This article explores a method for speaker verification that appropriately sets the threshold value used to judge the identities of individual speakers, and also considers methods to update the speaker models so as to make them more robust with respect to utterance variations, using a small amount of recently uttered data for updating. Each speaker model is represented by a hidden Markov model (HMM). In model updating, the parameters of the speaker HMM are estimated from the data for updating and the current parameter values. For threshold setting, the initial threshold for each speaker is set to a value that gives an FA rate higher than the equal error rate (the rate at which the false rejection rate and the false acceptance rate are equal), calculated from the data for updating; the threshold then steadily approaches the value that gives the equal error rate as updating of the speaker HMM proceeds. The results of evaluating this method in text-independent and text-prompted speaker verification experiments with twenty speakers show that the average error rate fell to roughly 40% for the text-independent type and to roughly 80% for the text-prompted type, compared to when the model and threshold value are not updated. © 1999 Scripta Technica, Syst Comp Jpn, 30(13): 96–105, 1999
Key words: Speaker verification; model updating;
threshold value; hidden Markov model; text-independent
type; text-prompted type.
1. Introduction
In speaker verification, the degree of similarity (the
likelihood value) between an input speech and a speaker
model is calculated and compared with a particular thresh-
old to identify the speaker. In real systems, the initial model
for each speaker is often created from a small amount of
data uttered in one session, with consideration for the
burden on each speaker to produce speech data. In order to
obtain high speaker verification performance, past data
should be accumulated for each speaker, and a model of
the speaker should be re-created using data that includes
various utterance variations (primarily generated from dif-
ferences in the contents of utterances and the session-
to-session utterance variations).
Furui proposed a method to revise speaker templates
using templates generated from combining the most recent
multiple utterances in speaker verification using template
matching [1]. In this method, utterances from the distant
past are not used, and instead the most recent multiple
utterances are used because the voice changes over the
course of time. In speaker verification using the hidden
CCC0882-1666/99/130096-10
© 1999 Scripta Technica
Systems and Computers in Japan, Vol. 30, No. 13, 1999
Translated from Denshi Joho Tsushin Gakkai Ronbunshi, Vol. J81-D-II, No. 2, February 1998, pp. 268–276
Markov model (HMM), although there are few research
examples related to revisions of a speaker HMM, recently
Setlur and Jacobs [2] reported on a method to re-create
speaker HMM while making the model structure more
complex so as to better capture utterance variations corre-
sponding to the accumulated data volume. Although meth-
ods to re-create speaker HMM such as this method are
effective, in order to use past accumulated data, the required
memory capacity and processing capacity increase. In this
article, the authors consider a method to consecutively
revise the speaker HMM so as to make it robust with respect
to utterance variations, using a small amount of data from
the most recent utterances. In this method, the parameter
values of the speaker HMM that would be re-created using
accumulated past data are approximately calculated from the
data for updating and the current parameter values.
In addition, in real speaker verification systems, the
threshold values used to determine the speaker must be set
beforehand. When setting the threshold values beforehand,
two kinds of errors may occur: false rejection (FR) and false
acceptance (FA). The FR rate can be obtained from the
speaker model and the speaker's own data; the FA rate can
be obtained from the speaker model and data from people
other than the speaker. In this article, the authors consider a method to set
the threshold value close to a value for which the FR rate
and the FA rate are equal as the updating of the speaker
HMM proceeds.
When the training data is limited, or when speech
data uttered over multiple sessions cannot be used, it is very
difficult to estimate the FR rate for data different from the
training data of the speaker (open data). Figure 1 shows the
relationship between the threshold value for FA and FR
calculated from the training data and open data. Note that
the likelihood value for the HMM was used in determining
the speaker. The θ in the figure is the target threshold value
in this method; that is, it represents the threshold value
that gives the equal error rate for the open data. When the
error rate is low, the gap between the error rate curves for
the training data and the open data is larger for the FR rate
than for the FA rate. The FR rate curve for the open data
tends to shift toward lower threshold values (to the left)
when the FR rate curve for the training data is taken as the
standard (see Appendix). Consequently, the target threshold
θ in this method can be taken as the value that gives a
high FA rate on the FA rate curve for the training data. This trend is
pronounced when the speaker model is not sufficiently
robust with respect to utterance variations. Furui proposed
a method to deal with this problem in which the threshold
value is set using the mean and the standard deviation of the
distribution of similarity values between the speaker model
and data from people other than the speaker; in other words,
the method uses only the FA rate, for which the estimates
are comparatively precise. Its effectiveness was shown for
speaker verification using template matching [1]. In the
current research, the authors consider a method in which
the threshold that gives an FA rate higher than the equal
error rate, calculated from the data for updating, is set as
the initial value, and the threshold then gradually approaches
the equal-error value as updating of the speaker HMM proceeds.
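The FA/FR trade-off and the equal-error threshold described above can be made concrete with a small sketch. The following is an illustrative computation (not the authors' code; the score distributions here are synthetic) of FA and FR rates over a threshold sweep, and of the threshold at which the two rates cross:

```python
import numpy as np

def fa_fr_curves(genuine, impostor, thresholds):
    """FA and FR rates at each threshold; a trial is accepted when score >= threshold."""
    genuine = np.asarray(genuine, dtype=float)
    impostor = np.asarray(impostor, dtype=float)
    fr = np.array([(genuine < t).mean() for t in thresholds])    # false rejections
    fa = np.array([(impostor >= t).mean() for t in thresholds])  # false acceptances
    return fa, fr

def equal_error_threshold(genuine, impostor, num=1000):
    """Threshold where FA and FR are closest, i.e., the equal error rate point."""
    lo = min(np.min(genuine), np.min(impostor))
    hi = max(np.max(genuine), np.max(impostor))
    thresholds = np.linspace(lo, hi, num)
    fa, fr = fa_fr_curves(genuine, impostor, thresholds)
    i = int(np.argmin(np.abs(fa - fr)))
    return thresholds[i], (fa[i] + fr[i]) / 2.0

# synthetic scores: genuine trials score higher on average than impostor trials
rng = np.random.default_rng(0)
genuine = rng.normal(1.0, 0.5, 500)
impostor = rng.normal(0.0, 0.5, 500)
theta0, eer = equal_error_threshold(genuine, impostor)
```

With these synthetic distributions the equal-error threshold falls roughly midway between the two score means.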
In this article, the creation of the initial model for each
speaker is referred to as "training," and the updating of later
models as "updating."
2. Updating Models
In this method, the parameter values of the speaker
HMMs that are re-created using accumulated past data are
approximately calculated using data for updating and the
current parameter values in order to reduce the memory
capacity and processing capacity required to update the
speaker HMM.
When all the speech data uttered over sessions
1 to U is used to re-create the speaker HMM based on
maximum likelihood (ML) estimation (iterative estimation
using the Baum–Welch algorithm), the mean vector μ_sm
and the weighting factor w_sm for mixture m of state s in
the HMM (diagonal Gaussian distributions) take the form
shown in Eqs. (1) and (2):
Fig. 1. Relationship between threshold and FA and FR
rates calculated from training and open data.
$$\mu_{sm} = \frac{\sum_{u=1}^{U}\sum_{t=1}^{T_u} c_{smt}^{u}\, x_t^{u}}{\sum_{u=1}^{U}\sum_{t=1}^{T_u} c_{smt}^{u}} \qquad (1)$$

$$w_{sm} = \frac{\sum_{u=1}^{U}\sum_{t=1}^{T_u} c_{smt}^{u}}{\sum_{m=1}^{M}\sum_{u=1}^{U}\sum_{t=1}^{T_u} c_{smt}^{u}} \qquad (2)$$
Here, u (∈ {1, 2, . . . , U}) is the index of the session, and
T_u is the length of the speech data uttered in session u.
Furthermore, c_smt^u is the probability that the vector x_t^u
is observed in state s and mixture m at time t of session u
under the HMM.
In this method, the mean vector μ_sm of Eq. (1) and the
weighting factor w_sm of Eq. (2) are approximated by
μ̂_sm^U in Eq. (3) and ŵ_sm^U in Eq. (4), respectively, using
only the speech data (the data for updating) uttered in the
most recent session U:

$$\hat{\mu}_{sm}^{U} = \frac{\left(\sum_{u=1}^{U-1}\sum_{t=1}^{T_u}\hat{c}_{smt}^{u}\right)\hat{\mu}_{sm}^{U-1} + \left(\sum_{t=1}^{T_U}\hat{c}_{smt}^{U}\right) m_{sm}^{U}}{\sum_{u=1}^{U-1}\sum_{t=1}^{T_u}\hat{c}_{smt}^{u} + \sum_{t=1}^{T_U}\hat{c}_{smt}^{U}} \qquad (3)$$

$$\hat{w}_{sm}^{U} = \frac{\left(\sum_{m=1}^{M}\sum_{u=1}^{U-1}\sum_{t=1}^{T_u}\hat{c}_{smt}^{u}\right)\hat{w}_{sm}^{U-1} + \left(\sum_{m=1}^{M}\sum_{t=1}^{T_U}\hat{c}_{smt}^{U}\right)\bar{w}_{sm}^{U} + \delta}{\sum_{m=1}^{M}\sum_{u=1}^{U-1}\sum_{t=1}^{T_u}\hat{c}_{smt}^{u} + \sum_{m=1}^{M}\sum_{t=1}^{T_U}\hat{c}_{smt}^{U} + \delta} \qquad (4)$$

When the speaker HMM updated consecutively through
sessions 1 to U − 1 by this method is used as the initial
HMM, ĉ_smt^u denotes the probability of the vector x_t^u
under the HMM estimated by ML estimation from the data
for updating in session u. The quantities m_sm^U and
w̄_sm^U, defined by Eqs. (5) and (6), are the mean vector
and weighting factor of each mixture of the HMM estimated
by ML estimation using only the data uttered in session U:

$$m_{sm}^{U} = \frac{\sum_{t=1}^{T_U}\hat{c}_{smt}^{U}\, x_t^{U}}{\sum_{t=1}^{T_U}\hat{c}_{smt}^{U}} \qquad (5)$$

$$\bar{w}_{sm}^{U} = \frac{\sum_{t=1}^{T_U}\hat{c}_{smt}^{U}}{\sum_{m=1}^{M}\sum_{t=1}^{T_U}\hat{c}_{smt}^{U}} \qquad (6)$$

Note that the third term in the numerator and the denominator
of Eq. (4) is a supplementary term (denoted δ here), set
experimentally.

Here $\sum_{u=1}^{U-1}\sum_{t=1}^{T_u}\hat{c}_{smt}^{u}$ corresponds to the number of frames
in all the past data (sessions 1 to U − 1) used to estimate
the mean vector μ̂_sm^{U−1}, and $\sum_{m=1}^{M}\sum_{u=1}^{U-1}\sum_{t=1}^{T_u}\hat{c}_{smt}^{u}$
corresponds to the number of frames used to estimate the
weighting factor ŵ_sm^{U−1}. Therefore, μ̂_sm^U and
ŵ_sm^U can be interpreted essentially as weighted averages
of the previous μ̂_sm^{U−1} and ŵ_sm^{U−1} and of m_sm^U
and w̄_sm^U (estimated by ML estimation from the data for
updating), weighted by the number of frames used to
estimate each parameter.

If Eqs. (3) and (4) are applied recursively (ignoring the
supplementary term), they take a form in which the
c_smt^u of Eqs. (1) and (2) is replaced by ĉ_smt^u. The
probability ĉ_smt^u differs from the c_smt^u of the HMM
estimated by ML estimation using all the data uttered
through session u. This article assumes that the difference
between the two is small, and the approximation is
performed as in Eq. (7):

$$c_{smt}^{u} \approx \hat{c}_{smt}^{u} \qquad (7)$$
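The frame-count-weighted update of the mixture means can be sketched in code as follows. This is a minimal numpy illustration of the idea behind Eqs. (3) and (5) for a single mixture, assuming occupancy probabilities from a forward-backward pass are already available and ignoring the supplementary term; it is not the authors' implementation.

```python
import numpy as np

def update_mixture(mu_prev, n_prev, frames, gamma):
    """
    Weighted-average update of one mixture's mean, in the spirit of Eq. (3).

    mu_prev : mean estimated from sessions 1..U-1
    n_prev  : total occupancy count behind mu_prev (frames of past data)
    frames  : (T, D) feature vectors of the updating session U
    gamma   : (T,) occupancy probabilities of this mixture for each frame
    """
    n_new = gamma.sum()                                     # frames credited to this mixture
    m_new = (gamma[:, None] * frames).sum(axis=0) / n_new   # ML mean on session U only, as in Eq. (5)
    mu = (n_prev * mu_prev + n_new * m_new) / (n_prev + n_new)  # Eq. (3) without the supplementary term
    return mu, n_prev + n_new

# toy check: equal frame counts should give the midpoint of the two means
mu_prev = np.zeros(2)
frames = np.ones((10, 2))
gamma = np.ones(10)
mu, n = update_mixture(mu_prev, 10.0, frames, gamma)
```

Only the running count and the current mean need to be stored per mixture, which is the source of the memory savings discussed in Section 5.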
3. Setting the Threshold Value
As explained in Section 1, the shift of the FR rate curve
for the open data relative to that for the data for updating
in Fig. 1 can be considered to shrink as the amount of data
(in other words, the robustness of the speaker HMM with
respect to utterance variations) increases (see Appendix).
Therefore, in this method, as the amount of data increases
and the speaker HMM becomes more robust with respect to
utterance variations, the threshold θ calculated from the data
for updating is set so as to move steadily from the value that
gives the high FA rate toward the value that gives the equal
error rate, as in Eq. (8):

$$\theta = \theta_0 + w_k\,(\theta_1 - \theta_0) \qquad (8)$$

In order to obtain the equal error rate for the data for
updating, the FR rate is calculated as follows. First, for each
speaker, the speaker model is updated using all the remaining
data sets except the data set for updating. The likelihood
values for that data set are then calculated, and the FR rate
is computed from these likelihood values.

Here, θ0 represents the threshold that gives the equal error
rate for the data for updating, and θ1 is the threshold (set
experimentally) that gives the high FA rate (Fig. 2).

Fig. 2. Method of updating the a priori threshold.

The parameter w_k controls the rate at which the threshold
for the open data approaches the equal-error value for the
data for updating. It is expressed (Fig. 3) by a function that
attenuates slowly, for instance Eq. (9), which includes a free
parameter a:

$$w_k = e^{-ak} \qquad (9)$$

The value k represents the number of times the speaker
model has been updated (k = 0, 1, 2, . . .).
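The threshold schedule can be sketched as follows, assuming an exponentially attenuating w_k (the exact attenuating function is a free choice); the values θ0 = 0.40 and θ1 = 0.25 below are illustrative only:

```python
import math

def threshold(k, theta0, theta1, a=0.5):
    """
    A priori threshold after k model updates: start at theta1 (the high-FA side)
    and decay toward theta0 (the equal-error value for the updating data).
    w_k = exp(-a*k) is one plausible slowly attenuating choice with free parameter a.
    """
    w = math.exp(-a * k)
    return theta0 + w * (theta1 - theta0)

# illustrative values: theta0 = 0.40 (equal-error threshold), theta1 = 0.25 (high-FA threshold)
schedule = [threshold(k, theta0=0.40, theta1=0.25) for k in range(4)]
```

At k = 0 the threshold equals θ1 exactly, and it rises monotonically toward θ0 with each update.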
4. Experimental Conditions
The data used consisted of sentences uttered by twenty men
in five sessions (T1–T5) over a roughly twelve-month period.
The sentences were drawn from the 503 sentences of the ATR
continuous speech database [3]. Among the men, ten were used
Fig. 3. Parameter for controlling the convergence of the
threshold: w.
Table 1. Transcription of the training/updating and testing data
Sentence text
Training 1. Tobu jiyuu o eru koto wa jinrui no yume datta. (Free flight was a dream of humanity.)
2. Hajimete ruuburu bijutsukan e haitta no wa juuyon nen mae no koto da. (I went to the Louvre Museum for
the first time fourteen years ago.)
Updating 3. Jibun no jitsuryoku wa jibun ga ichiban yoku shitte iru hazu da. (You know your own strengths best.)
4. Kore made shonen yakyuu mamasan barei nado chiiki supootsu o sasae shimin ni mitchaku shite kita no wa
musuu no borantia datta. (An endless number of volunteers, including little league mothers, have supported
local sports and brought communities together.)
5. Ginzake no tamago o shuunyuu shite fuka sase kaichuu de sodateru youshoku hajimatte iru. (People have
started importing silver salmon eggs, hatching them, and raising the hatchlings in the sea.)
(6)–(10). The texts in these five sets were different for each speaker and session. They were drawn from the 503
sentences of the ATR continuous transcriptions.
Test 1. Yobou ya kenkou kanri rehabiriteishon no tame no seido o juujitsu shite iku hitsuyou ga arou. (A system for
prevention, health management, and rehabilitation must be implemented.)
2. Karada no takasa wa hyaku nanajuu senchi hodo de me ga ookiku yaya futotte iru. (The person was around
170 centimeters tall, was somewhat heavy, and had large eyes.)
3. Oogoe o dashisugite kasuregoe ni natte shimau. (Talking in a loud voice will make a person hoarse.)
4. Tashizan hikizan wa dekinakute mo e wa egakeru. (Although they cannot add or subtract, they can draw
pictures.)
5. Dono heya no ishou ni mo asobigokoro ga afurete ite tanoshii. (The design of every room was full of a
delightful playfulness of spirit.)
as customers and the other ten as impostors. The cepstral
coefficients were calculated by LPC analysis with an order
of 16, a frame length of 32 ms, and a frame period of 8 ms.
The sampling rate was 12 kHz. In training, a model of each
speaker was created using the ten sentences uttered in the
session T1, then the model was updated using the sentences
uttered over sessions T2 and T3. In testing, five sentences
uttered over the two sessions T4 and T5 were individually
used. Among the ten statements in each session used for
training and updating, five sentences were the same for all
speakers and all sessions, and the other five sentences
differed for each session and each speaker. The text used in
testing was different from the text used in training and
updating, though they were the same for all speakers and
all sessions (Table 1). The total number of test utterances
used in the experiment was 200 sentences [speakers
(20) × sessions (2) × sentences (5)]. The average duration of
each sentence was 4.2 seconds.
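The front-end configuration above (12 kHz sampling, 32-ms frames, 8-ms shift, 16th-order LPC cepstra) can be sketched as follows. This is an illustrative autocorrelation-method implementation, not the analysis code used in the paper, and random noise stands in for speech:

```python
import numpy as np

def lpc_cepstrum(frame, order=16, n_ceps=16):
    """LPC cepstral coefficients of one windowed frame (autocorrelation method)."""
    # autocorrelation up to lag `order`
    r = np.array([np.dot(frame[:len(frame) - k], frame[k:]) for k in range(order + 1)])
    # Levinson-Durbin recursion for prediction coefficients a[1..order]
    a = np.zeros(order + 1)
    e = r[0]
    for i in range(1, order + 1):
        acc = r[i] - np.dot(a[1:i], r[i - 1:0:-1])
        k = acc / e
        a_new = a.copy()
        a_new[i] = k
        a_new[1:i] = a[1:i] - k * a[i - 1:0:-1]
        a = a_new
        e *= (1.0 - k * k)
    # LPC-to-cepstrum recursion: c[n] = a[n] + sum_{k=1}^{n-1} (k/n) c[k] a[n-k]
    c = np.zeros(n_ceps + 1)
    for n in range(1, n_ceps + 1):
        c[n] = a[n] + sum((k / n) * c[k] * a[n - k] for k in range(max(1, n - order), n))
    return c[1:]

sr = 12000
frame_len, frame_shift = int(0.032 * sr), int(0.008 * sr)  # 384 and 96 samples
rng = np.random.default_rng(1)
signal = rng.normal(size=sr)  # one second of noise as a stand-in for speech
frames = [signal[i:i + frame_len] * np.hamming(frame_len)
          for i in range(0, len(signal) - frame_len + 1, frame_shift)]
ceps = np.array([lpc_cepstrum(f) for f in frames])
```

One second of signal at these settings yields 122 frames of 16 cepstral coefficients each.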
Evaluation was performed using text-independent
and text-prompted [4] speaker verification experiments. In
this experiment, the HMM likelihood value was normalized
using the likelihood normalization method [5] based on the
a posteriori probability. In the text-independent experi-
ments, performance was evaluated using the average
speaker verification error rate. The number of verification
experiments for each customer was 110 [{customer (1) +
impostor (10)} × sessions (2) × sentences (5)]. A continu-
ous-type HMM (one state, sixty-four diagonal Gaussian
distributions) was used as a model for each speaker. In
addition, the Baum–Welch algorithm (estimation repeated
three times) was used when creating the speaker HMM for the
first time. In the text-prompted experiments, performance
was evaluated using the speaker and text verification
error rate, in which an utterance of a different text produced
by the speaker is included among the utterances to be rejected. The
number of verification experiments for each customer was
150 [{correct text from customer (1) + different text from
customer (4) + impostor (10)} × sessions (2) × sentences
(5)]. The voice of each speaker was modeled using the
semicontinuous HMM (3 states, 256 Gaussian distribu-
tions) for each phoneme. In addition, when creating the
phoneme HMM for each speaker for the first time, speaker-
independent phoneme HMMs were used as initial models,
and those initial models were adapted to the speaker using
the Baum–Welch algorithm (estimation repeated three
times).
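The a posteriori likelihood normalization of [5] can be sketched roughly as follows; the exact normalization in [5] may differ (the set of reference speakers and the equal weighting are assumptions here):

```python
import numpy as np

def normalized_score(log_l_claimed, log_l_others):
    """
    Posterior-style likelihood normalization (in the spirit of [5], details assumed):
    the claimed speaker's likelihood divided by the sum of likelihoods over a set of
    reference speakers, computed in the log domain for numerical stability.
    """
    all_logs = np.concatenate(([log_l_claimed], log_l_others))
    m = all_logs.max()
    log_denominator = m + np.log(np.exp(all_logs - m).sum())  # log-sum-exp
    return np.exp(log_l_claimed - log_denominator)

# hypothetical log-likelihoods: claimed speaker scores much higher than the references
score = normalized_score(-120.0, np.array([-140.0, -135.0, -150.0]))
```

The normalized score lies between 0 and 1, which makes a single a priori threshold meaningful across speakers and utterances.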
Note that in this experiment, θ1 from Eq. (8) and a
from Eq. (9) were set a posteriori to their optimal values.
In the text-independent experiment, θ1 was set to 0.25,
which gives an FA rate roughly 1% higher than the equal
error rate; in the text-prompted experiment, θ1 was set to
0.55, which gives an FA rate roughly 6% higher than the
equal error rate. The reason for choosing a θ1 that gives a
higher FA rate in the text-prompted experiment than in the
text-independent experiment is that in the text-prompted
experiment speaker HMMs are made for each phoneme, so
the number of HMM parameters to be estimated is larger.
When the same ten sentences are used to update the model,
each parameter is estimated with comparatively better
precision in the text-independent experiment, so the
difference between the FR rate curve for the data for
updating and the FR rate curve for the open data is
considered small. Note that the a priori setting of θ1 and a
remains a topic for future work.
5. Results
5.1. Effects of updating the models
Here, experiments were performed for the following
two types of updating.
Batch type: updating once using ten sentences.
Consecutive type: updating using one sentence at a
time.
The average error rate is shown in Fig. 4 when updat-
ing using the batch type and in Fig. 5 when updating using
the consecutive type for text-independent and text-
prompted speaker verification. Note that in order to see only
the effects of updating the speaker model, the threshold
value was evaluated using the equal error rate for when it
could be set to a value for which the FR rate and the FA rate
calculated from the recognition data were equal. "Recalculation"
refers to the method in which the speaker HMM is re-created:
the mean and variance vectors and the weighting factors of
each mixture of the speaker HMM are estimated via the
Baum–Welch algorithm (estimation repeated three times)
using all the data, including both the data for updating and
the training data. The authors' method performed almost as
well as the recalculation method. When the authors' method
was used to update the speaker HMMs with a total of twenty
sentences, the average error rate was roughly 20% of that
obtained without updating, in both the text-independent type
(0.4/2.5) and the text-prompted type (0.1/0.5).

Fig. 4. Verification error rates when updating the
reference models using ten consecutive sentences.
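The parenthesized ratios quoted above (and throughout Section 5) are the error rate with updating divided by the baseline error rate, which a quick computation confirms:

```python
# relative error rates quoted in Section 5.1: updated rate / baseline rate (in %)
pairs = {
    "text-independent": (0.4, 2.5),
    "text-prompted": (0.1, 0.5),
}
ratios = {name: new / base for name, (new, base) in pairs.items()}
```

Both ratios come out near 0.2, i.e., "roughly 20%" of the error rate without updating.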
Furthermore, Table 2 shows a rough estimate of the
memory capacity (roughly equal to the calculation capacity)
required by the two types of updating in this method,
normalized so that the memory required by the recalculation
method is 1. For the batch type, the number of updates
k = 1 or 2 corresponds to updating with the ten sentences
uttered in sessions T2 and T3, respectively. For the
consecutive type, k = 5 through 20 corresponds to using
5 to 20 of the sentences uttered in sessions T2 and T3 one
at a time. In the recalculation method, the required memory
grows with each update, by ten sentences per update in the
batch type and by one sentence per update in the consecutive
type, in addition to the ten sentences used to create the
initial model. In the authors' method, however, only the
data for updating must be stored. Thus, for either type of
updating, the required memory for this method is smaller
than for the recalculation method.
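The entries of Table 2 follow from counting sentences: the proposed method stores only the current batch of updating data, while recalculation stores the initial ten training sentences plus all accumulated updating data. A sketch, assuming one sentence corresponds to one unit of memory:

```python
def memory_ratio(init_sentences, update_batches):
    """
    Memory needed by the proposed method relative to recalculation.
    Proposed method: only the most recent batch of updating data.
    Recalculation: initial training data plus all accumulated updating data.
    """
    batch = update_batches[-1]
    total = init_sentences + sum(update_batches)
    return batch / total

# batch type: ten sentences per update (Table 2)
batch_k1 = memory_ratio(10, [10])        # 1/2
batch_k2 = memory_ratio(10, [10, 10])    # 1/3
# consecutive type: one sentence per update
consec_k5 = memory_ratio(10, [1] * 5)    # 1/15
consec_k20 = memory_ratio(10, [1] * 20)  # 1/30
```

These reproduce the ratios 1/2, 1/3, 1/15, and 1/30 listed in Table 2.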
These results show the effectiveness of updating
speaker models using the authors' method.
5.2. Effects of resetting the threshold value
The effects of resetting the threshold value were
investigated for when the speaker model was updated
(batch type) using this method and using ten sentences
once. Figure 6 shows the FA and FR rates when the
threshold was reset using this method. The numbers of
updates k = 0, 1, 2 correspond to training/updating
performed using the ten sentences uttered in each of
sessions T1, T2, and T3. In addition, Fig. 7 shows the case
of resetting the threshold to a value at which the FA rate is
roughly 1% higher than the equal error rate in the
text-independent type and roughly 6% higher in the
text-prompted type. Figure 8 shows the case of setting the
initial threshold to such a value (roughly 1% higher in the
text-independent type and roughly 6% higher in the
text-prompted type) and then not resetting it thereafter. Figure 9
shows the FA and FR rates (and their average error rates)
when resetting the threshold value to the value of the equal
error rate for the data for updating.
Fig. 5. Verification error rates when updating the
reference models successively using each sentence.

Table 2. Memory size required by the proposed method, normalized by that of the recalculation method
Batch type (number of times updated k):        k = 1: 1/2 (= 10/20);   k = 2: 1/3 (= 10/30)
Consecutive type (number of times updated k):  k = 5: 1/15;   k = 10: 1/20;   k = 15: 1/25;   k = 20: 1/30

When the threshold is reset using this method (Fig. 6,
k = 2), the average of the FA and FR rates is roughly 40%
(2.2/5.5) in the text-independent type and roughly 30%
(2.8/9.1) in the text-prompted type of that obtained when
the threshold is set to the value of the equal error rate for
the data for updating (Fig. 9, k = 2). Compared to when
neither the model nor the threshold is updated (Fig. 6,
k = 0), it is roughly 40% (2.2/5.1) for the text-independent
type and roughly 80% (2.8/3.6) for the text-prompted type.
These results show the effectiveness of the authors' method.
In addition, a comparison
of resetting the threshold using this method (Fig. 6) with
resetting it to the value, calculated from the data for
updating, that gives the high FA rate (Fig. 7) shows that
the FR rate is lower for this method. This is because, as
updating progresses, the threshold for the open data moves
from the value that gives the high FA rate, calculated from
the data for updating, to the value that gives the equal error
rate. Moreover, when the threshold is not reset (Fig. 8), the
equal error rate is higher. As a result, the threshold value must be
reset in concert with updating the model. In Fig. 8, the FA
rate is not particularly high, since the likelihood value of
the speaker model with respect to the data of other people
becomes only somewhat larger. This is because the feature
parameter space expressed by the speaker HMM expands,
and the likelihood value of the speaker model with respect
to the speaker's own data becomes larger, as the speaker
HMM becomes more robust against utterance variations.

Fig. 6. FA and FR rates when using proposed method of
resetting the a priori threshold.

Fig. 7. FA and FR rates when resetting the a priori
threshold to where the FA rate is higher than the error
rate at the equal error threshold.

Fig. 8. FA and FR rates without resetting the a priori
threshold.

Fig. 9. FA and FR rates when resetting the a priori
threshold to the equal error threshold.
6. Conclusions
This article reported on a method to update speaker
models using a small amount of recently uttered data for
updating, and a method to set the threshold value in concert
with the updating of the speaker models. For model
updating, compared with the conventional method of
re-creating the speaker model by ML estimation, this
method, which approximates the conventional method, not
only requires less memory (roughly equal to the processing
capacity) but also achieves nearly the same performance.
The speaker verification error rate when updating the
speaker model using this method (with the threshold set
a posteriori to the value that gives the equal error rate) is
around 20% of that when no updating is performed, in both
the text-independent type and the text-prompted type.
When the threshold is also reset in addition to updating the
model, the equal error rate drops to roughly 40% for the
text-independent type and roughly 30% for the text-prompted
type of that obtained when the threshold is set to the value
of the equal error rate for the data for updating. It drops to
roughly 40% for the text-independent type and roughly 80%
for the text-prompted type compared to when the model and
threshold value are not updated.
In the future, the authors will perform experiments
with an increased number of data sessions. They will inves-
tigate methods for making robust estimates of the variance
vector of each mixture distribution for speaker HMM, in
addition to confirming the effectiveness of this method.
Acknowledgments. The authors would like to thank the
members of the Furui Special Research Office of the NTT
Human Interface Laboratories for valuable resource material.
REFERENCES
1. Furui S. Cepstral analysis technique for automatic
speaker verification. IEEE Trans Acoustics, Speech, and
Signal Processing 1981;ASSP-29:254–272.
2. Setlur A, Jacobs T. Results of a speaker verification
service trial using HMM models. Proc Eurospeech
1995;I:53–56.
3. Kurihara T, Niosaka Y, Takeda K, Abe M. Construction
of the ATR Japanese speech database for research
(Volume I: continuous speech text). 1989;TR-I-0086.
4. Matsui T, Furui S. Text-prompted speaker recognition.
Trans IEICE 1996;J79-D-II:647–656.
5. Matsui T, Furui S. Likelihood normalization using a
phoneme- and speaker-independent model for speaker
verification. Speech Communication 1995;17:109–116.
APPENDIX
FA and FR Rate Curves for Training Data
and Open Data
Figure A-1 shows the FA and FR rate curves for every
speaker, using the ten sentences uttered in session T1 as
training data and the five sentences uttered by each speaker
in sessions T2 through T4 as open data, in a text-independent
speaker verification experiment (Section 4). The solid lines
represent the FA and FR rate curves calculated from the
training data; the training data of all customers other than
the speaker in question (nine speakers) was used to obtain
the FA rate. The dotted lines represent the FA and FR rate
curves calculated from the open data of each session. Ten
speakers different from the customers of Section 4 were
used as impostors, so that the data would also be open with
respect to speakers when obtaining the FA rate. When the error
rate was comparatively low (under 25%), the difference
in the error rate curves for the training data and the
open data was larger for the FR rate than for the FA rate. The
FR rate curve for the open data tended to shift toward a
smaller threshold (to the left) when the FR rate curve for
the training data was used as a standard. In addition, the
inconsistency in the error rate curves for the open data in
each session was larger for the FR rate than for the FA rate.
The FR rate was more affected by the session-to-session
utterance variations.
With respect to the parameter w of Section 3, which
controls the rate at which the threshold approaches the
equal-error value, Fig. A-2 shows the optimal value for
each updating session and each speaker in a text-independent
speaker verification experiment (Section 4), when updating
was performed batch-style using the ten sentences uttered
in each of sessions T2 and T3. The dotted line represents
the least-squares regression line. The optimal value of w
tended to become smaller with each update. This indicates
that the shift of the FR rate curve for the open data relative
to that for the data for updating tends to shrink as the
speaker HMMs are updated and their robustness to
utterance variations increases.
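The dotted regression line in Fig. A-2 is an ordinary least-squares fit of the optimal w against the number of updates; with hypothetical values (the paper reports only the decreasing trend) it can be reproduced as:

```python
import numpy as np

# hypothetical optimal w values per update; Fig. A-2 reports a decreasing trend
k = np.array([1, 2, 3, 4])
w_opt = np.array([0.8, 0.6, 0.45, 0.3])

# least-squares regression line, as used for the dotted line in Fig. A-2
slope, intercept = np.polyfit(k, w_opt, 1)
```

A negative slope corresponds to the observed tendency of the optimal w to shrink with each update.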
Fig. A-1. FA and FR rates calculated by using training and open data for each speaker.
Fig. A-2. Optimal values of parameter w for each speaker
(w controls the convergence of the threshold).
AUTHORS (from left to right)
Tomoko Matsui (member) completed her M.S. degree in 1986 at the Department of Engineering and Intelligence, Tokyo
Institute of Technology. She joined NTT in 1986 and has been conducting research on speaker recognition at the NTT Human
Interface Laboratories ever since. She also holds a D.Eng. degree. She is a member of IEEE and the Acoustical
Society of Japan. She received the Paper Award in 1993.
Takashi Nishitani graduated in 1996 from the Department of Electronics, Tokyo Institute of Technology. He is currently
working on an M.E. degree from the graduate school at the same university.
Sadaoki Furui (member) graduated in 1968 from the Department of Engineering and Statistics, the University of Tokyo. He
received his M.S. from the graduate school of the same university in 1970. He joined the NTT Electronics and Communications
Research Center in the same year and has been conducting research on speech recognition, speaker recognition, and speech
perception ever since. He was a member of Bell Laboratories from 1978 to 1979. He is currently a professor at the Tokyo Institute
of Technology. He also holds a D.Eng. degree. He received the Yonezawa Prize in 1975, the Paper Award in 1988 and 1993,
the Achievement Award in 1990, the Science and Technology Agency Prize in 1989, and the IEEE ASSP Society Senior Award. He wrote
Digital Voice Processing; Digital Speech Processing, Synthesis, and Recognition; and Acoustics and Speech Engineering, and
he edited Advances in Speech Signal Processing. He is a fellow of the IEEE and the Acoustical Society of America.