
126 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 8, NO. 2, MARCH 2000

Utterance Verification in Continuous Speech Recognition: Decoding and Training Procedures

Eduardo Lleida, Member, IEEE, and Richard C. Rose, Member, IEEE

Abstract—This paper introduces a set of acoustic modeling and decoding techniques for utterance verification (UV) in hidden Markov model (HMM) based continuous speech recognition (CSR). Utterance verification in this work implies the ability to determine when portions of a hypothesized word string correspond to incorrectly decoded vocabulary words or out-of-vocabulary words that may appear in an utterance. This capability is implemented here as a likelihood ratio (LR) based hypothesis testing procedure for verifying individual words in a decoded string. There are two UV techniques that are presented here. The first is a procedure for estimating the parameters of UV models during training according to an optimization criterion which is directly related to the LR measure used in UV. The second technique is a speech recognition decoding procedure where the “best” decoded path is defined to be that which optimizes a LR criterion. These techniques were evaluated in terms of their ability to improve UV performance on a speech dialog task over the public switched telephone network. The results of an experimental study presented in the paper show that LR based parameter estimation results in a significant improvement in UV performance for this task. The study also found that the use of the LR based decoding procedure, when used in conjunction with models trained using the LR criterion, can provide as much as an 11% improvement in UV performance when compared to existing UV procedures. Finally, it was also found that the performance of the LR decoder was highly dependent on the use of the LR criterion in training acoustic models. Several observations are made in the paper concerning the formation of confidence measures for UV and the interaction of these techniques with statistical language models used in ASR.

Index Terms—Acoustic modeling, confidence measures, discriminative training, large vocabulary continuous speech recognition, likelihood ratio, utterance verification.

I. INTRODUCTION

IN AUTOMATIC speech recognition applications it is often necessary to provide a mechanism for verifying the accuracy of portions of recognition hypotheses. This paper describes utterance verification (UV) procedures for hidden Markov model (HMM) based continuous speech recognition that are based on a likelihood ratio (LR) criterion. Utterance verification is most often considered as a hypothesis testing problem. Existing techniques rely on a speech recognizer to produce a hypothesized word or string of words along with hypothesized word boundaries obtained through Viterbi segmentation of the utterance. A measure of confidence is then assigned to the hypothesized string, and the hypothesized word labels are accepted or rejected by comparing the confidence measure to a decision threshold. LR based UV procedures generally form this confidence measure as the ratio of the target hypothesis HMM model likelihood with respect to an alternate hypothesis model likelihood. Therefore, it is further assumed that there exists an alternate hypothesis model that is used in the likelihood ratio test. The definition of the alternate hypothesis model, how its parameters are trained, and what its role is in utterance verification have been investigated in many different contexts.

Manuscript received June 13, 1997; revised November 2, 1998. This work was supported in part under a personal grant from the DGICYT and under Grants TIC95-0884-C04-04 and TIC95-1022-C05-02 from CICYT, Spain. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Wu Chou.

E. Lleida is with the Centro Politécnico Superior, University of Zaragoza, Zaragoza, Spain.

R. C. Rose is with AT&T Laboratories—Research, Florham Park, NJ 07932 USA.

Publisher Item Identifier S 1063-6676(00)01724-7.

The techniques described in this paper address several of the shortcomings associated with this general class of utterance verification techniques. The first shortcoming that is addressed is the fact that UV is implemented as a “two-pass” procedure, where a ML decoder produces a string hypothesis which is verified by a LR based decision rule. It is argued in Section II that there is no reason why the decoder itself cannot be configured to produce a hypothesized string that optimizes a LR criterion. A “single-pass” approach to utterance verification is proposed [8]. In this approach, instead of finding an optimum sequence of HMM states to optimize a maximum likelihood (ML) criterion, the speech recognition decoder is designed to identify the state sequence that directly optimizes a LR criterion. As a result, the decoder itself produces recognition hypotheses which optimize a LR based confidence measure, and the scores produced by the decoder can themselves be combined to form a confidence measure for use in the hypothesis test.

It must be acknowledged, however, that it is difficult or impossible to make any general statements concerning the relationship between the decoded state sequence obtained from a LR based decoder and that obtained from a ML decoder. There are additional issues beyond the optimization criterion used in the decoder. The HMM acoustic model training procedures, the HMM model topology, and simplifying assumptions made in implementing the decoder can all have a significant impact on ASR performance. As a result, we are careful not to make any general claims as to the comparative ASR performance between the LR decoder and the ML decoder. The results that are presented in the paper are concerned primarily with utterance verification performance, in an attempt to demonstrate empirically that the algorithms and architectures presented in the paper are genuinely appropriate for that purpose. Issues relating to the design of the LR decoder and assumptions made for improving its efficiency and robustness are discussed in Section II.

A second shortcoming is that hypothesis testing theory in general provides no guidance for either choosing the form of the densities used in the LR test or for estimating their parameters. The theory assumes that these densities are known, and deals mainly with the problem of determining whether or not a given data set is well explained by the densities [6]. We attempt to overcome this lack of a criterion for specifying the model parameters used in the LR test by using a training procedure for estimating model parameters which also directly optimizes a likelihood ratio criterion [14]. There have been previous attempts at estimating model parameters for word spotting by maximizing a likelihood ratio criterion [13], [18]. There has also been more recent work in applying this class of techniques to more constrained tasks [12], [17]. The goal of the training procedure that is described here is to increase the average value of the LR for correctly hypothesized vocabulary words and to decrease it for false alarms; an iterative discriminative training algorithm is used to this end.

Finally, the third shortcoming of UV techniques is their susceptibility to modeling inaccuracies that are associated with HMM acoustic models. It is often the case that local mismatches between the speech utterance and the HMM model can have a catastrophic effect on the accumulated score used for making a global decision. To mitigate these effects, several word level confidence measures corresponding to non-linear weightings of frame and segment level confidence measures are investigated.

In order to motivate the work described in this paper, it is important to provide some minimal understanding of the context in which utterance verification techniques are applied. It is often the case that human–machine interfaces are configured so that a large percentage of the input utterances are ill-formed. This is the case for user-initiated human–machine dialog [8], [20] and automation of telecommunications services [19], and is certainly true in the case of machine interpretation of human–human dialog [5], [14]. Utterance verification in this context implies the ability to detect legitimate vocabulary words in an utterance that is assumed to contain words or phrases which are not explicitly modeled in the speech recognizer. However, even when input utterances tend to be well-formed and contain relatively few out-of-vocabulary (OOV) words, UV techniques can be applied to determine when decoded word hypotheses are correct. These procedures have been shown to improve performance in a number of applications where OOV utterances are relatively rare, including telephone based connected digit and command word recognition [1], [11], [16].

The UV techniques described in this paper are applied to a spontaneous speech database query task over the public switched telephone network. Callers can query the system for information concerning movies playing at local theaters. The spoken queries contained a mixture of well-formed and ill-formed utterances. Therefore, UV performance as described here reflects the ability of UV techniques not only to reject out-of-vocabulary speech but also to detect speech recognition errors.

The organization of the paper is as follows. Section II describes the utterance verification strategies and introduces the concept of the likelihood ratio decoder and a set of word level confidence measures. A training algorithm based on the likelihood ratio decoder is presented in Section III. Section IV describes the telephone based speech dialog task and the associated speech corpus used in the experiments. Finally, Section V describes an experimental study of utterance verification performance when the techniques described here are applied to the telephone based speech dialog task.

II. UTTERANCE VERIFICATION

The purpose of this section is to describe a decoding algorithm which optimizes a likelihood ratio criterion. The decoding algorithm is designed to find the best path through a combined space of target and alternate hypothesis HMM model states, and takes the form of a modified Viterbi algorithm. The section has five parts. First, the utterance verification paradigm that is used in this work is described. This includes the definition of the LR test as applied in continuous speech recognition (CSR) and UV performance measures derived under the Neyman–Pearson hypothesis testing formalism. Second, the single-pass utterance verification procedure is introduced and contrasted with the two-pass approach. The third and largest part of this section derives the expression for the decoding algorithm in a manner similar to the derivation of the Viterbi algorithm for maximum likelihood HMM decoding. Fourth, more efficient versions of the decoding procedure are derived by imposing constraints on the relationship between the target hypothesis model and alternate hypothesis model state sequences. Finally, a set of word level confidence measures are defined.

A. Utterance Verification Paradigms

It is assumed that the input to the speech recognizer is a sequence of feature vectors O = {o_1, …, o_T} representing a speech utterance containing both within-vocabulary and out-of-vocabulary words. Borrowing from the nomenclature of the signal detection literature, the within-vocabulary words will be referred to here as belonging to the class of null or “target” hypotheses, and the out-of-vocabulary words will be referred to as “imposters” or belonging to the class of alternate hypotheses. Incorrectly decoded vocabulary words appearing as substitutions or insertions in the output string from the recognizer will also be referred to as belonging to the class of alternate hypotheses. It is also assumed that the output of the recognizer is a single word string hypothesis W = w_1, …, w_M of length M. Of course, all the discussion in this section can be easily generalized to the problems of verifying one of multiple complete or partial string hypotheses produced as part of an N-best list or word lattice as well.

Under the assumptions of the Neyman–Pearson hypothesis testing framework, both the target hypothesis and alternate hypothesis densities are assumed to be known. In the context of UV, it will be assumed that the target or correct hypothesis model λ_c and the alternate model λ_a corresponding to a hypothesized vocabulary word are both hidden Markov models. A LR based hypothesis test can then be defined as:

H_0: the null hypothesis, O was generated by the target model λ_c;
H_1: the alternative hypothesis, O was generated by the alternative model λ_a;

\mathrm{LR}(O; \lambda_c, \lambda_a) = \frac{P(O \mid \lambda_c)}{P(O \mid \lambda_a)} \;\begin{cases} \ge \tau & \text{accept } H_0 \\ < \tau & \text{reject } H_0 \end{cases} \qquad (1)


Fig. 1. Two-pass utterance verification where a word string and associated segmentation boundaries that are hypothesized by a maximum likelihood CSR decoder are verified in a second stage using a likelihood ratio test.

where τ is a decision threshold. Given the target hypothesis probability P(O | λ_c), which models correctly decoded hypotheses for a given recognition unit, and the alternate hypothesis probability P(O | λ_a), which models the incorrectly decoded hypotheses, (1) describes a test which accepts or rejects the hypothesis that the observation sequence O corresponds to a legitimate vocabulary word by comparing the LR to a threshold. As the model parameters for a particular unit are not known, they have to be estimated from the training data assuming the form of the density is known.
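In practice the test in (1) is evaluated in the log domain, where the ratio becomes a difference of log likelihoods. A minimal sketch of the decision rule; the model scores and threshold below are illustrative values, not from the paper:

```python
def lr_test(log_p_target: float, log_p_alt: float, log_tau: float) -> bool:
    """Accept the null hypothesis H0 (legitimate vocabulary word) when the
    log likelihood ratio meets the decision threshold, as in (1)."""
    log_lr = log_p_target - log_p_alt
    return log_lr >= log_tau

# Illustrative scores: a confidently decoded word vs. a likely false alarm.
print(lr_test(log_p_target=-120.0, log_p_alt=-135.0, log_tau=5.0))  # True
print(lr_test(log_p_target=-130.0, log_p_alt=-128.0, log_tau=5.0))  # False
```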

While the hypothesis testing theory provides guidance for choosing the value for the threshold τ and estimating the probability of error given the probability densities, it does not provide any guidance for dealing with a number of problems that are associated with speech recognition. For example, the observation vectors might be associated with a hypothesized word that is embedded in a string of words. In this case, it is not clear which observation vectors in the continuous utterance should be associated with the hypothesized word. This is a problem that is associated with two-pass utterance verification and is addressed by the decoding algorithm described in this section. An additional problem is the fact that language models are used in CSR as an additional knowledge source to constrain network search. The effect of the language model is to reduce the average branching factor at each decision node in the search procedure. As a result, a word may be correctly decoded due to a low branching factor at a given decision node even if the acoustic probability P(O | λ_c) is relatively weak. In those cases where a correct word hypothesis is decoded on the strength of the language model despite a weak acoustic model, the hypothesis test in (1) may prove misleading. This is an important problem which is not directly addressed by the work described in this paper.

The use of word error rate or word accuracy for measuring performance in ASR assumes that all input words are from within the prespecified vocabulary and that all words are decoded by the recognizer with equal confidence. The measures that are used here to evaluate utterance verification procedures are borrowed from the signal detection field. The most familiar of these measures is the receiver operating characteristic (ROC) curve. There are two types of errors associated with utterance verification systems. Type I errors or false rejections correspond to words that are correctly decoded by the recognizer but rejected by the utterance verification process. Type II errors or false alarms correspond to incorrectly decoded word insertions and substitutions which are generated by the recognizer and also accepted by the utterance verification system. Sweeping out a range of threshold settings in (1) results in a range of operating points, where each operating point corresponds to a different trade-off between type I and type II errors. The ROC corresponds to a plot of the type I versus type II errors plotted over a range of threshold settings. Another performance measure is used in Section V to describe the sensitivity of the utterance verification procedure with respect to the setting of the threshold τ in (1). This is a plot of the sum of type I and type II errors plotted versus τ.
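The threshold sweep described above can be computed directly from scored examples. A minimal sketch counting type I and type II errors over a range of thresholds; the log-LR scores and labels below are hypothetical:

```python
def error_curve(scores, labels, thresholds):
    """For each threshold tau, count type I errors (correctly decoded words
    rejected because LR < tau) and type II errors (false alarms accepted
    because LR >= tau). labels[k] is True for correctly decoded words."""
    curve = []
    for tau in thresholds:
        type1 = sum(1 for s, ok in zip(scores, labels) if ok and s < tau)
        type2 = sum(1 for s, ok in zip(scores, labels) if not ok and s >= tau)
        curve.append((tau, type1, type2))
    return curve

# Hypothetical word-level log-LR scores; True marks correctly decoded words.
scores = [4.2, 3.1, -0.5, 5.0, -2.0, 1.0]
labels = [True, True, False, True, False, False]
for tau, t1, t2 in error_curve(scores, labels, [-3.0, 0.0, 2.0, 6.0]):
    print(tau, t1, t2)
```

Plotting type I against type II errors across the sweep yields the ROC; plotting their sum against τ yields the sensitivity measure used in Section V.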

B. Utterance Verification Procedures

1) Two-Pass Procedure: Utterance verification can be applied as a procedure for verifying whether the observation vectors in an utterance associated with individual word hypotheses generated by a speech recognition decoder correspond to the hypothesized word label. In this case, speech recognition and utterance verification are considered to be two separate procedures. For a continuous utterance, maximum likelihood decoding relies on a set of “recognition” hidden Markov models to produce a sequence of hypothesized word labels and hypothesized word boundaries. Utterance verification is then applied as a second pass, applying a hypothesis test to the word hypotheses produced by the decoder. This procedure is summarized in Fig. 1 and will be referred to here as “two-pass” utterance verification.

In general terms, the hypothesis test is based on the computation of the likelihood ratio measure given in (1); however, several empirically determined measures have been applied to improve utterance verification performance. Several of these “confidence measures” are described in Section II-E. These confidence measures correspond to different ways of integrating local likelihood ratios and are applied in Section V to both the two-pass and the one-pass utterance verification strategies.

2) One-Pass Procedure: There is no fundamental reason why the processes of generating a hypothesized word string and generating confidence measures for verifying hypothesized words must be performed in two stages. Both processes could be performed in a single stage if the ASR decoder was designed to optimize a criterion related to the confidence measure used in UV. It will be shown in this section that one can easily derive a decoding algorithm that optimizes a LR criterion and, as a result, provides LR based word level confidence scores directly. Fig. 2


Fig. 2. One-pass utterance verification where the optimum decoded string is that which directly maximizes a likelihood ratio criterion.

shows a simple block diagram illustrating how the functions of decoding and verification are integrated into a single procedure.

C. Likelihood Ratio Decoder

Given an HMM model λ and an observation sequence O = {o_1, …, o_T}, it is well known that the Viterbi algorithm can be used to solve the maximum likelihood decoding problem. This simply means that there is a straight-forward mechanism for obtaining the HMM state sequence S = {s_1, …, s_T} which maximizes the likelihood of the observations with respect to the model, or

\hat{S} = \arg\max_{S} P(O, S \mid \lambda) \qquad (2)
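The ML decoding problem in (2) is solved by the standard Viterbi recursion. A minimal log-domain sketch over a discrete-observation HMM; the toy two-state model below is illustrative only:

```python
import math

def viterbi(obs, pi, A, B):
    """Return the state sequence maximizing P(O, S | lambda), as in (2).
    pi[i]: initial probabilities, A[k][i]: transition from state k to i,
    B[i][o]: probability of state i emitting discrete symbol o."""
    N = len(pi)
    delta = [math.log(pi[i]) + math.log(B[i][obs[0]]) for i in range(N)]
    back = []
    for o in obs[1:]:
        # Best predecessor for each state, then the updated path scores.
        prev = [max(range(N), key=lambda k: delta[k] + math.log(A[k][i]))
                for i in range(N)]
        delta = [delta[prev[i]] + math.log(A[prev[i]][i]) + math.log(B[i][o])
                 for i in range(N)]
        back.append(prev)
    state = max(range(N), key=lambda k: delta[k])
    path = [state]
    for prev in reversed(back):  # trace back through the lattice
        state = prev[state]
        path.append(state)
    return path[::-1]

# Toy 2-state model over two discrete observation symbols.
pi = [0.8, 0.2]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
print(viterbi([0, 0, 1, 1], pi, A, B))  # [0, 0, 1, 1]
```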

There are many tutorial references on HMMs that can be consulted for a discussion of issues related to HMM decoding and training, including [10]. It is shown here that, using similar reasoning, one can derive a mechanism which represents a solution to the likelihood ratio decoding problem. If λ_c and λ_a in (1) are both HMMs, then the LR in (1) can be rewritten in terms of target and alternate model state sequences S_c and S_a as

\mathrm{LR}(O; \lambda_c, \lambda_a) \simeq \frac{P(O, S_c \mid \lambda_c)}{P(O, S_a \mid \lambda_a)} \qquad (3)

A discrete state sequence through the model space defined by λ_c and λ_a can be written as

S = \{s_1, s_2, \ldots, s_T\}, \qquad s_t = (i_t, j_t) \qquad (4)

where T in (4) is the number of observations in the utterance and s_t = (i_t, j_t) is a pair of integer indices representing the HMM state at time t.

The likelihood ratio decoding problem thus corresponds to obtaining the state sequence Ŝ that maximizes the likelihood ratio of the data with respect to λ_c and λ_a, or

\hat{S} = \arg\max_{S} \mathrm{LR}(O, S; \lambda_c, \lambda_a) \qquad (5)

= \arg\max_{S} \frac{\pi_{i_1} b_{i_1}(o_1) \prod_{t=2}^{T} a_{i_{t-1} i_t} b_{i_t}(o_t)}{\bar{\pi}_{j_1} \bar{b}_{j_1}(o_1) \prod_{t=2}^{T} \bar{a}_{j_{t-1} j_t} \bar{b}_{j_t}(o_t)} \qquad (6)

Equation (6) expands the observation probability in terms of the HMM parameters. These parameters include the observation densities b_{i_t}(o_t), which correspond to the probability of state i_t emitting observation vector o_t at time t. The HMM parameters also include the state transition probabilities a_{i_{t-1} i_t} from HMM state i_{t-1} to state i_t, and the initial state probabilities π_{i_1}.

Fig. 3. A possible three-dimensional HMM search space. An example of an optimum state sequence according to the criterion given in (5) is shown as a sequence of indices in the combined target model–alternate model state lattice.

The process of identifying the optimum state sequence in the combined target hypothesis model and alternate hypothesis model state space is illustrated by the diagram in Fig. 3. The optimal state sequence can correspond to an arbitrary path in this space under the constraints that it originates from one of a set of initial states at time t = 1 and terminates in a final state at time t = T. Fig. 3 also illustrates the case where the optimum target and alternate hypothesis model state sequences are identified separately, as is the case in a two-pass approach. In this case, two independent search procedures are performed through the two-dimensional state space lattices along the target model and alternate model planes in Fig. 3. It is clear that the likelihood ratio decoding provides a much larger space of possible solutions at the expense of an exponential increase in computational complexity.

A Viterbi-like decoding algorithm can be derived to solve for the state sequence in (6). This follows immediately after defining the following parameters from (6):

\tilde{\pi}_{(i,j)} = \frac{\pi_i}{\bar{\pi}_j} \qquad (7)

\tilde{a}_{(k,l)(i,j)} = \frac{a_{ki}}{\bar{a}_{lj}} \qquad (8)

\tilde{b}_{(i,j)}(o_t) = \frac{b_i(o_t)}{\bar{b}_j(o_t)} \qquad (9)

The expression for finding the best path Ŝ through the combined state space lattice shown in Fig. 3 can be obtained from (6) in terms of these new parameters as

\hat{S} = \arg\max_{S} \tilde{\pi}_{(i_1,j_1)} \tilde{b}_{(i_1,j_1)}(o_1) \prod_{t=2}^{T} \tilde{a}_{(i_{t-1},j_{t-1})(i_t,j_t)} \tilde{b}_{(i_t,j_t)}(o_t) \qquad (10)

where \tilde{b}_{(i,j)}(o_t) is referred to here as the frame likelihood ratio. Equation (10) provides a way to define a one-pass


search algorithm for recognition/verification. Following the same inductive logic that is used in the Viterbi algorithm, we can define the quantity δ_t(i, j) as the highest probability obtained for a single state sequence through our three-dimensional lattice over the first t observations and terminating in target model state i and alternate hypothesis model state j. Just as in the standard Viterbi algorithm, this probability can be calculated by induction:

\delta_t(i, j) = \max_{1 \le k \le N_c,\; 1 \le l \le N_a} \left[ \delta_{t-1}(k, l)\, \tilde{a}_{(k,l)(i,j)} \right] \tilde{b}_{(i,j)}(o_t) \qquad (11)

where N_c and N_a are the number of states in the target and alternate hypothesis models, respectively.
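The recursion in (11) can be sketched as a Viterbi search over pairs of target and alternate states, with frame likelihood ratios replacing emission probabilities. Scores are kept in the log domain, so ratios become differences; the toy single-state models below are illustrative only:

```python
import math
from itertools import product

def lr_viterbi(obs, target, alt):
    """Best pair-state sequence maximizing the accumulated likelihood
    ratio, as in (11). Each model is a tuple (pi, A, B) with discrete
    emissions: pi[i] initial, A[k][i] transition, B[i][o] emission."""
    def logscore(model, prev, cur, o, first):
        pi, A, B = model
        trans = math.log(pi[cur]) if first else math.log(A[prev][cur])
        return trans + math.log(B[cur][o])

    Nc, Na = len(target[0]), len(alt[0])
    states = list(product(range(Nc), range(Na)))
    # delta[(i, j)]: best accumulated log likelihood ratio ending in (i, j).
    delta = {(i, j): logscore(target, 0, i, obs[0], True)
                     - logscore(alt, 0, j, obs[0], True)
             for i, j in states}
    paths = {s: [s] for s in states}
    for o in obs[1:]:
        new_delta, new_paths = {}, {}
        for i, j in states:
            def step(s):
                # Target model score in the numerator, alternate model
                # score in the denominator: a log difference.
                return (delta[s]
                        + logscore(target, s[0], i, o, False)
                        - logscore(alt, s[1], j, o, False))
            best = max(states, key=step)
            new_delta[(i, j)] = step(best)
            new_paths[(i, j)] = paths[best] + [(i, j)]
        delta, paths = new_delta, new_paths
    best = max(states, key=lambda s: delta[s])
    return paths[best], delta[best]

# Toy single-state models over two discrete symbols (illustrative only).
target = ([1.0], [[1.0]], [[0.9, 0.1]])
alt = ([1.0], [[1.0]], [[0.5, 0.5]])
path, score = lr_viterbi([0, 0, 1], target, alt)
print(path)  # [(0, 0), (0, 0), (0, 0)]
```

The exhaustive max over all pair states in each step makes the N_c × N_a blow-up of the unconstrained search directly visible.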

This new criterion allows information concerning the alternative hypothesis to be introduced directly in the decoder. However, some constraints on the search space must be imposed in practice to achieve manageable computational complexity.

D. Efficient LR Decoder

There are two issues that must be addressed if the LR decoding algorithm of the previous section is to be applicable to actual speech recognition tasks. The first issue is computational complexity. The unconstrained search procedure as specified by (11) implies a search through an (N_c × N_a)-dimensional state space, representing an increase in complexity of a factor of N_a over ML decoding. The following discussion suggests that constraints can be applied to the relationship between target hypothesis model and alternate hypothesis model state indices. The result of this discussion is the definition of a more efficient decoding algorithm which requires evaluating a far smaller number of state sequences than would be evaluated in (11). The second issue is the definition of the alternate hypothesis model. There are many hazards associated with using a LR criterion in a decoder. We have to be concerned with the wide dynamic range that is generally associated with any likelihood ratio. It is also important that the alternate hypothesis model provide a good representation of imposter events in the decoder. These issues are discussed below.

Fig. 4. An example of how state level constraints, which constrain the relationship between target and alternate hypothesis model state sequences, may be applied to reducing the complexity of the likelihood ratio decoder.

1) LR Decoding Constraints: The implications of some practical decoding constraints for reducing the computational complexity of the likelihood ratio decoder are discussed here. The following are examples of simple constraints that can be applied:

Unit Level Constraint — For each unit corresponding to a sub-word or word model, the target and alternate model components of the three dimensional path can both be constrained to start at a common unit start time and end at a common unit end time. This is equivalent to saying that the target model and alternate model must occupy their unit initial states and unit final states at the same time instants. With this constraint, the unconstrained state sequence illustrated in Fig. 3 can be segmented into concatenated unit level state sequences:

S = \{S^1, S^2, \ldots, S^M\} \qquad (12)

S^m = \{s_{t_{m-1}+1}, \ldots, s_{t_m}\} \qquad (13)

where S^m corresponds to the state sequence for unit m and t_m denotes the end time of unit m.

State Level Constraint — State level constraints can also

be imposed to further restrict the relationship between the target and alternate model state sequences for a given unit. This results in a further reduction in the number of candidate sequences that need be evaluated during search. In posing these constraints, it is assumed that the topology of HMM models λ_c and λ_a remains fixed. However, the network search can be constrained so that a correspondence exists between the decoded target model and alternate model state indices, i_t and j_t, at time t:

i_t - \epsilon_1 \le j_t \le i_t + \epsilon_2 \qquad (14)

where ε_1 and ε_2 are constants which define the dimensions of the search window.
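The search window in (14) amounts to a simple admissibility check on candidate pair states, which a decoder can apply when enumerating transitions. A minimal sketch with illustrative offsets:

```python
def in_window(i_t: int, j_t: int, eps1: int, eps2: int) -> bool:
    """Pair state (i_t, j_t) is admissible when the alternate state index
    stays within the window around the target state index, as in (14)."""
    return i_t - eps1 <= j_t <= i_t + eps2

# With eps1 = eps2 = 1, the two state indices may drift apart by at most one.
print([(i, j) for i in range(1, 4) for j in range(1, 4)
       if in_window(i, j, 1, 1)])
```

Filtering the pair-state lattice with such a predicate prunes most of the (N_c × N_a)-dimensional search space while keeping the paths the constraint permits.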

Fig. 4 illustrates how the state level constraints of (14) can reduce the complexity of the search procedure when $\delta_1 = \delta_2 = 1$. Equation (14) states that given a time $t$ and state pair $(q_t, s_t)$, the next state index in the $s$ plane must be equal to or greater than the previous one in the $s$ plane, but not greater than the state in the $q$ plane plus an offset $\delta_1$. The same condition holds for the next state in the $q$ plane, but with an offset of $\delta_2$. The vertices of the shaded rectangles in Fig. 4 show the allowed states in the three dimensional path when $\delta_1 = \delta_2 = 1$. Starting at time $t$ in the state (1,1), the allowed transitions are (1,1), (1,2), (2,1), and (2,2). Suppose that at time $t+1$ the transition is to the state (2,1); in this

LLEIDA AND ROSE: UTTERANCE VERIFICATION IN CONTINUOUS SPEECH RECOGNITION 131

case the allowed transitions from this state, applying (14), are (2,1), (2,2), and (2,3). In this way, the maximum difference between state indices is one. State-level constraints can be defined which result in a minimal extra decoder complexity beyond that of the ML decoder. Suppose that $\lambda^{c}$ and $\lambda^{a}$ are HMMs with identical topologies, and the state constraint is defined so that $q_t = s_t$ for each $t$. In this way, the single optimum state sequence is decoded by applying the state constraint to (6)

$\Theta^{*} = \operatorname*{argmax}_{\Theta} \prod_{t=1}^{T} \frac{a_{q_{t-1} q_t}\, b_{q_t}(y_t)}{a^{a}_{q_{t-1} q_t}\, b^{a}_{q_t}(y_t)}$   (15)

As a result, the process of identifying the optimum state sequence can take the form of a modified Viterbi algorithm where the recursion at the frame level is defined as

$\phi_t(j) = \max_{i} \left[ \phi_{t-1}(i)\, \frac{a_{ij}}{a^{a}_{ij}} \right] \frac{b_j(y_t)}{b^{a}_{j}(y_t)}$   (16)

The accumulated path score $\phi_t(j)$ obtained in the Viterbi algorithm corresponds to a measure of confidence in the path terminating in state $j$ at time $t$. Note that (16) is equivalent to the standard Viterbi algorithm applied to a modified HMM with transition probabilities $a_{ij}/a^{a}_{ij}$, initial state probabilities $\pi_i/\pi^{a}_i$, and observation densities $b_j(y_t)/b^{a}_j(y_t)$. The algorithm of (16) has been used in the experiments described in Section V for two purposes. The first is to obtain string hypotheses decoded according to a likelihood ratio criterion. The second purpose is to "rescore" utterance segmentations obtained from a maximum likelihood decoder for the purpose of providing improved confidence measures.
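A minimal sketch of the recursion in (16): a standard Viterbi pass over ratio-transformed transition and emission scores, carried out in log space. The function name and the uniform initial-state assumption are ours, not the paper's.

```python
import numpy as np

def lr_viterbi(log_a_c, log_a_a, log_b_c, log_b_a):
    """Viterbi recursion of (16) in log space: a standard Viterbi pass over a
    'modified' HMM whose log transition scores are log a_c - log a_a and whose
    log emission scores are log b_c - log b_a (target minus alternate).
    log_b_c, log_b_a: arrays of shape (T, J) of per-frame log emissions.
    Returns the best state path and its accumulated log likelihood ratio."""
    log_a = log_a_c - log_a_a          # modified transition scores (J, J)
    log_b = log_b_c - log_b_a          # modified emission scores (T, J)
    T, J = log_b.shape
    phi = np.full((T, J), -np.inf)
    back = np.zeros((T, J), dtype=int)
    phi[0] = log_b[0]                  # uniform initial state scores assumed
    for t in range(1, T):
        scores = phi[t - 1][:, None] + log_a      # candidate (i -> j) scores
        back[t] = scores.argmax(axis=0)
        phi[t] = scores.max(axis=0) + log_b[t]
    path = [int(phi[-1].argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return path[::-1], float(phi[-1].max())
```

The accumulated score returned alongside the path plays the role of $\phi_t(j)$ at the terminal frame: a path-level confidence rather than a raw likelihood.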

2) Definition of Alternative Models: The alternative hypothesis model has two roles in utterance verification. The first is to reduce the effect of sources of variability on the confidence measure. If the probabilities of the null hypothesis model and the alternative hypothesis model are similarly affected by some systematic variation in the observations, then forming the likelihood ratio should cancel the effects of the variability. The second role of the alternate model is more specifically to represent the incorrectly decoded hypotheses that are frequently confused with a given lexical item. The issues relating to how the structure of the alternate hypothesis model is defined and how the parameters of the model are estimated have yet to be addressed. The structure used for the alternate model in this work is motivated here, and the techniques used for estimating alternate model parameters are described in Section III.

It is important to note that, since the single-pass decoding procedure produces an output string which maximizes a likelihood ratio, the alternate hypothesis model affects not only the hypothesis test performed as part of the utterance verification procedure, but also the accuracy of the decoded string. If there is any portion of the acoustic space that the alternate model does not describe well, speech recognition performance can be reduced. As in any hypothesis testing problem, it is difficult to give precise guidelines for choosing the form of the alternate hypothesis model when the true distributions are unknown. However, one can say that the following should be true. First, the alternate hypothesis distribution must somehow "cover" the entire space of out-of-vocabulary lexical units. Observation vectors corresponding to outliers of the alternate hypothesis distribution can result in dramatic swings of the LR, which could in turn result in decoding errors. Second, if OOV utterances that are easily confused with the vocabulary words are to be detected, the alternate hypothesis model must provide a more detailed representation of the utterances that are likely to be decoded as false alarms for individual vocabulary words.

One way to satisfy the conditions outlined above for the alternate model is to define the alternate hypothesis probability as a linear combination of two different models

$P(Y \mid \lambda^{a}_{u}) = w_1\, P(Y \mid \lambda^{b}) + w_2\, P(Y \mid \lambda^{i}_{u})$   (17)

where $w_1$ and $w_2$ are linear weights. The purpose of $\lambda^{b}$, referred to here as the background alternative model, is to provide a broad representation of the feature space. This broad representation serves to reduce the dynamic range of the likelihood ratio, and allows $\lambda^{a}_{u}$ to satisfy the condition mentioned above of "covering" the larger space of input utterances. In the experiments described in Section V, a single background alternate hypothesis model is shared amongst all "target" HMM models. The purpose of $\lambda^{i}_{u}$, referred to here as an imposter alternate hypothesis model, is to provide a detailed model of the decision regions associated with each individual HMM. Imposter models are trained to represent individual words or sub-word units, thus satisfying the above condition that $\lambda^{i}_{u}$ provides a detailed representation of errors associated with particular words.
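The combination in (17) can be sketched in log space with a numerically stable log-sum-exp; the weight values below are placeholders, since in the paper they are estimated by the training procedure of Section III.

```python
import numpy as np

def log_alt_likelihood(log_p_bg, log_p_imp, w_bg=0.5, w_imp=0.5):
    """Log of the alternate hypothesis probability in (17): a linear
    combination of a shared background model and a unit-specific imposter
    model, evaluated stably in log space. Weight values are placeholders."""
    return np.logaddexp(np.log(w_bg) + log_p_bg, np.log(w_imp) + log_p_imp)
```

Working in log space avoids underflow when the per-segment likelihoods themselves are tiny products of frame probabilities.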

E. Confidence Measures

The issue of modeling inaccuracies associated with HMM acoustic models was alluded to in Section I. It was suggested there that modeling errors may result in extreme values in local likelihood ratios which may exert undue influence at the word or phrase level. In order to minimize these effects, we investigated several word level likelihood ratio based confidence measures that can be computed using a non-uniform weighting of sub-word level confidence measures. These word level measures can be applied to the likelihood ratio scores produced during the second pass of the two-pass UV procedure in Fig. 1 or to the scores obtained directly from the one-pass procedure illustrated in Fig. 2.

Let $C(u)$ and $W(w)$ be the confidence measures for phonetic unit $u$ and word $w$, respectively. A sentence $S$ is composed of a sequence of words $S = w_1, \ldots, w_M$, where $w_i$ is the $i$th word in $S$, and a word is composed of a phonetic baseform $w_i = u_1, \ldots, u_N$, where $u_j$ is the $j$th sub-word unit in $w_i$. A sub-word unit based confidence measure can be computed from the likelihood ratio between the correct and alternative hypothesis models. The unweighted unit level likelihood ratio score for a unit $u_j$ decoded over a segment $Y_{t_b}^{t_e} = y_{t_b}, \ldots, y_{t_e}$ can be obtained directly from the LR decoder or computed from the second pass of the two pass decoder as

$LR(u_j) = \left[ \frac{P(Y_{t_b}^{t_e} \mid \lambda^{c}_{u_j})}{P(Y_{t_b}^{t_e} \mid \lambda^{a}_{u_j})} \right]^{1/T_j}$   (18)

132 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 8, NO. 2, MARCH 2000

where $T_j = t_e - t_b + 1$ is the number of frames in the decoded segment. However, as with any likelihood ratio, $LR(u_j)$ can exhibit a wide dynamic range, which is undesirable if $LR(u_j)$ is used to form a larger word level score. One way to limit the dynamic range of the sub-word confidence measure is to use a sigmoid function

$C(u_j) = \frac{1}{1 + \exp\bigl(-\alpha_{u_j}\,(LR(u_j) - \beta_{u_j})\bigr)}$   (19)

where $\alpha_{u_j}$ defines the slope of the function and $\beta_{u_j}$ is a shift. Using the sigmoid, the range of the sub-word confidence measure is compressed to fall within the interval [0,1]. While the notation in (19) suggests that a separate function could be defined for each unit by specifying separate $\alpha_{u_j}$ and $\beta_{u_j}$, only a single sigmoid is used for the experiments described in Section V.
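The sigmoid compression of (19) is a one-liner; the default slope and shift below are placeholders, since the paper does not list its single shared setting here.

```python
import math

def unit_confidence(lr_score, alpha=1.0, beta=0.0):
    """Sigmoid weighting of (19): compresses a unit level likelihood ratio
    score into [0, 1]. The default slope (alpha) and shift (beta) are
    placeholders for the paper's single shared setting."""
    return 1.0 / (1.0 + math.exp(-alpha * (lr_score - beta)))
```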

In this work, it is assumed that the system produces a decoded word string where a confidence measure is assigned by the system to each hypothesized word. There are many different ways that word level confidence measures can be formed by combining sub-word confidence scores. It is important to prevent extreme values of the sub-word level likelihood ratio scores from dominating the word level confidence measures. These extreme values can often occur when a good local match has occurred in the decoding process even when the overall utterance has been incorrectly recognized. As a result, the best word level confidence scores tend to assign reduced weight to those extremes in sub-word level scores.

Word level confidence measures corresponding to the arithmetic and geometric means of both the unweighted unit level likelihood ratio scores, $LR(u_j)$, and the sigmoid weighted unit level likelihood ratio scores, $C(u_j)$, are compared. The following measures are defined for a word $w$ composed of $N$ sub-word units:

$W_A(w) = \frac{1}{N} \sum_{j=1}^{N} LR(u_j)$   (20)

$W_G(w) = \Bigl[ \prod_{j=1}^{N} LR(u_j) \Bigr]^{1/N}$   (21)

$\overline{W}_A(w) = \frac{1}{N} \sum_{j=1}^{N} C(u_j)$   (22)

$\overline{W}_G(w) = \Bigl[ \prod_{j=1}^{N} C(u_j) \Bigr]^{1/N}$   (23)

where $W_A(w)$ and $W_G(w)$ are the arithmetic and geometric means of the unweighted sub-word level confidence scores, and $\overline{W}_A(w)$ and $\overline{W}_G(w)$ are the arithmetic and geometric means of the sigmoid weighted sub-word scores [9]. It was noted in anecdotal experiments that the geometric average tended to have a similar effect as the confidence measure

$W_{\gamma}(w) = \Bigl[ \frac{1}{N} \sum_{j=1}^{N} C(u_j)^{-\gamma} \Bigr]^{-1/\gamma}$   (24)

defined in [7]. This corresponds to a weighted average of the inverse of the sub-word confidence scores, which approximates a min for large positive values of $\gamma$ and approximates a max for large negative values of $\gamma$.
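The word level combinations and the soft-min behavior of (24) can be sketched as follows, assuming positive sub-word scores; the function names and example values are illustrative.

```python
import math

def word_confidence(scores, mode="geo"):
    """Arithmetic (20)/(22) or geometric (21)/(23) mean of positive
    sub-word scores; the mode names here are ours, not the paper's."""
    n = len(scores)
    if mode == "arith":
        return sum(scores) / n
    if mode == "geo":
        return math.exp(sum(math.log(s) for s in scores) / n)
    raise ValueError(mode)

def soft_min(scores, gamma):
    """Weighted average of inverse scores as in (24): approaches the minimum
    score for large positive gamma and the maximum for large negative gamma."""
    n = len(scores)
    return (sum(s ** (-gamma) for s in scores) / n) ** (-1.0 / gamma)

scores = [0.9, 0.8, 0.1]  # one poorly matching sub-word unit
```

On this example the geometric mean is pulled down by the outlying sub-word score far more than the arithmetic mean, which matches the observation that it behaves like the soft-min of (24) for large positive $\gamma$.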

III. LIKELIHOOD RATIO-BASED TRAINING

A training technique is described for adjusting the parameters of $\lambda^{c}$ and $\lambda^{a}$ in the likelihood ratio test to maximize a criterion that is directly related to (1). The notion behind the method is that adjusting the model parameters to increase this confidence measure on training data will provide a better measure of confidence for verifying word hypotheses during the normal course of operation for the service. The goal of the training procedure is to increase the average value of the likelihood ratio for correct hypotheses and to decrease its average value for false alarms. This section begins by deriving a log likelihood ratio based cost function for estimating model parameters. Update equations for the model parameters, obtained by optimizing this cost function using a gradient descent procedure, are then derived. Finally, issues relating to the implementation of the training procedure are discussed. In particular, since the estimated models will be used in the single pass decoding/verification strategy, we discuss how the estimation procedure is based on this one-pass strategy.

A. Likelihood Ratio Based Cost Function

LR based training is a discriminative training algorithm that is based on a cost function which approximates a log likelihood ratio. In practice, the distance measure which underlies the cost function is approximated for a hypothesized unit $u$ by first obtaining an optimum sequence of states using the single pass decoding strategy under the assumptions used in obtaining (16). A frame based distance defined for each state transition in the sequence is given by

$d_t(u) = \log \frac{a_{q_{t-1} q_t}\, b_{q_t}(y_t)}{a^{a}_{q_{t-1} q_t}\, b^{a}_{q_t}(y_t)}$   (25)

The segment based distance is obtained by averaging the frame based distances as

$D(u) = \frac{1}{t_e - t_b + 1} \sum_{t = t_b}^{t_e} d_t(u)$   (26)

where $t_b$ and $t_e$ are the initial and final frames of the speech segment $Y_{t_b}^{t_e}$ decoded as unit $u$.

A gradient descent procedure is used to iteratively adjust the model parameters as new utterances are presented to the training procedure [14]. A smooth cost function corresponding to the unit level distance in (26) that is differentiable with respect to our model parameters is required [7]. Define the cost function $l(u)$ for unit $u$ using a sigmoid function

$l(u) = \frac{1}{1 + \exp\bigl(-\alpha\, I(u)\,(D(u) - \beta)\bigr)}$   (27)

where the indicator function $I(u)$ is $-1$ for a unit decoded as an actual vocabulary word (true hit) and $+1$ for a false alarm, $\alpha$ defines the slope of the function, and $\beta$ is a shift. The discriminative training procedure based on the cost function in


(27) is implemented in four steps. First, one or more unit hypotheses are generated along with the associated unit endpoints by the single-pass decoder. Second, the unit hypothesis decoded for an utterance is labeled as corresponding to an actual occurrence of a vocabulary word (true hit) or a false alarm. Third, the cost function given in (27) is computed using the probabilities estimated from the target and alternate hypothesis models. Finally, a gradient update is performed on the expected cost

$\lambda^{(k+1)} = \lambda^{(k)} - \epsilon\, \nabla_{\lambda}\, E[l(u)]\big|_{\lambda = \lambda^{(k)}}$   (28)

where $\lambda^{(k)}$ is the model computed at the $k$th update of the gradient descent procedure, and $\epsilon$ is a learning rate constant.

The gradient in (28) is taken with respect to the model parameters that are to be re-estimated. In the experiments described in Section V, continuous mixture observation density HMMs are used in recognition. The set of re-estimated parameters includes Gaussian means, variances, and mixture weights for both the target and alternate hypothesis HMMs. The expectation in (28) is approximated as the average cost computed over all occurrences of the unit in the training data

$E[l(u)] \approx \frac{1}{N_u} \sum_{n=1}^{N_u} l(u_n)$   (29)

where $N_u$ is the number of occurrences of the unit $u$ in the training set. The average cost function is a soft count of the number of Type I and Type II errors assuming that the decision threshold is $\tau = \beta$. Imposters with scores greater than τ (Type II) and targets with scores lower than τ (Type I) tend to increase the average cost function. Therefore, minimizing this function reduces the misclassification between targets and imposters. The average cost function is minimized by the training procedure, which means that the average sub-word confidence measure is increased for correct hypotheses and decreased for false alarms of the unit $u$.

It is well known that the gradient of the cost function with respect to the segment level score, $\partial l(u) / \partial D(u)$, is centered about $\beta$. As a result, only speech segments with segment level scores in the vicinity of τ will have a significant effect on updating parameters in the gradient update procedure. For segments that are very far from the decision boundary, the magnitude of the gradient will be very small, resulting in a negligible impact on the re-estimated parameters.
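The soft-count cost and its gradient with respect to the segment score can be sketched to illustrate why only segments near the threshold matter; the slope value used below is illustrative.

```python
import math

def unit_cost(D, indicator, alpha=4.0, beta=0.0):
    """Sigmoid cost of (27): indicator is +1 for a false alarm and -1 for a
    true hit, so the cost approaches 1 when the unit falls on the wrong side
    of the threshold beta. The alpha and beta values are illustrative."""
    return 1.0 / (1.0 + math.exp(-alpha * indicator * (D - beta)))

def cost_gradient_wrt_score(D, indicator, alpha=4.0, beta=0.0):
    """d(cost)/dD = alpha * indicator * l * (1 - l): peaked at D = beta and
    vanishing for segments far from the decision boundary."""
    l = unit_cost(D, indicator, alpha, beta)
    return alpha * indicator * l * (1.0 - l)

# The gradient magnitude is largest near the threshold and negligible far away.
near_boundary = abs(cost_gradient_wrt_score(0.05, +1))
far_from_boundary = abs(cost_gradient_wrt_score(5.0, +1))
```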

B. HMM Parameter Update Equations

The gradient update equations are derived for the likelihood ratio training procedure outlined in Section III-A. We begin by writing the expression for the gradient of the expected cost given in (29) with respect to a generic model parameter $\theta^{(k)}_m$, where $m$ refers to the $m$th HMM parameter to be updated (mixture weights, means, and variances), and the index $k$ refers to the target or alternative hypothesis HMM models. Then, update equations will be derived for each individual model parameter that is associated with the observation densities for both the target and alternate hypothesis HMM models. Given a segment of speech $Y_{t_b}^{t_e}$, the gradient of the cost function with respect to an HMM parameter can be written as

$\frac{\partial l(u)}{\partial \theta^{(k)}_m} = \frac{\partial l(u)}{\partial D(u)}\, \frac{\partial D(u)}{\partial \theta^{(k)}_m} = \alpha\, I(u)\, l(u)\bigl(1 - l(u)\bigr)\, \frac{\partial D(u)}{\partial \theta^{(k)}_m}$   (30)

The indicator function dictates the direction of the gradient depending on whether the decoded observations were correctly or incorrectly decoded and whether the re-estimated parameter is associated with the target or the alternate hypothesis model. Since $D(u)$ is a log ratio of target to alternate likelihoods, the overall sign of the update for a parameter of model $k$ is

$I^{(k)}(u) = \begin{cases} -1, & u \text{ a true hit and } k = \text{target, or } u \text{ a false alarm and } k = \text{alternate} \\ +1, & u \text{ a true hit and } k = \text{alternate, or } u \text{ a false alarm and } k = \text{target} \end{cases}$   (31)

It will be assumed here that the target and alternate hypothesis models are both continuous mixture observation density HMMs. Hence, the observation densities in (30) represent the complete set of parameters to be re-estimated in the gradient update procedure

$\theta^{(k)} = \bigl\{ b^{(k)}_j(\cdot) \bigr\}_{j=1}^{J}, \qquad k \in \{\text{target}, \text{alternate}\}$   (32)

where

$b_j(y) = \sum_{m=1}^{M} c_{jm}\, \mathcal{N}(y;\, \mu_{jm}, \Sigma_{jm})$   (33)

is a mixture of Gaussian densities. In (32), $J$ is the number of states, and the $m$th Gaussian mixture component of state $j$ is characterized by a mean vector $\mu_{jm}$, a diagonal covariance matrix $\Sigma_{jm}$, and a mixture weight $c_{jm}$. There are a variety of gradient based HMM model reestimation procedures in the literature, and the reestimation equations that follow are similar to those derived as part of discriminative training procedures [2]–[4]. The gradient is taken with respect to the transformed parameters $\tilde{c}_{jm}$ and $\tilde{\sigma}_{jm}$ and the mean values $\mu_{jm}$, where the transformations are

$c_{jm} = \frac{\exp(\tilde{c}_{jm})}{\sum_{m'=1}^{M} \exp(\tilde{c}_{jm'})}$   (34)

$\tilde{\sigma}_{jm} = \log \sigma_{jm}$   (35)

Hence, the partial derivatives needed in (30) can be obtained by inserting (33) into (30):

$\frac{\partial \log b_j(y_t)}{\partial \mu_{jm}} = \gamma_{jm}(t)\, \frac{y_t - \mu_{jm}}{\sigma_{jm}^2}$   (36)

$\frac{\partial \log b_j(y_t)}{\partial \tilde{\sigma}_{jm}} = \gamma_{jm}(t) \left[ \left( \frac{y_t - \mu_{jm}}{\sigma_{jm}} \right)^2 - 1 \right]$   (37)

$\frac{\partial \log b_j(y_t)}{\partial \tilde{c}_{jm}} = \gamma_{jm}(t) - c_{jm}$   (38)

134 IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, VOL. 8, NO. 2, MARCH 2000

where

$\gamma_{jm}(t) = \frac{c_{jm}\, \mathcal{N}(y_t;\, \mu_{jm}, \Sigma_{jm})}{b_j(y_t)}$   (39)

is the posterior probability of mixture component $m$ of state $j$ given the observation $y_t$.

The form of the alternate hypothesis densities was motivated in Section II-D2. Following the example of (17), where the alternate hypothesis distribution was expressed in terms of both a unit specific imposter distribution and a more general background distribution, we derive update equations for the linear weights that appear in

$b^{a}_{j}(y) = w_1\, b^{b}_{j}(y) + w_2\, b^{i}_{j}(y)$   (40)

Without loss of generality, both of the weights in (40), $w_1$ and $w_2$, could be state dependent parameters. To compute the gradient of the observation probabilities with respect to the weights, they must be transformed to satisfy the stochastic constraint $w_1 + w_2 = 1$, $w_1, w_2 \ge 0$. The transformations are

$w_1 = \frac{\exp(\tilde{w}_1)}{\exp(\tilde{w}_1) + \exp(\tilde{w}_2)}$   (41)

$w_2 = \frac{\exp(\tilde{w}_2)}{\exp(\tilde{w}_1) + \exp(\tilde{w}_2)}$   (42)

The gradient of $\log b^{a}_{j}(y)$ can then be obtained with respect to the transformed weights as

$\frac{\partial \log b^{a}_{j}(y)}{\partial \tilde{w}_1} = w_1 \left( \frac{b^{b}_{j}(y)}{b^{a}_{j}(y)} - 1 \right)$   (43)

$\frac{\partial \log b^{a}_{j}(y)}{\partial \tilde{w}_2} = w_2 \left( \frac{b^{i}_{j}(y)}{b^{a}_{j}(y)} - 1 \right)$   (44)
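The weight-gradient expressions in (43) and (44) can be checked numerically: under the softmax transformation, the analytic gradient of the log alternate density with respect to a transformed weight should match a central finite difference. All numeric values below are arbitrary test inputs.

```python
import math

def softmax2(w1t, w2t):
    """Transformed weights of (41)-(42): a two-way softmax enforcing the
    stochastic constraint w1 + w2 = 1."""
    z = math.exp(w1t) + math.exp(w2t)
    return math.exp(w1t) / z, math.exp(w2t) / z

def log_b_alt(w1t, w2t, p_bg, p_imp):
    """Log of the combined alternate density of (40)."""
    w1, w2 = softmax2(w1t, w2t)
    return math.log(w1 * p_bg + w2 * p_imp)

# Analytic gradient of (43), d log b_alt / d w1~ = w1 * (p_bg / p_alt - 1),
# compared against a central finite difference in the transformed weight.
w1t, w2t, p_bg, p_imp = 0.3, -0.4, 0.02, 0.005
w1, w2 = softmax2(w1t, w2t)
p_alt = w1 * p_bg + w2 * p_imp
analytic = w1 * (p_bg / p_alt - 1.0)
eps = 1e-6
numeric = (log_b_alt(w1t + eps, w2t, p_bg, p_imp)
           - log_b_alt(w1t - eps, w2t, p_bg, p_imp)) / (2 * eps)
```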

C. Training Procedure

The training procedure involves successive updates to the target and alternative hypothesis models, $\lambda^{c}$ and $\lambda^{a}$, as given in (28), based on the estimation of the gradient of the expected cost. This subsection describes how this gradient update procedure is initialized by obtaining initial ML estimates of the model parameters. It also describes how the decoder is invoked at each iteration of the procedure to provide segmentation information for the computation of the underlying cost functions.

Initial ML estimates of the target hypothesis HMM models, $\lambda^{c}$, are obtained using the segmental k-means training algorithm. The procedure that is followed here for obtaining the initial estimates of the alternative hypothesis model, $\lambda^{a}$, is described as follows. First, ML training of the background model, $\lambda^{b}$, is performed using all of the utterances in the training set. Second, hypothesized segments along with their segmentation boundaries are obtained using the modified Viterbi algorithm given in (16), with $\lambda^{b}$ serving as the initial alternative hypothesis HMM model. Finally, initial ML estimates of the imposter models, $\lambda^{i}$, are trained from the decoded sub-word units that were labeled as being insertions and substitutions in the training data.

The complete likelihood ratio based training procedure can be outlined as follows.

1) Train initial maximum likelihood HMMs, $\lambda^{c}$ and $\lambda^{a}$, for each unit.
2) For each iteration over the training database:
   • For each utterance:
     — Obtain the hypothesized sub-word unit string and segmentation using the LR decoder.
     — Align the decoded unit string with the correct string.
     — Label each decoded sub-word unit as correct or false alarm, to obtain the indicator function in (31).
     — Update the gradient of the expected cost.
   • Update the model parameters as in (28).

Of course, the motivation for using the LR decoder in the training procedure is that the same optimization criterion is used in both parameter reestimation and decoding. Anecdotal experiments have suggested that better performance is achieved using the LR decoder in training than using an ML decoder, provided that a reasonable model initialization strategy is used. It is very difficult, however, to make stronger statements beyond this empirical evidence as to the "optimality" of using the LR decoder in training.
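The outlined loop can be sketched as pure control flow; every callable below (decoder, aligner, labeler, gradient, update) is a placeholder for machinery described elsewhere in the paper, and nothing here reflects actual model internals.

```python
def lr_training_epoch(utterances, models, decoder, align, label, grad, update,
                      learning_rate=1e-3):
    """One iteration of the outlined LR training loop (control flow only).
    decoder : returns (decoded unit string, segmentation) via the LR decoder
    align   : aligns the decoded unit string with the reference string
    label   : maps an aligned unit to an indicator (+1 false alarm, -1 hit)
    grad    : accumulates the cost gradient contribution for one unit
    update  : applies the gradient step of (28) to the model parameters"""
    total_grad = 0.0
    for utt in utterances:
        units, segmentation = decoder(utt, models)
        for unit, seg in zip(align(units, utt.reference), segmentation):
            total_grad += grad(unit, seg, label(unit), models)
    return update(models, total_grad, learning_rate)
```

Structuring the epoch this way keeps the decode-align-label-accumulate ordering of the outline explicit while leaving every model-specific computation behind an interface.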

IV. SPEECH CORPORA: MOVIE LOCATOR TASK

The decoding and training procedures that were described in Sections II and III were evaluated on a limited domain continuous speech recognition task. The task was a trial of an ASR based service which accepted spoken queries from customers over the telephone concerning movies playing at local theaters [21]. This task, referred to as the "movie locator", was interesting for evaluating utterance verification techniques because it contained both a large number of ill-formed utterances and many out-of-vocabulary words. This section briefly describes the movie locator task and the evaluation speech corpus that was derived from it.

The following carrier phrases describe the general classes of queries that are accepted. The italicized words represent semantic class labels which can take anywhere from several dozen to several hundred values that can change over time. The system was also configured to accept queries that contain additional qualifiers such as time restrictions, e.g., "this afternoon", and spontaneous speech, e.g., "um I would like to know". The example carrier phrases are as follows.

1) What's playing at the theater_name?
2) What's playing near city?
3) Where is movie_title playing near city?
4) When is movie_title playing at the theater_name?
5) When is movie_title playing near city?
6) What movie_category are playing at the theater_name?
7) What movie_category are playing near city?
8) Where is theater_name located?
9) Where is movie_title playing at a chain_name near city?

An example of a complex query accepted by the service is "Yes, I would like to find out what movie is being shown at, um, the Lake Theater in Oak Park, Illinois."


In a trial of the system over the public switched telephone network, the service was configured to accept approximately 105 theater names, 135 city names, and between 75 and 100 current movie titles. A corpus of 4777 spontaneous spoken utterances from the trial was used in our evaluation. A total of 3025 sentences were used for training acoustic models and 752 utterances were used for testing. The sub-word models used in the recognizer consisted of 43 context independent units. Recognition was performed using a finite state grammar built from the specification of the service, with a lexicon of 570 different words. The total number of words in the test set was 4864, 134 of which were out-of-vocabulary. There were 275 sentences in the test set that could not be parsed by the finite state grammar. Of these 275 sentences, 85 contained out-of-vocabulary words and the rest were not explained by the grammar. Recognition performance of 94.0% word accuracy was obtained on the "in-grammar" utterances. While the results described in Section V are presented primarily in terms of word verification performance, it is clear from the data that the ability to reject invalid queries is also important. The feature set used for recognition included 12 mel-cepstrum, 12 delta mel-cepstrum, 12 delta-delta mel-cepstrum, energy, delta energy, and delta-delta energy coefficients. Cepstral mean normalization was applied to the cepstral coefficients to compensate for linear channel distortions.

V. EXPERIMENTAL RESULTS

An experimental study was performed to evaluate the effectiveness of both the likelihood ratio decoding procedure described in Section II and the likelihood ratio based training procedure described in Section III. This section describes the results obtained from this study, and is composed of three parts. First, several methods for computing word level confidence measures from local likelihood ratio scores are compared. A reasonable definition of this confidence measure is important for both the one-pass and two-pass methods used here for utterance verification. Second, the utterance verification performance of the one-pass and two-pass utterance verification paradigms is compared. We have not attempted to make direct comparisons between systems that were implemented using ML based decoding and LR based decoding without also considering model training scenarios. Comparisons are made, however, between the UV performance for the one-pass procedure and that of the two-pass procedure when HMM models are trained using both ML and LR based training. Finally, the effect of the likelihood ratio training procedure on utterance verification performance is investigated.

The experimental evaluation of all decoding procedures, training algorithms, and utterance verification techniques is performed using the speech corpus derived from the movie locator task described in Section IV. For all the experiments, the speech recognizer is configured using forty-three context independent HMM models. Each model contains three states with continuous observation densities and a maximum of 32 Gaussian mixtures per state. Tri-phone based HMM units were also evaluated on this task, but provided only a 16% reduction in word error rate. This was thought to be because of the relatively small number of utterances available for training.

Fig. 5. Word detection operating curves using a geometric mean combination (solid line) and an arithmetic mean (dashed line).

These recognition models correspond to the "target" hypothesis models discussed in the context of utterance verification in Section II-C. It is important to note that all of the systems implemented according to the two-pass UV scenario described in Fig. 1 use this exact set of recognition HMM models for the first pass CSR decoder. LR trained target and alternative hypothesis HMM models are only used in the second UV stage of the two-pass procedure.

The alternate hypothesis models are configured according to the definition given in (17). A single "background" HMM alternate hypothesis model, $\lambda^{b}$, containing three states with 32 mixtures per state was used. The background model parameters were trained using the ML segmental k-means algorithm from the entire training corpus. A separate "imposter" alternate hypothesis HMM model, $\lambda^{i}_{u}$, was trained for each sub-word unit. These models contained three states with eight mixtures per state and were initially trained using the ML segmental k-means algorithm from insertions and substitutions decoded from the training corpus. Subsequent training of all model parameters using the likelihood ratio training procedure is discussed below.

A. Comparison of UV Measures

Word level utterance verification performance is compared for different definitions of the confidence measures given in (20)–(23). All of the results described in this subsection are based on the two-pass utterance verification paradigm where the hypothesized word strings are passed from the speech recognizer to the utterance verification system. Performance is described both in terms of receiver operating characteristic (ROC) curves and curves displaying type I + type II error plotted against the decision threshold settings. The motivation for these performance criteria was given in Section II-A.

Fig. 5 shows the ROC curves obtained when the word level confidence score is formed using an arithmetic average, $W_A(w)$, and a geometric average, $W_G(w)$, of the unweighted sub-word confidence scores $LR(u_j)$. Since the geometric mean tends to emphasize the smaller valued sub-word level scores, the UV performance is better than that obtained using the arithmetic mean.


Fig. 6. Word detection based receiver operating curves and type I + type II error curves comparing performance of confidence measures using the arithmetic mean, $\overline{W}_A(w)$ (dashed line), and the geometric mean, $\overline{W}_G(w)$ (solid line), of the sigmoid weighted sub-word level confidence measure.

Fig. 6 shows both the ROC curves and the type I + type II error curves when a sigmoidal weighting function is applied to the sub-word level confidence scores, and the word level confidence scores are obtained from the sub-word level scores using the arithmetic mean, $\overline{W}_A(w)$, or the geometric mean, $\overline{W}_G(w)$. In both cases, the sigmoid weighting in (19) is parameterized using a single setting of the slope $\alpha$ and shift $\beta$. It is clear from Fig. 6 that when the sigmoidal weighting is applied, the ROC curves are similar for both the arithmetic average and the geometric average. This suggests that the sigmoid performs the same function as the geometric averaging in limiting the effects of extreme values in the likelihood ratio scores. However, it appears from the error plot in Fig. 6 that the confidence measure based on the geometric mean is less sensitive to the setting of the confidence threshold. In the remaining simulations described in this section, the geometric mean of the sigmoid weighted sub-word level confidence measures, $\overline{W}_G(w)$, will be used.

B. Comparison of One-Pass Decoder with Well-Known UV Procedures

This section compares the one-pass utterance verification and decoding procedure to other more well known procedures. The one-pass UV is compared to an often used two-pass UV procedure where word level confidence measures are computed using a dedicated network of phonemes to obtain the alternate hypothesis likelihood.

Fig. 7 provides a comparison of the two-pass utterance verification performance for several different definitions of the alternate hypothesis model $\lambda^{a}$. As a point of reference, the first example corresponds to the case where no $\lambda^{a}$ is used, but instead the simple duration normalized likelihood score is used:

$W(w) = P(Y_{t_b}^{t_e} \mid \lambda_{w})^{1/(t_e - t_b + 1)}$   (46)

where $Y_{t_b}^{t_e}$ corresponds to the sequence of observation vectors beginning at frame $t_b$ and ending at frame $t_e$ that were decoded for word $w$. This is displayed as the dash-dot curve in Fig. 7. The second example corresponds to a "phone-net" likelihood ratio score, which can be defined as

$W(w) = \left[ \frac{P(Y_{t_b}^{t_e} \mid \lambda_{w})}{P(Y_{t_b}^{t_e} \mid \lambda_{pn})} \right]^{1/(t_e - t_b + 1)}$   (47)

Fig. 7. Comparison among three UV procedures: one-pass UV (solid line), phone-net (dashed line), and duration normalized likelihood (dash-dot line).

where $\lambda_{pn}$ corresponds to an unconstrained network of the 43 context independent phones that are in the network. This is displayed as the dashed curve in the figure. Finally, the performance of the one-pass utterance verification using target/alternative models is given as the solid line in Fig. 7. The target and alternate hypothesis models used by the one-pass UV are updated by applying two iterations of the likelihood ratio training procedure described in Section III-C.

The likelihood ratio decoding algorithm in (16) is used for obtaining the optimum word string, and the confidence measure $\overline{W}_G(w)$ is computed directly from the scores obtained from the decoder.

This section has two goals. The first is to investigate the effects of the likelihood ratio training procedure on UV performance. The second goal is to investigate the relative performance of the one-pass UV procedure with respect to the two-pass procedure when identical definitions of the target and alternative hypothesis models are used for both procedures. While the one-pass UV and decoding procedure has clear advantages over two-pass procedures in terms of simplicity of control structures and system delay, it still remains to show how the performance of a one-pass UV system compares to that of a similarly configured two-pass system. In preliminary


TABLE I
UTTERANCE VERIFICATION PERFORMANCE: type I + type II MINIMUM ERROR RATES FOR THE ONE-PASS (OP) AND THE TWO-PASS (TP) UTTERANCE VERIFICATION PROCEDURES. b# DENOTES THE NUMBER OF MIXTURES FOR THE BACKGROUND MODEL AND i# THE NUMBER OF MIXTURES FOR THE IMPOSTER MODEL

comparisons, it was determined that the relative performance of the two procedures was heavily dependent on the criterion used for training the alternate hypothesis HMM models. This observation is made more precise by the experimental comparison whose results are summarized in Table I.

It is clear from Fig. 7 that using the absolute likelihoods gives clearly inferior performance relative to likelihood ratio based confidence measures. This was already well known. The phone-net based utterance verification procedure performs reasonably well but suffers from the computational expense of running an unconstrained network of phones. It is interesting to note that, in terms of utterance verification performance, recognition performance, and computational efficiency, the combination of likelihood ratio training and target/alternative HMM models with the one-pass UV decoding strategy out-performs a fairly commonly used procedure for UV.

C. Investigation of LR Training and UV Strategies

The entries in Table I correspond to the minimum type I + type II error when computed over all possible settings of the decision threshold. An example of this type of plot is given in Fig. 6, and the use of a single minimum value in Table I is dictated by the need to summarize overall performance so that cross-system performance comparisons can be made. The minimum type I + type II error rates are displayed for the one-pass (OP) and the two-pass (TP) strategies. The first row in Table I corresponds to use of the initial maximum likelihood trained HMMs, and the second and third rows correspond to models trained using one and two iterations of the likelihood ratio training procedure, respectively. The three major columns are each labeled according to the number of component densities composing the background alternate hypothesis model and the imposter alternate hypothesis model in (17). For example, the column labeled b32.i4 corresponds to a background alternate hypothesis model with 32 component Gaussians and an imposter model with four component Gaussians. It is also important to note that all of the results for the two-pass procedure are based on a system which uses the original "recognition HMM models" as indicated in Fig. 1 and described in Section IV for the ML based CSR decoder used in the first pass.

There are three points that can be made from the results displayed in Table I. The first point is that likelihood ratio training is very important for obtaining good performance from the one-pass likelihood ratio decoder. This is evident from the significant performance improvements listed under "OP" in the table in going from maximum likelihood training to likelihood ratio training for all parameterizations of the alternate hypothesis model. It is clear that the different parameterizations for the alternative hypothesis models yield a wide range of performance after the initial ML training. However, in all of the three cases shown, the LR training of the UV models results in UV performance that is not only significantly improved but also remarkably consistent after only two iterations of LR training are applied. The importance of the LR training in the one-pass decoder is also demonstrated by the ROC curves in Fig. 8. Separate curves are plotted to describe the UV performance for the one-pass decoder using zero, one, and two iterations of the LR training procedure. It is clear from Fig. 8 that there is a significant increase in performance over a wide range of operating points.

Fig. 8. Likelihood ratio training: ROC curves for initial models (dash-dot line), one iteration (dashed line), and two iterations (solid line). The *-points mark the minimum type I + type II error.

The second point that can be made is the dependence of UV performance on the parameterization of the background alternative hypothesis model. The purpose of the background model, as mentioned in Section II-D2, is to provide a broad representation of the feature space, reducing the dynamic range of the LR score. For the initial ML UV models, reducing the number of mixture densities from 32 to 16 results in a severe degradation in UV performance. Anecdotal observations suggest that this is a result of the increased dynamic range caused by the reduced parameterization of the background model. The LR training directly compensates for the effects of these artifacts that are associated with the LR, as is shown in Table I.
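The normalizing role of a composite alternate hypothesis can be illustrated with a small sketch. This is not the paper's Eq. (17) itself; the function name, the single mixing weight `w`, and the two-component mixture are illustrative assumptions standing in for the background and imposter models described above.

```python
import numpy as np

def log_likelihood_ratio(ll_target, ll_background, ll_imposter, w=0.5):
    """Per-frame log likelihood ratio against a composite alternate
    hypothesis formed from a broad background model and an imposter
    model (illustrative stand-in for the paper's formulation).

    ll_*: arrays of per-frame log-likelihoods under the target HMM
    state, the background model, and the imposter model.  Because the
    background model covers the whole feature space, its likelihood
    tracks the target's, which compresses the dynamic range of the
    resulting ratio.
    """
    ll_target = np.asarray(ll_target, dtype=float)
    # Weighted mixture of the two alternate models, combined stably
    # in the log domain via log-sum-exp.
    alt = np.logaddexp(np.log(w) + np.asarray(ll_background, dtype=float),
                       np.log(1.0 - w) + np.asarray(ll_imposter, dtype=float))
    return ll_target - alt
```

When the background model is weakened (fewer mixture components), its log-likelihood drops on frames it no longer covers well, inflating the ratio's dynamic range; this is the artifact that the LR training compensates for in the results above.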

The third point that can be made from Table I is that the one-pass UV procedure consistently outperforms the two-pass procedure when likelihood ratio training is used. A consistent level of performance is maintained over the range of model parameterizations that are displayed in Table I. The ROC curves in Fig. 9 also provide a comparison of performance for the one-pass and two-pass UV procedures. The curves are plotted for a single model parameterization where 32 mixtures per state are used for the target hypothesis models and eight mixtures per state are used for the alternate hypothesis models. For this model parameterization, an improvement of approximately 7% in word detection performance is obtained for false alarm rates greater than 5%.


TABLE II
SPEECH RECOGNITION PERFORMANCE GIVEN IN TERMS OF WORD ACCURACY WITHOUT USING UTTERANCE VERIFICATION, AND UTTERANCE VERIFICATION PERFORMANCE GIVEN AS THE SUM OF TYPE I AND TYPE II ERROR

Fig. 9. One-pass versus two-pass UV comparison with the b32.i8 configuration and two iterations of the likelihood ratio training.

VI. DISCUSSION

There are many questions that still remain concerning the effect of the UV related training and decoding. We briefly address two of these issues. First, while the LR training procedures were compared in Section V in terms of utterance verification performance, it was not clear whether or not the LR training procedures actually improved speech recognition performance. Second, the performance in Section V was reported for a single test set which included both in-grammar and out-of-grammar utterances. It was not clear how the various techniques would perform when the in-grammar and out-of-grammar utterances were considered separately.

In Table II, both speech recognition and UV performance are given for the test set that includes both in-vocabulary and out-of-vocabulary utterances. The first column of the table displays the word accuracy when no utterance verification is used, and the second column displays the sum of the type I and type II error which is used to describe UV performance. The first row of Table II gives the performance with no likelihood ratio training, and the second row displays performance after two iterations of likelihood ratio training. First, it is interesting to compare the word accuracy of 84.7% obtained for the combined in-vocabulary and out-of-vocabulary utterances with the 94.0% word accuracy reported in Section IV for the in-vocabulary utterances alone. Second, it is interesting to note from Table II that the word accuracy improves significantly after two iterations of LR training, although not to the same degree as the UV performance.

In Figs. 10 and 11, families of ROC curves are used to describe UV performance as measured over in-grammar and out-of-grammar utterances, respectively. In each of the figures,

Fig. 10. In-grammar sentences. Initial models: dash-dot line; one iteration: dashed line; two iterations: solid line.

Fig. 11. Out-of-grammar sentences. Initial models: dash-dot line; one iteration: dashed line; two iterations: solid line.

separate curves are used for systems configured using zero, one, and two iterations of LR training. It is clear from Figs. 10 and 11 that performance degrades dramatically for out-of-grammar utterances. We believe that the problem with out-of-grammar utterances is particularly severe in this task because of the relatively low perplexity of the language model defined for this task. It is also clear that the effect of the LR training procedure is much more pronounced for the in-grammar utterances.

VII. SUMMARY AND CONCLUSIONS

We have presented novel decoding and training procedures for HMM utterance verification and recognition based on
the optimization of a likelihood ratio based criterion. The decoding algorithm allows word hypothesis detection and verification to be performed simultaneously in a "one-pass" procedure by defining the optimum state sequence to be that which maximizes a likelihood ratio, as opposed to a maximum likelihood criterion. Associated with this decoding algorithm, a training procedure has been presented which optimizes the same likelihood ratio criterion that is used in the decoder. As a result, both training and recognition are performed under the same optimization criterion as is used for hypothesis testing. While a very general search algorithm has been presented, a more efficient algorithm taking the form of a modified Viterbi decoder is derived as a special case of this general procedure. Finally, a word confidence measure based on the same optimization function used in training has been proposed.
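The core change in the modified Viterbi decoder can be conveyed with a rough sketch: the per-frame emission term is a log likelihood ratio (target versus alternate model) rather than a raw log likelihood, so the best path maximizes the accumulated ratio. This is only an illustration under our own simplified assumptions; the actual one-pass decoder additionally handles word entry/exit and the alternate-model topology, which are omitted here.

```python
import numpy as np

def viterbi_llr(llr, log_trans, log_init):
    """Viterbi search where the emission score at each frame is a
    precomputed log likelihood RATIO, so the returned path maximizes
    the accumulated LLR rather than the accumulated log likelihood.

    llr:       (T, S) per-frame log likelihood ratios for S states
    log_trans: (S, S) log transition probabilities (from -> to)
    log_init:  (S,)   log initial state probabilities
    Returns (best state sequence, accumulated LLR score of that path).
    """
    T, S = llr.shape
    delta = log_init + llr[0]            # best accumulated score per state
    psi = np.zeros((T, S), dtype=int)    # backpointers
    for t in range(1, T):
        cand = delta[:, None] + log_trans        # (S, S): from i to j
        psi[t] = np.argmax(cand, axis=0)         # best predecessor per state
        delta = cand[psi[t], np.arange(S)] + llr[t]
    # Backtrack from the best final state.
    path = [int(np.argmax(delta))]
    for t in range(T - 1, 0, -1):
        path.append(int(psi[t, path[-1]]))
    return path[::-1], float(np.max(delta))
```

Because only the emission term changes, the dynamic-programming recursion and its cost are identical to standard Viterbi decoding; what changes is which path is declared optimal.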

Experimental studies were performed using utterances collected from a trial of a limited domain ASR based service which accepted continuous utterances from customers as queries over the public switched telephone network. It was observed on this task that the likelihood ratio training procedure could improve utterance verification performance by an amount ranging from 11% to 19%, depending on model parameterization. It was also found that, when combined with likelihood ratio training, the one-pass decoding procedure improved UV performance over the more traditional two-pass approach by as much as 11%. More recently, likelihood ratio training and decoding has also been successfully applied to other tasks, including speaker dependent voice label recognition [15]. Further research should involve the investigation of decoding and training paradigms for utterance verification that incorporate additional, nonacoustic sources of knowledge.

REFERENCES

[1] J. M. Boite, H. Bourlard, B. D'hoore, and M. Haesen, "A new approach to keyword spotting," in Proc. Eur. Conf. Speech Communications, Sept. 1993.

[2] J. S. Bridle, "Alpha-nets: A recurrent neural network architecture with a hidden Markov model interpretation," Speech Commun., vol. 9, no. 1, pp. 83–92, 1990.

[3] J. K. Chan and F. K. Soong, "An N-best candidates-based discriminative training for speech recognition applications," IEEE Trans. Speech Audio Processing, vol. 2, pp. 206–216, Jan. 1994.

[4] W. Chou, B. H. Juang, and C. H. Lee, "Segmental GPD training of HMM based speech recognizer," in Proc. Int. Conf. Acoust., Speech, Signal Processing, Apr. 1992, pp. 473–476.

[5] S. Cox and R. C. Rose, "Confidence measures for the Switchboard database," in Proc. Int. Conf. Acoust., Speech, Signal Processing, May 1996.

[6] K. Fukunaga, Introduction to Statistical Pattern Recognition. New York: Academic, 1990.

[7] B. H. Juang and S. Katagiri, "Discriminative learning for minimum error classification," IEEE Trans. Signal Processing, vol. 40, pp. 3043–3054, Dec. 1992.

[8] E. Lleida and R. C. Rose, "Efficient decoding and training procedures for utterance verification in continuous speech recognition," in Proc. Int. Conf. Acoust., Speech, Signal Processing, May 1996.

[9] ——, "Likelihood ratio decoding and confidence measures for continuous speech recognition," in Proc. Int. Conf. Spoken Language Processing, Oct. 1996.

[10] L. R. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proc. IEEE, vol. 77, pp. 257–286, 1989.

[11] M. G. Rahim, C. H. Lee, and B. H. Juang, "Robust utterance verification for connected digits recognition," in Proc. Int. Conf. Acoust., Speech, Signal Processing, May 1995, pp. 285–288.

[12] M. Rahim, C. Lee, B. Juang, and W. Chou, "Discriminative utterance verification using minimum string verification error training," in Proc. Int. Conf. Acoust., Speech, Signal Processing, May 1996.

[13] R. C. Rose, "Discriminant word spotting techniques for rejecting non-vocabulary utterances in unconstrained speech," in Proc. Int. Conf. Acoust., Speech, Signal Processing, Mar. 1992.

[14] R. C. Rose, B. H. Juang, and C. H. Lee, "A training procedure for verifying string hypotheses in continuous speech recognition," in Proc. Int. Conf. Acoust., Speech, Signal Processing, Apr. 1995, pp. 281–284.

[15] R. C. Rose, E. Lleida, G. W. Erhart, and R. V. Grubbe, "A user-configurable system for voice label recognition," in Proc. Int. Conf. Spoken Language Processing, Oct. 1996.

[16] R. A. Sukkar and J. G. Wilpon, "A two pass classifier for utterance rejection in keyword spotting," in Proc. Int. Conf. Acoust., Speech, Signal Processing, Apr. 1993.

[17] M. Rahim, C. Lee, B. Juang, and W. Chou, "Utterance verification of keyword strings using word-based minimum verification error training," in Proc. Int. Conf. Acoust., Speech, Signal Processing, May 1996, pp. 518–521.

[18] C. Torre and A. Acero, "Discriminative training of garbage model for non-vocabulary utterance rejection," in Proc. Int. Conf. Spoken Language Processing, June 1994.

[19] J. G. Wilpon, L. R. Rabiner, C. H. Lee, and E. R. Goldman, "Automatic recognition of keywords in unconstrained speech using hidden Markov models," IEEE Trans. Acoust., Speech, Signal Processing, vol. 38, pp. 1870–1878, Nov. 1990.

[20] S. R. Young and W. H. Ward, "Recognition confidence measures for spontaneous spoken dialog," in Proc. Eur. Conf. Speech Communications, Sept. 1993, pp. 1177–1179.

[21] J. J. Wisowaty, "Continuous speech interface for a movie locator service," in Proc. Human Factors Ergon. Soc., 1995.

Eduardo Lleida (M'99) was born in Spain in 1961. He received the M.Sc. degree in telecommunication engineering and the Ph.D. degree in signal processing from the Universitat Politècnica de Catalunya (UPC), Spain, in 1985 and 1990, respectively.

From 1989 to 1990, he was an Assistant Professor, and from 1991 to 1993, he was an Associate Professor with the Department of Signal Theory and Communications, UPC. From February 1995 to January 1996, he was a Consultant in Speech Recognition with AT&T Bell Laboratories, Murray Hill, NJ. Currently, he is an Associate Professor of signal theory and communications with the Department of Electronic Engineering and Communications, Centro Politécnico Superior, Universidad de Zaragoza, Spain. His current research interests are in signal processing, particularly as applied to speech enhancement, speech recognition, and keyword spotting.

Richard C. Rose (S'84–M'88) received the B.S. and M.S. degrees in electrical engineering from the University of Illinois, Urbana, in 1979 and 1981, respectively, and the Ph.D. degree from the Georgia Institute of Technology, Atlanta, in 1988, completing his dissertation work in speech coding and analysis.

From 1980 to 1984, he was with Bell Laboratories, Murray Hill, NJ, where he worked on signal processing and digital switching systems. From 1988 to 1992, he was a member of the Speech Systems and Technology Group at Lincoln Laboratory, Massachusetts Institute of Technology, Cambridge, working on speech recognition and speaker recognition. He is presently a Principal Member of Technical Staff, Speech and Image Processing Services Laboratory, AT&T Laboratories—Research, Florham Park, NJ.

Dr. Rose served as a member of the IEEE Signal Processing Technical Committee on Digital Signal Processing from 1990 to 1995, and has served as an Adjunct Faculty Member of the Georgia Institute of Technology. He was elected as an At Large Member of the Board of Governors of the Signal Processing Society for the period from 1995 to 1998, and served as membership coordinator during that time. He is currently serving as an Associate Editor for the IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, and is also currently a member of the IEEE Signal Processing Technical Committee on Speech. He is a member of Tau Beta Pi, Eta Kappa Nu, and Phi Kappa Phi.

