
Improvement of noisy speech recognition using a proportional alignment decoding algorithm in the training phase

Wei-Wen Hung

Department of Electrical Engineering Ming Chi Institute of Technology Taishan, Taiwan, 243 ROC

E-mail : [email protected] FAX : 886-02-2903-6852 Tel. : 886-02-2906-0379

and

Hsiao-Chuan Wang

Department of Electrical Engineering National Tsing Hua University

Hsinchu, Taiwan, 30043 ROC E-mail : [email protected]

FAX : 886-03-571-5971 Tel. : 886-03-574-2587

Corresponding author: Hsiao-Chuan Wang


Improvement of noisy speech recognition using a proportional alignment decoding algorithm in the training phase

Wei-Wen Hung and Hsiao-Chuan Wang

Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan, 30043, ROC

Abstract

Modeling the state duration of HMMs can effectively improve the accuracy in decoding the state sequence

of an utterance and result in an improvement of speech recognition accuracy. However, when a speech signal

is contaminated by ambient noise, the decoded state sequence may be distorted: it may stay in some states

too long or too short even with the help of state duration models. This paper presents a proportional

alignment decoding (PAD) algorithm for re-training the hidden Markov models (HMMs). A task of

multi-speaker isolated Mandarin digit recognition was conducted to demonstrate the effectiveness and

robustness of the PAD-based variable duration hidden Markov model (VDHMM/PAD) method.

Experimental results show that the discriminativity of the VDHMM/PAD in noisy environments is

significantly enhanced. Moreover, the proposed method outperforms widely used state duration

modeling methods, such as those using Poisson, gamma, Gaussian, bounded and non-parametric probability density

functions.

This research has been partially sponsored by the National Science Council, Taiwan,

ROC, under contract number NSC-85-2221-E-007-005.


1. Introduction

Hidden Markov model (HMM) is a well-known and widely used statistical approach to speech

recognition. This method provides a powerful framework for modeling the time-varying speech signals.

One of the advantages of the HMM is that it enables us to characterize speech signals well as a parametric

stochastic process, and the parameters of this stochastic process can be optimized by the

expectation-maximization (EM) algorithm. In addition, the quality of an HMM can also be significantly

improved by incorporating the information of state duration (Rabiner, 1989). In a conventional hidden

Markov model, the probability of staying in state $i$ for $d$ frames is modeled by

$p_i(d) = (a_{ii})^{d-1} \cdot (1 - a_{ii})$, where $a_{ii}$ is the state transition probability from state $i$ to itself and

$(1 - a_{ii})$ from state $i$ to other states. This inherent temporal characteristic implies that the state duration

in a conventional HMM is exponentially distributed. It does not adequately model the temporal structures

of different acoustic regions in a speech signal (Juang et al., 1985, Rabiner et al., 1985 & Rabiner et al.,

1988). In order to cope with this deficiency, some modeling methods for state duration and word

duration have been proposed.
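As a quick numerical illustration of this implicit duration model, the geometric probability $p_i(d) = (a_{ii})^{d-1}(1-a_{ii})$ can be tabulated directly; the self-transition value below is an arbitrary example, not taken from the paper.

```python
import numpy as np

a_ii = 0.8                            # example self-transition probability
d = np.arange(1, 11)                  # duration lengths of 1..10 frames
p = a_ii ** (d - 1) * (1 - a_ii)      # implicit duration pdf of a conventional HMM state
print(np.round(p, 4))                 # probabilities decay exponentially with d
```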

A. Bonafonte et al., (Bonafonte et al., 1996) used a Markov chain to model the occupancy of the

HMM states, and the parameters of the Markov chain were estimated directly from the duration data. To

reduce the insertion error rate in connected digit recognition, K. Power proposed an expanded-state

duration model (Power, 1996). In this approach, each individual state was expanded by multiple

sub-states, each sharing the original state observation probability density function (pdf). Moreover, K.

Laurila noticed that duration constraints applied only in the recognition phase are quite loose and not

effective enough. Therefore, a state duration constrained maximum likelihood (SDML) training scheme

(Laurila, 1997) was presented to gradually tighten the duration constraints in a hidden Markov model.

Duration modeling techniques are not only applied at the state level, but can also be extended to the word level.


David Burshtein (Burshtein, 1995) used explicit models of state and word durations to reduce the string

error rate for connected digit recognition task. In general, no matter what kind of duration modeling

mechanism is employed, the probability density function for modeling state duration distributions can be

roughly classified into two categories (Gu et al., 1991), non-parametric and parametric methods. For the

non-parametric method, the distribution of state duration is directly estimated from the training data. Thus,

we can obtain a more accurate duration distribution for each state in a word model. However, this

approach needs a large amount of training utterances in order to reach a desired degree of accuracy.

Moreover, it also requires a considerable amount of memory space for the storage of all the duration

distributions. On the other hand, for the parametric method, some specific probability density functions,

such as Poisson (Russell et al., 1985 & Russell et al., 1987), gamma (Levinson, 1986 & Burshtein,

1995), Gaussian (Rabiner, 1989 & Burshtein, 1995) and bounded density functions (Gu et al., 1991,

Kim et al., 1994, Vaseghi, 1995, Power, 1996 & Laurila, 1997) were used to model the state duration

distributions explicitly, so that only a few parameters are required to completely specify each distribution.

However, there are some drawbacks in the use of the parametric approach. One is that

the assumed probability density function may not always fit the real duration distribution of each state in

a hidden Markov model.

Regardless of ambient noise, most research on modeling duration distributions has dealt with the

minimization of recognition errors that are mainly attributed to unrealistic modeling of duration

distributions. How to make a duration model more robust to noise contamination is still a problem to be

solved. In this paper, we focus our attention on the robustness of modeling state duration in noisy

environment and neglect the modeling for word duration. This is due to the fact (Burshtein, 1995) that the

state duration modeling is the major contribution to the improvement of recognition rate.

In Section 2, some methods of state duration modeling are reviewed. Then, a series of experiments


is conducted to compare those methods in Section 3, where the behaviors of various duration models

under the influence of noise contamination are also investigated. In Section 4, based on the results

obtained in the previous section, we propose a new method that combines a proportional alignment

decoding (PAD) algorithm with state duration distributions to re-train a conventional hidden Markov

model. This is the so-called variable duration hidden Markov model with PAD, denoted as VDHMM/PAD.

The state duration distributions of the VDHMM/PAD are shown to be more robust than those of other

methods in noisy environment. An experiment of multi-speaker isolated Mandarin digit recognition was

conducted in Section 5 to evaluate the effectiveness and robustness of the proposed method. Finally, a

conclusion is given in Section 6.

2. Overview of state duration modeling methods

When the statistics of state duration is incorporated into both training and recognition phases of a

conventional hidden Markov model, this will result in a variable duration hidden Markov model

(VDHMM) (Levinson, 1986 & Rabiner, 1989). In VDHMM, the likelihood function is defined in terms

of modified forward likelihood and backward likelihood. Let $O = o_1 o_2 \cdots o_T$ be the observation sequence. The modified forward likelihood $\alpha_t(w,j)$ and backward likelihood $\beta_t(w,j)$ are defined as (Levinson, 1986, Rabiner, 1989 & Hung et al., 1997)

$$\alpha_t(w,j) = p(o_1 o_2 \cdots o_t, q_t = j \mid \lambda(w)) = \sum_{d} \sum_{\substack{i=1 \\ i \neq j}}^{S_w} \alpha_{t-d}(w,i) \cdot a_{w,ij} \cdot p_{w,j}(d) \cdot \prod_{\tau=t-d+1}^{t} b_{w,j}(o_\tau) \qquad (1)$$

and

$$\beta_t(w,i) = p(o_{t+1} o_{t+2} \cdots o_T \mid q_t = i, \lambda(w)) = \sum_{\substack{j=1 \\ j \neq i}}^{S_w} \sum_{d} a_{w,ij} \cdot p_{w,j}(d) \cdot \prod_{\tau=t+1}^{t+d} b_{w,j}(o_\tau) \cdot \beta_{t+d}(w,j). \qquad (2)$$

where $\lambda(w)$ denotes the variable duration hidden Markov model for word $w$ with $S_w$ states, $q_t$ the present state at time $t$, $a_{w,ij}$ the state-transition probability from state $i$ to state $j$ of word model $\lambda(w)$, $b_{w,j}(o_t)$ the symbol distribution of $o_t$ in the $j$-th state of word model $\lambda(w)$, and $p_{w,j}(d)$ the $j$-th state duration pdf of word model $\lambda(w)$ with duration length of $d$ frames. Then, given a variable duration hidden Markov model, $\lambda(w)$, the likelihood function of an observation sequence, $O$, can be modeled as

$$p(O \mid \lambda(w)) = \sum_{d=1}^{D(w,j)} \sum_{\substack{j=1 \\ j \neq i}}^{S_w} \sum_{i=1}^{S_w} \alpha_{t-d}(w,i) \cdot a_{w,ij} \cdot p_{w,j}(d) \cdot \prod_{\tau=t-d+1}^{t} b_{w,j}(o_\tau) \cdot \beta_t(w,j), \qquad (3)$$

where Dw j(,) indicates the allowable maximum duration length within the j-th state of word model

λ( )w . Based on above definition, the derivation of re-estimation formulas for the variable duration

HMM is formally identical to those for the conventional HMM (Levinson, 1986 & Rabiner, 1989). For a

left to right variable duration HMM without jumps, the maximum likelihood, $p(O \mid \lambda(w))$, can be effectively calculated by a three-dimensional (time, state, duration) Viterbi decoding algorithm which is derived from the literature proposed by Gu et al. (Gu et al., 1991) and can be summarized as follows:

for $d = 1$,
$$\psi_t(w,j,1) = \max_{\tilde{d}} \left\{ \psi_{t-1}(w, j-1, \tilde{d}) + \log[p_{w,j-1}(\tilde{d})] \right\} + \log[a_{w,(j-1)j}] + \log[b_{w,j}(o_t)], \qquad (4)$$

for $d \geq 2$,
$$\psi_t(w,j,d) = \psi_{t-1}(w,j,d-1) + \log[b_{w,j}(o_t)], \qquad (5)$$

and
$$p(O \mid \lambda(w)) = \max_{d} \left\{ \psi_T(w, S_w, d) + \log[p_{w,S_w}(d)] \right\}, \qquad (6)$$

where $\psi_t(w,j,d)$ represents the maximum likelihood of proceeding from state 1 to state $j-1$ along a state sequence of duration length $(t-d)$ frames and producing the observations $o_1 o_2 \cdots o_{t-d}$, and then staying at state $j$ and producing the observations $o_{t-d+1} \cdots o_{t-1} o_t$ at that state. From the above description, we can see that successful modeling of state duration distributions will improve the performance of an HMM-based speech recognizer.
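To make the recursion of Eqs. (4)-(6) concrete, the following is a minimal Python sketch of a three-dimensional (time, state, duration) Viterbi pass for a strictly left-to-right model. The argument names (`log_b`, `log_a`, `log_p_dur`, `max_dur`) and the assumption that the utterance starts in the first state and ends in the last one are ours, not the paper's.

```python
import numpy as np

def duration_viterbi(log_b, log_a, log_p_dur, max_dur):
    """Sketch of Eqs. (4)-(6).
    log_b[t, j]     : log b_{w,j}(o_t), frame log-likelihoods, shape (T, S)
    log_a[i, j]     : log a_{w,ij}, state-transition log-probabilities
    log_p_dur[j, k] : log p_{w,j}(k+1), state-duration log-probabilities
    max_dur         : largest duration length considered for any state
    Returns the approximate log-likelihood log p(O | lambda(w))."""
    T, S = log_b.shape
    # psi[t, j, k] = best log-score of being in state j at time t after
    # having stayed there exactly k+1 frames.
    psi = np.full((T, S, max_dur), -np.inf)
    psi[0, 0, 0] = log_b[0, 0]                       # start in the first state
    for t in range(1, T):
        for j in range(S):
            # Eq. (5): remain in state j for one more frame.
            psi[t, j, 1:] = psi[t - 1, j, :-1] + log_b[t, j]
            # Eq. (4): enter state j from state j-1, closing its duration.
            if j > 0:
                enter = np.max(psi[t - 1, j - 1, :] + log_p_dur[j - 1, :])
                psi[t, j, 0] = enter + log_a[j - 1, j] + log_b[t, j]
    # Eq. (6): close the duration of the final state at the last frame.
    return np.max(psi[T - 1, S - 1, :] + log_p_dur[S - 1, :])
```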

In general, the modeling methods for state duration can be classified into two categories, i.e.,

non-parametric and parametric modeling methods.

2.1 Non-parametric state duration modeling method

In non-parametric approaches (Juang et al., 1985, Rabiner et al., 1985, Rabiner et al., 1988,

Anastasakos et al., 1995 & Hung et al., 1997), the probabilities, $p_{w,j}(d)$, used for describing state duration distributions are estimated via a direct counting procedure on the training data. Let $d_{w,j,t}$ be the duration of state $j$ in the maximum likelihood state sequence of the $t$-th training utterance for the word model $\lambda(w)$, and $N_w$ be the total number of training utterances of the word $w$. Then, the probabilities, $p_{w,j}(d)$, can be estimated by

$$p_{w,j}(d) = \frac{1}{N_w} \sum_{t=1}^{N_w} \Theta_d(d_{w,j,t}) \quad \text{for } d \geq 1, \qquad (7)$$

where $\Theta_d(d_{w,j,t})$ is a binary characteristic function and defined as

$$\Theta_d(d_{w,j,t}) = \begin{cases} 1, & \text{if } d = d_{w,j,t}, \\ 0, & \text{otherwise.} \end{cases} \qquad (8)$$
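As a small illustration of Eqs. (7) and (8), the histogram-style estimate below counts the decoded durations of a single state over all training utterances of one word. The variable names (`decoded_durations`, `max_dur`) are illustrative placeholders.

```python
import numpy as np

def nonparametric_duration_pdf(decoded_durations, max_dur):
    """Eq. (7): p_{w,j}(d) = (1/N_w) * sum_t Theta_d(d_{w,j,t}).
    decoded_durations : list of d_{w,j,t}, one duration (in frames) per utterance
    max_dur           : largest duration length to tabulate
    Returns an array p with p[d-1] = p_{w,j}(d)."""
    counts = np.zeros(max_dur)
    for dur in decoded_durations:           # Theta_d(.) is 1 only when d matches
        if 1 <= dur <= max_dur:
            counts[dur - 1] += 1
    return counts / len(decoded_durations)  # divide by N_w

# Example: state durations decoded from eight training utterances.
print(nonparametric_duration_pdf([3, 4, 4, 5, 4, 3, 6, 4], max_dur=10))
```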

In this non-parametric approach, the accuracy of duration model depends on the amount of training


data. When the amount of training data is sufficient, this modeling method can well approximate the

temporal characteristic of each state in a hidden Markov model. However, the large number of parameters to

be stored is one of its drawbacks. A non-parametric approach for isolated Mandarin digit recognition

proposed by Hung et al. (Hung et al., 1997) showed that the recognition rates were significantly

improved compared with those of the conventional HMM under the influence of white noise. The

recognition rate improved from 48.8% for the baseline HMM to 62.0% for the non-parametric approach

when the signal was contaminated with white noise at an SNR of 20 dB.

2.2 Parametric state duration modeling methods

In parametric approaches, some specific probability density functions have been proposed to model

the distribution of state duration explicitly. The parametric approach has the advantage that only a few

parameters are required to completely specify the probability density function. Thus, compared with the

non-parametric approaches, the memory space required by the parametric approach can be significantly reduced.

One of the drawbacks in using parametric duration modeling methods is that the assumed probability

density function may not always match the state duration distribution of each state in a hidden

Markov model. Some probability density functions including Poisson, gamma, bounded and Gaussian

duration density functions have been proposed to model the distribution of state duration. Detailed

formulations of those duration modeling methods are described as follows.

2.2.1 Poisson distribution for state duration

To characterize the duration property more effectively, M. J. Russell (Russell et al., 1985 & Russell et

al., 1987) replaced the self-transition probability in conventional HMM by a Poisson duration density

function so that there was no self-transition from a state back to itself. This is the so-called hidden


semi-Markov model (HSMM). The hidden semi-Markov model with Poisson distributed state duration is

thought to have some advantages. First, the Poisson probability density function represents a plausible

model for state duration. Second, only one parameter, i.e., the state duration mean, is needed to specify

the distribution of state duration. Third, maximum likelihood estimation of the state duration mean can be

accomplished by using the methods which are analogous to the standard Baum-Welch re-estimation

process.

When the distribution of state duration is modeled by a Poisson density function, it is expressed as

$$p_{w,j}(d) = \frac{(\bar{d}_{w,j})^{d-1} \cdot e^{-\bar{d}_{w,j}}}{(d-1)!} \quad \text{for } d \geq 1, \qquad (9)$$

where $\bar{d}_{w,j}$ denotes the duration mean of the $j$-th state in the word model $\lambda(w)$. For comparison, hidden

Markov model (HMM), dynamic time-warping (DTW) and the hidden semi-Markov model (HSMM)

with Poisson distributed state duration were applied to the task of speaker dependent isolated word

recognition (Russell et al., 1985). Experimental results for the third set of recordings showed that the error

rate of the HSMM is 11.8% and 6.3% lower than those of the HMM and DTW, respectively.

2.2.2 Gamma distribution for state duration

In the literature proposed by Levinson (Levinson, 1986), the author first used a family of gamma

probability density functions to characterize the distribution of state duration and formed a continuously

variable duration hidden Markov model (CVDHMM). The gamma distribution was considered to be

ideally suited to the specification of duration density function since it assigns zero probability to negative

duration lengths and only two parameters, state duration mean and variance, are required to specify its

distribution. Moreover, David Burshtein (Burshtein, 1995) proposed a modified Viterbi decoding

algorithm that incorporates both state and word duration models for connected digit string recognition. In


this approach, a duration penalty based on gamma density function is considered at each frame transition.

The modified Viterbi decoding algorithm was proved to have essentially the same computational

requirements as the conventional Viterbi algorithm. The experimental results showed that the modified

Viterbi decoding algorithm with gamma duration distribution reduced the string error rate from 4.77% to

2.86% for the case of unknown string length, and from 2.20% to 1.60% for the case of known string

length as compared with the baseline HMM. The gamma duration density function can be formulated as

$$p_{w,j}(d) = \frac{\xi_{w,j}^{\,\gamma_{w,j}}}{\Gamma(\gamma_{w,j})} \cdot d^{\,\gamma_{w,j}-1} \cdot e^{-\xi_{w,j} \cdot d} \quad \text{for } d \geq 1 \qquad (10)$$

and

$$\gamma_{w,j} = \frac{\bar{d}_{w,j} \cdot \bar{d}_{w,j}}{\nabla_{w,j}}, \qquad \xi_{w,j} = \frac{\bar{d}_{w,j}}{\nabla_{w,j}}, \qquad (11)$$

where $\bar{d}_{w,j}$ and $\nabla_{w,j}$ are the duration mean and variance of the $j$-th state in the word model $\lambda(w)$, respectively. $\Gamma(z)$ is the gamma function defined by

$$\Gamma(z) = \int_0^{\infty} x^{z-1} \cdot e^{-x} \, dx \quad \text{for } z > 0. \qquad (12)$$

2.2.3 Bounded state duration

Due to the characteristic of continuous probability density function, both Poisson and gamma functions

have the advantage of operating well when a relatively small number of training utterances is available.

However, in some situations, there exists the possibility that duration length of some states will be too

long or too short. To avoid those unexpected durations and minimize the erroneous match between testing

utterances and reference models, H. Y. Gu et al. (Gu et al., 1991) proposed a hidden Markov model

with bounded state duration in which the allowable state duration is constrained by some boundaries. The

duration length of each state in this approach is simply bounded by lower and upper bounds in the


recognition phase. The probability density function for bounded state duration is modeled by

$$p_{w,j}(d) = \begin{cases} \dfrac{1}{D^{upper}_{w,j} - D^{lower}_{w,j} + 1}, & \text{if } D^{lower}_{w,j} \leq d \leq D^{upper}_{w,j}, \\[4pt] 0, & \text{otherwise,} \end{cases} \qquad (13)$$

where $D^{lower}_{w,j}$ and $D^{upper}_{w,j}$ are the lower and upper bounds of the state duration for state $j$ of the word model $\lambda(w)$, and can be estimated by

$$D^{lower}_{w,j} = \min_{t=1,\dots,N_w} d_{w,j,t} \qquad (14)$$

and

$$D^{upper}_{w,j} = \max_{t=1,\dots,N_w} d_{w,j,t}. \qquad (15)$$

A series of experiments using all the 408 highly confused first-tone Mandarin syllables (Gu et al., 1991)

were conducted to evaluate the effectiveness of HMM with bounded state duration (BSD). In the

discrete case, the recognition rate of HMM with BSD is 78.5%. This is 9.0%, 6.3% and 1.9% higher

than the conventional HMM’s, HMM’s with Poisson and HMM’s with gamma distributed state duration,

respectively. In the continuous case, the recognition rate of HMM with BSD is 88.3%. This is 6.3%,

5.9% and 3.1% higher than those of conventional HMM, HMM with Poisson and HMM with gamma

distributed state duration, respectively. Similar applications of bounded state duration distribution for

speech recognition can be found in the literature by Kim et al., (Kim et al., 1994), Vaseghi, (Vaseghi,

1995) and Power (Power, 1996). The minimum and maximum durations for each state were estimated in

the training phase. Those loose state duration constraints were then used in the final recognition phase. To

tighten those duration constraints, K. Laurila (Laurila, 1997) employed bounded state duration model in

both the training and recognition phases to achieve higher consistency in state duration constraints.

2.2.4 Gaussian distribution for state duration


A parametric approach using Gaussian probability density function for modeling the state duration

distributions is suggested by Rabiner (Rabiner, 1989). Moreover, David Burshtein (Burshtein, 1995) also

claimed that Gaussian pdf provides good approximation for word duration. By modeling word duration

using Gaussian pdf, the string error rate can be further reduced from 2.86% to 2.78% for the case of

unknown string length, and from 1.60% to 1.59% for the case of known string length as compared with

the baseline HMM. The Gaussian duration density function can be formulated as

$$p_{w,j}(d) = \frac{1}{\sqrt{2\pi \cdot \nabla_{w,j}}} \cdot \exp\!\left( -\frac{(d - \bar{d}_{w,j})^2}{2 \cdot \nabla_{w,j}} \right). \qquad (16)$$
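For reference, the four parametric alternatives of Eqs. (9), (10), (13) and (16) can all be evaluated from a state's duration mean and variance (or duration bounds). The sketch below is only an illustration using NumPy/SciPy; the function names and example parameter values are not from the paper.

```python
import numpy as np
from scipy.special import gammaln, gamma as gamma_fn

def poisson_dur(d, mean):
    # Eq. (9): shifted Poisson; note (d-1)! = Gamma(d), so gammaln(d) = log((d-1)!)
    return np.exp((d - 1) * np.log(mean) - mean - gammaln(d))

def gamma_dur(d, mean, var):
    # Eqs. (10)-(11): shape gamma = mean^2 / var, rate xi = mean / var
    g, xi = mean * mean / var, mean / var
    return xi ** g / gamma_fn(g) * d ** (g - 1) * np.exp(-xi * d)

def bounded_dur(d, lower, upper):
    # Eq. (13): uniform inside [D_lower, D_upper], zero outside
    return np.where((d >= lower) & (d <= upper), 1.0 / (upper - lower + 1), 0.0)

def gaussian_dur(d, mean, var):
    # Eq. (16): Gaussian pdf evaluated at integer duration lengths
    return np.exp(-(d - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

d = np.arange(1, 21)                     # duration lengths of 1..20 frames
print(gamma_dur(d, mean=5.0, var=4.0))   # example: gamma duration pdf
```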

3. Comparison of state duration modeling methods

3.1 Databases and experimental conditions

A task of multi-speaker isolated Mandarin digit recognition was conducted for the comparison of

those state duration modeling methods described above. The database for the experiments was

provided by 50 male and 50 female speakers. Each speaker was asked to utter a set of 10 Mandarin

digits in each of three sessions. In total, 3000 utterances were recorded at a sampling rate of 8

kHz. Each frame, which contained 256 samples with 128 samples of overlap, was multiplied by a

256-point Hamming window. Pre-silence and post-silence of 0.1 to 0.5 seconds were included. Each

digit was modeled as a left-to-right HMM of 7 ~ 9 states, including the pre-silence and the post-silence

states, without jumps. The output of each state was a Gaussian distribution of feature vectors. The feature

vector was composed of 12-order LPC derived cepstral coefficients, 12-order delta cepstral coefficients

and one delta log-energy.

The NOISEX-92 noise database (Varga et al., 1992) was used for generating the noisy speech. In

our study, three kinds of noises, including white noise, F16 cockpit noise and babble noise, were directly


added to the clean speech in time domain to simulate the speech contaminated by noise. When noise was

added to the clean speech, the signal-to-noise ratio (SNR) was defined by the following equation:

$$\mathrm{SNR} = 10 \cdot \log \frac{E_s}{E_n}, \qquad (17)$$

where $E_s$ was the total energy of the clean speech and $E_n$ was the energy of the added noise over the

entire speech portion. The F16 cockpit noise was recorded at the co-pilot’s seat in a two-seat F16

traveling at a speed of 500 knots and an altitude of 300-600 feet. The source of babble noise was 100

people speaking in a canteen, in which individual voices were slightly audible.
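A minimal sketch of generating noisy test data at a prescribed SNR, as defined in Eq. (17): the noise waveform is scaled so that the speech-to-noise energy ratio matches the target value and is then added sample by sample. The function name and the assumption that both signals are 1-D NumPy arrays are ours.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Scale `noise` so that 10*log10(Es/En) equals `snr_db` (Eq. (17)),
    then add it to `speech`.  The noise array is assumed to be at least
    as long as the speech array."""
    noise = noise[:len(speech)]
    e_s = np.sum(speech ** 2)        # total energy of the clean speech, Es
    e_n = np.sum(noise ** 2)         # energy of the noise segment, En
    scale = np.sqrt(e_s / (e_n * 10 ** (snr_db / 10.0)))
    return speech + scale * noise
```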

The subsequent experiments were conducted to examine the following problems : (1) the effectiveness

of state duration modeling methods, (2) the incorporation of state duration modeling in training phase, and

(3) the robustness of state duration modeling methods in noisy environment.

3.2 Effectiveness of state duration modeling methods

The first two sessions of collected utterances in the database were used to train an initial set of word

models by using the segmental k-means algorithm (Rabiner et al., 1986). Once a conventional

HMM-based word model (denoted as ‘Baseline’ HMM) was established for each isolated Mandarin

digit, the training utterances were time-aligned with their corresponding word models. By using the

standard Viterbi decoding algorithm, we can re-decode each utterance into a state sequence and from

which the number of frames spent on every state is known. Based on these decoded state durations, we

can find the distribution of state duration for each state in a word model. This distribution can be treated

as the non-parametric modeling for a state duration and denoted as HMM/Npar. In Fig. 1, we show the

duration distributions of seven states in the HMM/Npar for isolated Mandarin digit ‘4’. The state duration

distributions modeled by Poisson, gamma, Gaussian and bounded density functions are also illustrated in


Fig. 1 for comparison and denoted as HMM/Pois, HMM/Gam, HMM/Gau and HMM/BSD,

respectively. The third session of collected utterances was used as a clean version of testing data for

evaluating the effectiveness of various state duration modeling methods. In the recognition phase, a testing

utterance is decoded into a state sequence by using the standard Viterbi decoding algorithm for the

‘Baseline’ HMM method, and the three-dimensional Viterbi decoding algorithm, i.e., Eqs. (4)-(6),

for the other state duration modeling methods. The resulting recognition rates for various state duration

modeling methods are shown in Table I.

( Fig. 1 and Table I about here )

Let us examine the state duration distributions of HMM/Npar shown in Fig. 1. We can find that the

distribution of state duration is different from state to state and cannot be confined to a certain type of

probability density function. No single probability density function can fit the statistical characteristics of

all the states in a word model. Furthermore, we can also find that HMM/Gam and HMM/Gau are more

capable than HMM/Pois and HMM/BSD in modeling the state duration distributions represented by

HMM/Npar. Particularly, gamma function is slightly better than Gaussian function. This result is consistent

with the conclusion given by David Burshtein (Burshtein, 1995) which stated that the gamma function can

provide high quality approximations for state duration and word duration. For HMM/BSD, lower and

upper bounds of state duration can prevent any state from occupying too many or too few frames.

However, the state duration distribution in the range of allowable durations is treated as a uniform

distribution, which cannot well approximate the actual distribution of state duration. This does affect

the performance as shown in Table I. From the experimental results shown in Table I, we can find that

the HMMs employing non-parametric, gamma and Gaussian state duration models have slightly higher

recognition rates than the baseline HMM. Also, the recognition rate of HMM/Gam is superior to

those of the other methods. This indicates that a good modeling method for state duration can improve the


recognition accuracy.

3.3 Incorporation of state duration modeling in training phase

When the statistics of state duration are considered only in the recognition phase but not in the training phase,

it will result in quite loose state duration constraints (Laurila, 1997). To solve this inconsistency problem,

a variable duration hidden Markov model (VDHMM) (Levinson, 1986, Rabiner et al., 1989 & Laurila,

1997) which incorporates state duration statistics into both training and recognition phases of a word

model has been proposed to seek for further improvement in recognition accuracy. The duration

distribution of each state in a word model can be obtained as follows :

Step 1. The segmental k-means algorithm and standard Viterbi decoding method are used to train an

initial set of word models.

Step 2. The duration statistics for each state in a word model are estimated and modeled by

non-parametric or parametric methods.

Step 3. Using the three-dimensional Viterbi decoding algorithm, each training utterance is decoded

into a maximum likelihood state sequence.

Step 4. According to those maximum likelihood state sequences, the statistics of each state are

re-calculated and the parameters of the underlying state duration model are also revised. Steps 3

and 4 are iterated several times to produce a final set of word models.

In Fig. 2, we show the duration distributions of seven states in those VDHMMs for isolated Mandarin

digit ‘4’ using various state duration modeling methods. The variable duration HMMs with

non-parametric, Poisson, gamma, Gaussian and bounded state duration density functions are denoted as

VDHMM/Npar, VDHMM/Pois, VDHMM/Gam, VDHMM/Gau and VDHMM/BSD, respectively.

Moreover, the clean speech recognition rates based on variable duration HMMs are also shown in


Table II. A comparison of Fig. 1 and Fig. 2 reveals that tighter duration constraints make the fluctuation

phenomenon of some state duration distributions in the HMM/Npar more obvious. This phenomenon can

be found in the 4-th, 5-th and 6-th states of word model ‘4’. In addition, the duration distributions of

some states (e.g., 3-rd and 7-th states) become more concentrated and sharper. Table I and Table II

show that, whether non-parametric or parametric approaches are employed, the VDHMM methods are better

than the corresponding HMM methods. Since there are two confusion sets in Mandarin digit speech (“1”

vs. “7” and “6” vs. “9”), the recognition rate can hardly be further improved in clean speech for this

specific task. Even though the improvement is small, it does demonstrate the effectiveness of applying

state duration models in both training and recognition phases. ( Fig. 2 and Table II about here )

3.4 Robustness of state duration modeling methods

When a speech recognition system is deployed in a noisy environment, the background noise will

cause the mismatch of statistical characteristics between testing speech and reference models. Due to the

environmental mismatch, it is very possible that some states with very high likelihood scores will dominate

the result of the decoding process (Zeljkovic, 1996). Thus, an erroneous maximum likelihood state sequence

with state duration too long or too short may be obtained even if a state duration modeling method is

employed. This phenomenon will cause the drastic degradation of recognition rate of a speech recognizer.

In this subsection, a series of experiments were conducted to evaluate the robustness of various methods

for modeling state duration in noisy environment.

In our experiments, the first two sessions of collected utterances in the database were used to train a

set of word models. To generate noisy speech, a noise with specific SNR values was added to the clean

testing data, i.e., the third session in the database. Those distorted utterances were then evaluated on their

corresponding word models and decoded into state sequences. Thus, from those most likely state


sequences, we can find the state duration distributions under the influence of additive white noise. In Fig.

3 through Fig. 6, the duration distributions of the 5-th and the 6-th states of isolated Mandarin digit ‘4’

under the influence of white noise are plotted. In addition, the recognition rates under the influence of

white noise, F16 cockpit noise and babble noise for various HMMs and VDHMMs are also presented in

Table III and Table IV.

( Fig. 3 - Fig. 6, Table III - Table IV about here )

The results in Table III and Table IV confirm that properly employing a duration model does improve

the recognition accuracy in noisy environments. Above all, further improvement can be obtained by using a

variable duration hidden Markov model. The performances of those HMMs and VDHMMs in different

noisy environments are similar to the results listed in Table I and Table II for clean speech recognition. It is

worth noting that at SNR = 0 dB, the recognition rates based on the bounded state duration (BSD)

modeling method are higher than those of the models based on other parametric duration modeling

methods. One explanation is that the BSD method is more effective than other parametric modeling

methods in preventing a state from occupying too many or too few speech frames. From Fig. 3 through Fig.

6, we can also find that the additive white noise has the effect to distort the duration distribution of each

state in a word model. When the background environment becomes more noisy, the duration distribution

of the 5-th state of Mandarin digit “4” is gradually shifted to the left while the 6-th state to the right.

Especially, when the signal-to-noise ratio is very low, e.g., 0 dB, the duration density functions of some

states become extremely concentrated at some unexpected duration lengths even with the help of state

duration modeling methods. This implies that the underlying duration density functions of those modeling

methods are not robust enough to noise contamination. For some state duration modeling methods, the

probability density functions are relatively smooth in the range of allowable duration lengths. This will

reduce the discriminativity of duration lengths in noisy environment and results in erroneous state


sequence. Moreover, due to the parametric nature, i.e., widespread range of state duration distribution, it

is very possible for a state to stay too long or too short in decoding a state sequence. From above

discussion, we conclude that : (1) The non-parametric duration modeling method can accurately specify

the state duration distribution of each state in a hidden Markov model. (2) The duration modeling method

must be applied in both the training and recognition phases so that the state duration constraints in these

two phases are consistent. (3) A sharper pdf of state duration may enhance the discriminativity of the

allowable duration lengths. (4) A narrow distribution range of state duration can efficiently prevent a

decoded state from being too long or too short.

4. Implementation of the VDHMM/PAD

In this section, a proportional alignment decoding (PAD) algorithm (Hung & Wang, 1997), combined

with the statistics of state durations, is proposed to re-train a conventional hidden Markov model,

resulting in a more robust variable duration hidden Markov model (VDHMM/PAD). Instead of the widely

used Viterbi decoding algorithm, the proportional alignment decoding algorithm is used for state decoding

in the intermediate stage of training a word model. It produces a new set of state duration statistics in

which the distribution of state duration becomes sharper and more concentrated. This matches the

conclusions drawn in the previous section. It is also worth noting that the PAD method is not implemented

in the recognition phase. The detailed implementation of VDHMM/PAD is described as follows.

4.1 Formulation of the proportional alignment decoding algorithm

Consider the training of a word model $\lambda(w)$ that belongs to the set of $M$ word models. The parameter set of the word model $\lambda(w)$ is represented as $\lambda(w) = \{\mu_w, \Sigma_w, \mathrm{P}_w, \mathrm{A}_w, \mathrm{B}_w\}$, where $\mu_w = \{\mu_{w,j}\}$ and $\Sigma_w = \{\Sigma_{w,j}\}$ for $1 \leq j \leq S_w$ denote the mean vector and covariance matrix of the $j$-th state in the word model $\lambda(w)$, respectively. $\mathrm{P}_w = \{p_{w,j}(d)\}$, $\mathrm{A}_w = \{a_{w,ij}\}$ and $\mathrm{B}_w = \{b_{w,j}(O)\}$ for $1 \leq j \leq S_w$ represent the probability density functions of state durations, state transitions and state outputs for the word model $\lambda(w)$, respectively. It is noted that the probability density function $p_{w,j}(d)$ is modeled by the non-parametric duration modeling method. Let $\mathrm{X}(w) = \{\mathrm{X}_t(w), 1 \leq t \leq N_w\}$ be a set of feature vector sequences extracted from all the training utterances for the word model $\lambda(w)$. Here, $\mathrm{X}_t(w)$ denotes the feature vector sequence of the $t$-th training utterance, which has $K_t^w$ frames. This feature vector sequence can be expressed as $\mathrm{X}_t(w) = x_{t,1}^w, x_{t,2}^w, \cdots, x_{t,K_t^w}^w$. Then, in a continuous-density HMM, the output probability density function, $b_{w,j}(x_{t,k}^w)$, can be characterized by a Gaussian function defined as follows:

$$b_{w,j}(x_{t,k}^w) = (2\pi)^{-\frac{D}{2}} \cdot |\Sigma_{w,j}|^{-\frac{1}{2}} \cdot \exp\!\left( -\frac{1}{2} \, (x_{t,k}^w - \mu_{w,j})^T \cdot \Sigma_{w,j}^{-1} \cdot (x_{t,k}^w - \mu_{w,j}) \right), \qquad (18)$$

where $D$ is the dimension of the feature vector $x_{t,k}^w$.
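A small sketch of evaluating the Gaussian output density of Eq. (18) in the log domain for one feature vector, assuming a full covariance matrix; the function and variable names are illustrative only.

```python
import numpy as np

def gaussian_output_logpdf(x, mu, sigma):
    """log b_{w,j}(x) for a D-dimensional feature vector x, state mean mu
    and covariance matrix sigma (Eq. (18))."""
    D = x.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(sigma)             # log |Sigma_{w,j}|
    mahal = diff @ np.linalg.solve(sigma, diff)      # (x-mu)^T Sigma^{-1} (x-mu)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + mahal)

# Example with a 3-dimensional feature vector.
x = np.array([0.2, -0.1, 0.4])
print(gaussian_output_logpdf(x, mu=np.zeros(3), sigma=0.5 * np.eye(3)))
```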

Based on the set of word models $\lambda = \{\lambda(w), 1 \leq w \leq M\}$ and the standard Viterbi decoding algorithm, we can decode the $t$-th training utterance of word $w$, $\mathrm{X}_t(w)$, into a state sequence $q_{w,t} = q_{w,t,1}, q_{w,t,2}, \cdots, q_{w,t,K_t^w}$. Assume $d_{w,j,t}$ denotes the duration of state $j$ in the maximum likelihood state sequence of the $t$-th training utterance for the word model $\lambda(w)$. Then, the state duration mean, $\bar{d}_{w,j}$, of state $j$ in the word model $\lambda(w)$ is formulated as

$$\bar{d}_{w,j} = \frac{1}{N_w} \sum_{t=1}^{N_w} d_{w,j,t} \quad \text{for } 1 \leq j \leq S_w. \qquad (19)$$

Moreover, the word duration mean $\bar{d}_w$, defined as the accumulation of all the state duration means in the word model $\lambda(w)$, can also be expressed as

$$\bar{d}_w = \sum_{j=1}^{S_w} \bar{d}_{w,j}. \qquad (20)$$

Then, the state duration ratio of the $j$-th state to the total states in the word model $\lambda(w)$ can be calculated by

$$\Re_j^w = \frac{\bar{d}_{w,j}}{\bar{d}_w} \quad \text{for } 1 \leq j \leq S_w. \qquad (21)$$

Once we obtain $\Re_j^w$ for all states in every word model, the proportional alignment decoding procedure can proceed in a simple way, and each training utterance of word $w$ is re-decoded into a new state sequence, $\tilde{q}_{w,t}$, where

$$\tilde{q}_{w,t} = \tilde{q}_{w,t,1}, \tilde{q}_{w,t,2}, \cdots, \tilde{q}_{w,t,K_t^w}, \quad 1 \leq w \leq M, \; 1 \leq t \leq N_w. \qquad (22)$$

For example, if the $t$-th training utterance of word $w$ has a duration of $K_t^w$ frames, we segment this training utterance into $S_w$ states according to the following rule:

$$x_{t,k}^w \in \Omega_v(w) \;\text{ and }\; \tilde{q}_{w,t,k} = v \quad \text{iff} \quad k \in \left[ \left( \sum_{j=1}^{v-1} \Re_j^w \right) \cdot K_t^w + 1, \; \left( \sum_{j=1}^{v} \Re_j^w \right) \cdot K_t^w \right], \qquad (23)$$

where $\Omega(w) = \{\Omega_v(w), 1 \leq v \leq S_w\}$ and $\Omega_v(w)$ is the set of collected vectors belonging to state $v$ in the word model $\lambda(w)$.
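A compact sketch of the proportional alignment rule of Eqs. (19)-(23): given the state duration means of a word model, each utterance is cut into contiguous frame blocks whose lengths follow the state duration ratios. The names `num_frames` and `duration_means` are placeholders, and rounding the real-valued boundaries to frame indices is one possible reading of Eq. (23).

```python
import numpy as np

def pad_segment(num_frames, duration_means):
    """Proportional alignment decoding of one utterance (Eqs. (21)-(23)).
    num_frames     : K_t^w, the number of frames in the utterance
    duration_means : [d_bar_{w,1}, ..., d_bar_{w,Sw}] from Eq. (19)
    Returns an array giving the state index (1-based) of every frame."""
    ratios = np.asarray(duration_means, dtype=float)
    ratios /= ratios.sum()                    # Eq. (21): R_j^w = d_bar_{w,j} / d_bar_w
    bounds = np.round(np.cumsum(ratios) * num_frames).astype(int)
    bounds[-1] = num_frames                   # guard against floating-point rounding
    state_of_frame = np.empty(num_frames, dtype=int)
    start = 0
    for v, end in enumerate(bounds, start=1): # Eq. (23): frames start+1..end go to state v
        state_of_frame[start:end] = v
        start = end
    return state_of_frame

# Example: a 20-frame utterance segmented by the duration means of a 5-state model.
print(pad_segment(20, duration_means=[2.0, 5.0, 6.0, 4.0, 3.0]))
```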

4.2 Training procedure of VDHMM/PAD

The training procedure works as follows.

Step 1. Obtain initial word models.

Employing the segmental k-means algorithm (Juang et al., 1990) and the standard Viterbi decoding algorithm, all the feature vectors extracted from training utterances of word $w$ are used to train an initial word model $\lambda^{p-1}(w)$, where $p = 0$ and $1 \leq w \leq M$.

Step 2. Decode training utterances and update word models.

(1) Based on the initial word model $\lambda^{p-1}(w)$, the standard Viterbi decoding algorithm is used to decode each training utterance, such that

$$q_{w,t}^{\,p-1} = \arg\max_{q_{w,t}} \; p(\mathrm{X}_t(w) \mid q_{w,t}, \lambda^{p-1}(w)) \cdot p(q_{w,t} \mid \lambda^{p-1}(w)), \quad 1 \leq w \leq M, \; 1 \leq t \leq N_w. \qquad (24)$$

(2) The decoded state sequence is denoted as $q_{w,t}^{\,p-1} = q_{w,t,1}^{\,p-1}, q_{w,t,2}^{\,p-1}, \cdots, q_{w,t,K_t^w}^{\,p-1}$.

(3) Let $\Omega^{p-1}(w) = \{\Omega_j^{p-1}(w), 1 \leq j \leq S_w\}$, where $\Omega_j^{p-1}(w)$ is the set of vectors of state $j$ in word model $\lambda^{p-1}(w)$. For the feature vector of the $k$-th frame in utterance $t$, $x_{t,k}^w$, this frame belongs to $\Omega_j^{p-1}(w)$ if its corresponding state belongs to state $j$ in model $\lambda^{p-1}(w)$. Then the duration of state $j$ is equal to the number of vectors in utterance $t$ belonging to $\Omega_j^{p-1}(w)$. The duration set is expressed as $d_t^{p-1}(w) = \{d_{w,j,t}^{\,p-1}, 1 \leq j \leq S_w\}$.

Step 3. Align state sequences using PAD method.

(1) Based on the duration set $d_t^{p-1}(w)$, we can find the state duration mean $\bar{d}_{w,j}^{\,p-1}$, word duration mean $\bar{d}_w^{\,p-1}$ and state duration ratio $\Re_j^{w,p-1}$ for each state in the word model $\lambda^{p-1}(w)$ via Eqs. (19)-(21).

(2) Every training utterance of word $w$ is then proportionally segmented into $S_w$ states by using Eq. (23). Thus we can find new state sequences $q_{w,t}^{\,p} = q_{w,t,1}^{\,p}, q_{w,t,2}^{\,p}, \cdots, q_{w,t,K_t^w}^{\,p}$.

(3) Rearrange the set of vectors collected in a state such that $x_{t,k}^w \in \Omega_j^{p}(w)$ if its corresponding state belongs to state $j$ defined for model $\lambda^{p}(w)$. The new duration of state $j$ in utterance $t$, $d_{w,j,t}^{\,p}$, is obtained.

(4) Use the duration set $d_t^{p}(w) = \{d_{w,j,t}^{\,p}, 1 \leq j \leq S_w\}$ and the following equation to calculate the distribution of state duration:

$$p_{w,j}^{\,p}(d) = \frac{1}{N_w} \sum_{t=1}^{N_w} \Theta_d(d_{w,j,t}^{\,p}) \quad \text{for } d \geq 1. \qquad (25)$$

(5) Use $\Omega^{p}(w) = \{\Omega_j^{p}(w), 1 \leq j \leq S_w\}$ to find the parameter set $\{\mu_w^{p}, \Sigma_w^{p}, \mathrm{A}_w^{p}, \mathrm{B}_w^{p}\}$ of the word model $\lambda^{p}(w)$.

Step 4. Re-train the word models.

(1) Calculate the accumulated log-likelihood of $\mathrm{X}(w)$ by

$$\Delta^{p}(w) \equiv \sum_{t=1}^{N_w} \log[\, p(\mathrm{X}_t(w) \mid \lambda^{p}(w)) \,] = \sum_{t=1}^{N_w} \Big\{ \log[\, p(\mathrm{X}_t(w) \mid q_{w,t}^{\,p}, \lambda^{p}(w)) \,] + \log[\, p(q_{w,t}^{\,p} \mid \lambda^{p}(w)) \,] \Big\}, \qquad (26)$$

where

$$p(\mathrm{X}_t(w) \mid q_{w,t}^{\,p}, \lambda^{p}(w)) = \prod_{k=1}^{K_t^w} b_{w, q_{w,t,k}^{\,p}}(x_{t,k}^w) \qquad (27)$$

and

$$p(q_{w,t}^{\,p} \mid \lambda^{p}(w)) = \prod_{k=1}^{K_t^w - 1} a_{w,\, q_{w,t,k}^{\,p}\, q_{w,t,k+1}^{\,p}}. \qquad (28)$$

(2) Based on the word model $\lambda^{p}(w)$, we can use the three-dimensional Viterbi decoding algorithm to find a maximum likelihood state sequence $q_{w,t}^{\,p+1} = q_{w,t,1}^{\,p+1}, q_{w,t,2}^{\,p+1}, \cdots, q_{w,t,K_t^w}^{\,p+1}$ for the $t$-th training utterance.

(3) Collect the vectors such that $x_{t,k}^w \in \Omega_j^{p+1}(w)$ if its corresponding state belongs to state $j$ defined for model $\lambda^{p+1}(w)$.

(4) Use $\Omega^{p+1}(w)$ to update the model parameters and generate the new model $\lambda^{p+1}(w)$.

(5) Update the accumulated log-likelihood of $\mathrm{X}(w)$ by

$$\Delta^{p+1}(w) = \sum_{t=1}^{N_w} \log[\, p(\mathrm{X}_t(w) \mid \lambda^{p+1}(w)) \,], \qquad (29)$$

where the likelihood function $p(\mathrm{X}_t(w) \mid \lambda^{p+1}(w))$ can be evaluated efficiently by using Eqs. (4)-(6).

(6) Convergence testing. IF the improvement rate of $\Delta^{p+1}(w)$ is greater than a preset threshold $\Delta_{th}$, i.e.,

$$\frac{\Delta^{p+1}(w) - \Delta^{p}(w)}{\Delta^{p}(w)} > \Delta_{th}, \qquad (30)$$

THEN $p+1 \rightarrow p$ and repeat Steps 4.(2)-4.(6), ELSE $\lambda^{p+1}(w) \rightarrow \lambda_{VDHMM/PAD}(w)$.

4.3 Recognition procedure of VDHMM/PAD

Consider a testing utterance $\Upsilon$ with $T_y$ frames, $\Upsilon = y_1 y_2 \cdots y_{T_y}$, where $y_j$ denotes the feature vector of the $j$-th frame. The recognition procedure based upon the VDHMM/PAD proceeds as follows.

Step 1. Set $w = 1$.

Step 2. Use the three-dimensional Viterbi decoding algorithm to find a maximum likelihood state sequence $q^{**}$ for the testing utterance $\Upsilon$ based on the word model $\lambda_{VDHMM/PAD}(w)$.

Step 3. Calculate the likelihood score of $\Upsilon$ for the word model $\lambda_{VDHMM/PAD}(w)$ by using Eqs. (4)-(6), i.e.,

$$p[\Upsilon \mid \lambda_{VDHMM/PAD}(w)] = p[\Upsilon \mid q^{**}, \lambda_{VDHMM/PAD}(w)] \cdot p[q^{**} \mid \lambda_{VDHMM/PAD}(w)]. \qquad (31)$$

Step 4. $w + 1 \rightarrow w$. IF $w \leq M$, THEN repeat Step 2 to Step 4, ELSE go to Step 5.

Step 5. Select the word whose likelihood score is highest, i.e.,

$$w^{*} = \arg\max_{w} \; p[\Upsilon \mid \lambda_{VDHMM/PAD}(w)]. \qquad (32)$$
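The final decision of Steps 4 and 5 (Eq. (32)) is simply an argmax over the per-word likelihood scores of Eq. (31). A trivial sketch, assuming the scores have already been computed with the three-dimensional Viterbi algorithm; the dictionary of scores below is made up for illustration.

```python
import numpy as np

def recognize(scores_per_word):
    """Eq. (32): pick the word whose VDHMM/PAD likelihood score is highest.
    scores_per_word maps each word w to log p(Y | lambda_VDHMM/PAD(w))."""
    return max(scores_per_word, key=scores_per_word.get)

# Example with made-up log-likelihood scores for the ten Mandarin digits.
scores = {str(w): s for w, s in zip(range(10), np.random.randn(10))}
print(recognize(scores))
```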

5. Experiments and discussion

In this section, the same procedure described in Section 3.2 is used to find the distribution of state

duration in the VDHMM/PAD. Moreover, to demonstrate the behaviors of state duration distributions of

VDHMM/PAD under the influence of white noise, the same experiments conducted in Section 3.4 are

also implemented here. Fig. 7 and Fig. 8 show the state duration distributions of seven states in the

VDHMM/PAD for isolated Mandarin digit ‘4’ and the distorted state duration distributions due to white

noise contamination. The recognition rates of VDHMM/PAD under the influences of white noise, F16

cockpit noise and babble noise are also listed in Table V. Furthermore, for comparison,

we plot the experimental results listed in Tables I-V in Fig. 9. From those experimental results, we

observe the following:

(1) Distribution of state duration

Comparing Fig. 7 with Fig. 1 and Fig. 2, we can find that for conventional HMM employing various

state duration modeling methods, the distribution of state duration is relatively smooth and widespread.

By incorporating state duration statistics into the training phase, the variable duration HMMs make the

duration distributions of some states more concentrated and sharper. This results in a higher recognition

rate. In Fig. 7, we can observe that for most states (e.g., the 2-nd, 4-th, 5-th and 6-th states) the

allowable ranges of state duration modeled by VDHMM/PAD become more concentrated. The


shapes of state duration distributions are sharper than those of HMMs and VDHMMs. In addition,

compared with the state duration distributions shown in Fig. 1 and Fig. 2, the probability fluctuation in

the VDHMM/PAD is more pronounced. This fluctuation phenomenon occurs in the duration distributions of

the 2-nd, 4-th and 6-th states of VDHMM/Npar and is considered helpful for enhancing its

discriminativity in recognizing noisy speech.

(2) Robustness to noise contamination

When speech signal is contaminated by white noise, the state duration distributions shown in Fig. 3

through Fig. 6 are affected and distorted. Especially, under SNR = 0 dB, duration distributions are

severely distorted and concentrated extremely at some unexpected duration lengths. Using Fig. 3 and

Fig. 4 as examples, we can find that for some HMMs (e.g., HMM/Npar, VDHMM/BSD) the

duration distribution of 5-th state excessively concentrates at duration length of 3 frames for SNR = 0

dB while the other HMMs (e.g., HMM/Gam, VDHMM/Gau) at duration length of one frame.

Moreover, the maximum probabilities of 5-th state duration are also dramatically increased from about

0.2~0.3 up to 0.8~1.0. In contrast to those state duration distributions described in Fig. 3 through Fig.

6, we can observe from Fig. 8 that even under the influence of white noise, the original ranges of state

duration in the VDHMM/PAD keep almost unchanged and the duration distributions are less distorted

by ambient noises. When the SNR value is reduced to 0 dB, the maximum probability of 5-th state

duration is increased from 0.25 up to 0.45. This implies that the VDHMM/PAD is more effective than

other duration modeling methods in preventing the state duration distribution from extremely

concentrating at a specific duration length.

(3) Performance of noisy speech recognition

The recognition rates listed in Table V and the performances shown in Fig. 9 tell us that the

VDHMM/PAD outperforms those HMMs and VDHMMs employing other duration modeling


methods in noisy environment. The improvement is obvious at medium SNR (10 to 15 dB) in the case

of white noise and at low SNR (0 to 5 dB) in the case of F16 cockpit noise and babble noise.

Especially, when the distortion due to ambient noises is serious, such as distortion due to white noise,

the improvement of recognition rates is obvious. The superiority of VDHMM/PAD to the other hidden

Markov models we discussed is essentially due to its novel state duration distributions. It is evident that

the sharper and more concentrated duration distributions and the relatively more fluctuated duration

density function give the VDHMM/PAD better discriminativity and modeling capability in

noisy environments. Moreover, it is noted that the VDHMM/PAD performs worse than the other

hidden Markov models in the clean condition. The reason for this phenomenon can be explained as

follows. The PAD method proportionally segments each training utterance into states. This

segmentation mechanism narrows the allowable ranges of some state duration distributions. Thus, a

property of the VDHMM/PAD is that it can efficiently prevent any state from occupying too many or

too few frames, and so gains performance benefits in noisy environments. However, this method also causes

a duration mismatch between clean testing speech and the reference models. This mismatch makes the

recognition performance of the VDHMM/PAD degrade slightly in clean conditions compared with

the other hidden Markov models.

( Fig. 7 - Fig. 9, Table V about here )

6. Conclusion

In this paper, we first demonstrated the distribution of state duration in a conventional HMM and

compared the effectiveness and performance of some widely used modeling methods for state duration in

noisy environments. Based upon the weaknesses of those modeling methods, a proportional

alignment decoding (PAD) algorithm combined with the statistics of state duration is then proposed in the


training phase to re-train a conventional hidden Markov model and produce a new variable duration

hidden Markov model (VDHMM/PAD). The PAD method enables us to make the distribution of state

duration sharper, more fluctuated and relatively concentrated, and thus improve the model discriminativity

for allowable duration lengths under the influence of ambient noises. Experimental results have

demonstrated the robustness of VDHMM/PAD in noisy speech recognition. The proposed method can

provide better recognition rates than the conventional HMM and the other duration modeling methods in

various noisy environments.

Acknowledgement

The authors would like to thank Dr. Lee Lee-Min of Mingchi Institute of Technology, Taipei, Taiwan,

for sharing his valuable programming experience and for many fruitful discussions.

References

Anastasakos, A., Schwartz, R. & Shu, H. (1995), Duration modeling in large vocabulary speech recognition.

Proceedings of the IEEE International Conference on Acoustic, Speech and Signal Processing, pp.

628-631.

Bonafonte, A., Vidal, J. & Nogueiras, A. (1996). Duration modeling with Expanded HMM applied to

Speech Recognition. Proceedings of International Conference on Spoken Language Processing, pp.

1097-1100.

Burshtein, D. (1995). Robust parametric modeling of durations in hidden Markov models. Proceedings of the

IEEE International Conference on Acoustic, Speech and Signal Processing, pp. 548-551.


Gu, H. Y., Tseng, C. Y. & Lee, L. S. (1991). Isolated-Utterance Speech Recognition Using Hidden

Markov Models with Bounded State Durations. IEEE Trans. on Signal Processing, vol. 39, no.8 , pp.

1743-1752, August.

Hung, W. W. & Wang, H. C. (1997). HMM retraining based on state duration alignment for noisy speech

recognition. in Proc. of European Conference on Speech Communication and Technology

(EUROSPEECH), vol. 3, pp. 1519-1522, September.

Juang, B. H. & Rabiner, L. R. (1985). Mixture autoregressive hidden Markov models for speech signals.

IEEE Trans. Acoust., Speech, Signal Processing, vol. 33, no. 5, pp. 1404-1413.

Juang B. H. and Rabiner L. R. (1990). The segmental k-means algorithm for estimating parameters of hidden

Markov models. IEEE Trans. On Acoustics, Speech, Signal Processing, vol. 38, pp. 1639-1641,

September.

Kim, W. G., Yoon, J. Y. & Youn, D. H. (1994). HMM with global path constraint in Viterbi decoding for

isolated word recognition. Proceedings of the IEEE International Conference on Acoustic, Speech and

Signal Processing, pp. 605-608.

Laurila, K. (1997). Noise robust speech recognition with state duration constraints. Proceedings of the IEEE

International Conference on Acoustic, Speech and Signal Processing, pp. 871-874.

Lee, L. M. & Wang, H. C. (1994). A study on adaptation of cepstral and delta cepstral coefficients for noisy

speech recognition. Proceedings of International Conference on Spoken Language Processing, pp.

1011-1014.

Levinson, S. E. (1986). Continuously variable duration hidden Markov models for speech analysis.

Proceedings of the IEEE International Conference on Acoustic, Speech and Signal Processing, pp.

1241-1244.

Power, K. (1996). Durational modeling for improved connected digit recognition. Proceedings of


International Conference on Spoken Language Processing, pp. 885-888.

Rabiner, L. R. (1989). A tutorial on hidden Markov models and selected applications in

speech recognition. Proc. IEEE, vol. 77, no. 2, pp. 257-286.

Rabiner, L. R. & Juang, B.H. (1986). An introduction to hidden Markov model. IEEE ASSP Magazine,

January, pp.4-16.

Rabiner, L. R., Juang, B. H., Levinson, S. E. & Sondhi, M. M. (1985). Recognition of isolated digits using

hidden Markov models with continuous mixture densities. AT&T Tech. J. , vol. 64, no. 6, pp.

1211-1234, July-Aug.

Rabiner, L. R., Wilpon, J. G. & Juang, B. H. (1986). A segmental k-means training procedure for

connected word recognition. AT&T Technical Journal, Vol. 65, pp. 21-31.

Rabiner, L. R., Wilpon, J. G. & Soong, F. K. (1988). High performance connected digit recognition, using

hidden Markov models. Proceedings of the IEEE International Conference on Acoustic, Speech and

Signal Processing, pp. 119-122.

Russell, M. J. & Cook, A. E. (1987). Experimental evaluation of duration modeling techniques for automatic

speech recognition. Proceedings of the IEEE International Conference on Acoustic, Speech and Signal

Processing, pp. 2376-2379.

Russell, M. J. & Moore, R. K. (1985). Explicit modeling of state occupancy in hidden Markov models

for automatic speech recognition. Proceedings of the IEEE International Conference on Acoustic, Speech

and Signal Processing, pp. 5-8.

Varga, A. Steeneken, H.J.M., Tomlinson, M. & Jones, D. (1992). The NOISEX-92 study on the effect of

additive noise on automatic speech recognition, Technical Report, DRA Speech Research Unit, Malvern,

England.

Vaseghi, S. V. (1995). State duration modeling in hidden Markov models. Signal Processing, Vol. 41, pp.


31-41.

Zeljkovic, I. (1996). Decoding optimal state sequence with smooth state likelihoods. Proceedings of the

IEEE International Conference on Acoustic, Speech and Signal Processing, pp. 129-132.


Table I. Clean speech recognition rates (%) for HMMs using various state duration modeling methods.

| methods           | baseline | HMM/Npar | HMM/Gam | HMM/Gau | HMM/Pois | HMM/BSD |
|-------------------|----------|----------|---------|---------|----------|---------|
| recognition rates | 97.2     | 97.6     | 97.5    | 97.4    | 97.2     | 96.8    |

Table II. Clean speech recognition rates (%) for VDHMMs using various state duration modeling methods.

| methods           | baseline | VDHMM/Npar | VDHMM/Gam | VDHMM/Gau | VDHMM/Pois | VDHMM/BSD |
|-------------------|----------|------------|-----------|-----------|------------|-----------|
| recognition rates | 97.2     | 97.6       | 97.6      | 97.5      | 97.4       | 97.1      |

Table III. Noisy speech recognition rates (%) for HMMs using various state duration modeling methods: (a) white noise.

| SNR   | baseline | HMM/Npar | HMM/Gam | HMM/Gau | HMM/Pois | HMM/BSD |
|-------|----------|----------|---------|---------|----------|---------|
| clean | 97.2     | 97.6     | 97.5    | 97.4    | 97.2     | 96.8    |
| 20 dB | 48.8     | 62.0     | 60.9    | 60.4    | 59.6     | 57.0    |
| 15 dB | 30.8     | 42.8     | 41.1    | 40.5    | 40.2     | 38.5    |
| 10 dB | 19.2     | 26.8     | 25.4    | 24.7    | 25.3     | 23.6    |
| 5 dB  | 11.2     | 20.8     | 20.1    | 19.4    | 19.7     | 19.3    |
| 0 dB  | 10.0     | 17.6     | 16.4    | 16.0    | 16.0     | 17.6    |

Table III. Noisy speech recognition rates (%) for HMMs using various state duration modeling methods: (b) F16 cockpit noise.

| SNR   | baseline | HMM/Npar | HMM/Gam | HMM/Gau | HMM/Pois | HMM/BSD |
|-------|----------|----------|---------|---------|----------|---------|
| 20 dB | 92.0     | 95.2     | 93.8    | 93.5    | 93.2     | 92.8    |
| 15 dB | 79.6     | 85.5     | 83.6    | 81.7    | 80.8     | 80.1    |
| 10 dB | 67.6     | 74.7     | 73.2    | 72.8    | 72.5     | 71.6    |
| 5 dB  | 44.0     | 54.3     | 53.7    | 52.8    | 53.4     | 52.2    |
| 0 dB  | 15.2     | 25.6     | 23.5    | 22.5    | 22.8     | 22.3    |


Table III. Noisy speech recognition rates (%) for HMMs using various state duration modeling methods: (c) babble noise.

| SNR   | baseline | HMM/Npar | HMM/Gam | HMM/Gau | HMM/Pois | HMM/BSD |
|-------|----------|----------|---------|---------|----------|---------|
| 20 dB | 94.8     | 95.9     | 95.6    | 95.4    | 95.2     | 94.9    |
| 15 dB | 88.0     | 92.2     | 91.1    | 90.3    | 89.7     | 88.2    |
| 10 dB | 75.2     | 80.4     | 79.3    | 76.9    | 77.8     | 75.6    |
| 5 dB  | 58.4     | 70.4     | 68.9    | 65.8    | 66.1     | 63.7    |
| 0 dB  | 33.2     | 42.8     | 41.4    | 38.6    | 39.3     | 38.5    |

Table IV. Noisy speech recognition rates (%) for VDHMMs using various state duration modeling methods: (a) white noise.

| SNR   | baseline | VDHMM/Npar | VDHMM/Gam | VDHMM/Gau | VDHMM/Pois | VDHMM/BSD |
|-------|----------|------------|-----------|-----------|------------|-----------|
| clean | 97.2     | 97.6       | 97.6      | 97.5      | 97.4       | 97.1      |
| 20 dB | 48.8     | 67.6       | 64.8      | 63.9      | 61.6       | 59.4      |
| 15 dB | 30.8     | 49.2       | 46.8      | 45.9      | 43.6       | 42.1      |
| 10 dB | 19.2     | 31.2       | 29.0      | 27.4      | 28.4       | 26.9      |
| 5 dB  | 11.2     | 24.0       | 22.8      | 21.7      | 22.0       | 20.8      |
| 0 dB  | 10.0     | 18.4       | 17.3      | 17.1      | 17.2       | 18.5      |

Table IV. Noisy speech recognition rates (%) for VDHMMs using various state duration modeling methods: (b) F16 cockpit noise.

| SNR   | baseline | VDHMM/Npar | VDHMM/Gam | VDHMM/Gau | VDHMM/Pois | VDHMM/BSD |
|-------|----------|------------|-----------|-----------|------------|-----------|
| 20 dB | 92.0     | 96.0       | 94.4      | 94.3      | 94.0       | 93.6      |
| 15 dB | 79.6     | 86.4       | 84.1      | 82.3      | 81.4       | 80.9      |
| 10 dB | 67.6     | 76.3       | 74.5      | 73.9      | 73.8       | 72.5      |
| 5 dB  | 44.0     | 55.3       | 54.8      | 53.5      | 54.2       | 53.0      |
| 0 dB  | 15.2     | 28.2       | 26.3      | 24.9      | 25.5       | 24.5      |


Table IV. Noisy speech recognition rates (%) for VDHMMs using various state duration modeling methods: (c) babble noise.

| SNR   | baseline | VDHMM/Npar | VDHMM/Gam | VDHMM/Gau | VDHMM/Pois | VDHMM/BSD |
|-------|----------|------------|-----------|-----------|------------|-----------|
| 20 dB | 94.8     | 96.4       | 96.2      | 95.8      | 95.6       | 95.3      |
| 15 dB | 88.0     | 93.5       | 91.8      | 90.9      | 90.6       | 89.4      |
| 10 dB | 75.2     | 82.4       | 80.8      | 79.3      | 80.1       | 77.2      |
| 5 dB  | 58.4     | 71.5       | 69.7      | 66.1      | 67.3       | 65.2      |
| 0 dB  | 33.2     | 45.2       | 43.6      | 40.9      | 42.1       | 40.6      |

Table V. Noisy speech recognition rates (%) for VDHMM/PAD.

| noise type        | clean | 20 dB | 15 dB | 10 dB | 5 dB | 0 dB |
|-------------------|-------|-------|-------|-------|------|------|
| white noise       | 96.8  | 72.4  | 60.0  | 44.0  | 29.6 | 24.8 |
| F16 cockpit noise | 96.8  | 95.2  | 87.3  | 79.9  | 60.2 | 35.1 |
| babble noise      | 96.8  | 95.9  | 94.4  | 84.7  | 76.4 | 52.1 |

