1 / 27

John-Paul Hosom¹, Alexander Kain, Brian O. Bush

Towards the Recovery of Targets from Coarticulated Speech for Automatic Speech Recognition

Center for Spoken Language Understanding (CSLU)

Department of Biomedical Engineering (BME)

Oregon Health & Science University (OHSU)

¹ John-Paul Hosom is now with Sensory, Inc.

2 / 27

Outline

1. Introduction: Error Analysis of TIMIT ASR

2. Introduction: Hypothesis

3. Background: Characteristics of Clear Speech

4. Background: Formant Targets and Locus Theory

5. Objectives of Current Study

6. Corpus

7. Model

8. Methods

9. Results

10. Conclusions & Future Work

3 / 27

1. Introduction: Error Analysis of TIMIT

Phoneme ASR of TIMIT:

• HMM/ANN system trained on TIMIT, decoded with a bigram language model [Hosom et al., 2010]

• Accuracy of 74% is high for HMM-based systems.

• Vowel substitutions account for 35% of errors, covering all distinctive-phonetic feature dimensions:

  – 62% of vowel substitutions have a front/back error

  – 52% of vowel substitutions have a tense/lax error

  – 68% of vowel substitutions have a height error

  – 31% of vowel substitutions have a vowel/consonant error

• Errors are not confined to a specific type.

4 / 27

2. Introduction: Hypothesis

Hypothesis:

• Our hypothesis is that a main cause of ASR errors is not the feature space or the classification technique (either of which might result in more distinct error patterns), but “noise” in the probability estimates.

• This noise is caused by variability in the features, which can be reduced by estimating phoneme targets instead of using the observed values.

• For now, we work in the formant space, although targets need not be limited to this space.

• Also, to control for speaking style, we are interested in both “clear” and “conversational” speech.

5 / 27

3. Characteristics of Clear Speech

Clear speech: observed midpoints of vowels, accuracy 90%

Conversational speech: observed midpoints of vowels, accuracy 73%

• “Clear” speech has expanded vowel space and longer phoneme durations

6 / 27

3. Characteristics of Clear Speech

• Using a “hybridization” algorithm that combines features of clear (CLR) and conversational (CNV) speech, together with perceptual testing, we have shown over several experiments that the features most relevant to intelligibility are the combination of spectrum and duration [Kain, Amano-Kusumoto, and Hosom (2008); Kusumoto, Kain, Hosom, and van Santen (2007); Amano-Kusumoto and Hosom (2009)].

• This has led us to study a model of coarticulation that quantitatively describes the change of formants over time.

7 / 27

4. Formant Targets and Locus Theory

[From Klatt 1987, p. 753]

8 / 27

[Figure: formant example for /d/ /u/; axes: time, frequency]

• Most consonants (all except /j/, /l/, /r/, /w/) do not have visible formants

• They have “virtual” formants identified by coarticulation in the vowel.


9 / 27


[from Delattre et al., 1955 as reported in Johnson, 1997, p. 135]

10 / 27


Locus Theory Summary:

1. Vowels and consonants have formant targets; most consonants have “virtual” formants.

2. Coarticulation yields smooth change between targets when formants are visible.

3. If duration is too short, formants do not reach their targets, yielding undershoot.

4. Both the targets and the rate of change are important for intelligibility.

11 / 27

Outline

1. Introduction: Error Analysis of TIMIT ASR

2. Introduction: Hypothesis

3. Background: Characteristics of Clear Speech

4. Background: Formant Targets and Locus Theory

5. Objectives of Current Study

6. Corpus

7. Model

8. Methods

9. Results

10. Conclusions & Future Work

12 / 27


5. Objectives of Current Study

• Estimate each phoneme’s target instead of relying on observed data.

• Using targets will reduce the variance of features, yielding conversational (and clear) speech with feature overlap similar to that of clear speech.

• Given a GMM of target values (instead of observed values), compute the probability of each token’s estimated targets given each possible phoneme.

• Use a semi-Markov model to address the trajectory of features. In a real system, move from formants to another feature domain.
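
The third objective, scoring a token's estimated targets against a model of target values for each phoneme, can be sketched as follows. This is a minimal illustration rather than the authors' system: it uses a single diagonal Gaussian per phoneme (the one-component special case of a GMM), and the phoneme labels, (F1, F2) means, and standard deviations are hypothetical values chosen only for the example.

```python
import numpy as np

# Hypothetical (F1, F2) target means and standard deviations in Hz.
# These numbers are illustrative assumptions, not values from the study.
PHONEME_MODELS = {
    "iy": (np.array([300.0, 2300.0]), np.array([40.0, 150.0])),
    "aa": (np.array([750.0, 1200.0]), np.array([60.0, 120.0])),
    "uw": (np.array([320.0, 900.0]), np.array([40.0, 130.0])),
}


def log_likelihood(target, mean, std):
    """Log density of a diagonal Gaussian evaluated at the target vector."""
    z = (target - mean) / std
    norm = np.sum(np.log(std)) + 0.5 * len(mean) * np.log(2.0 * np.pi)
    return float(-0.5 * np.sum(z ** 2) - norm)


def classify(target):
    """Return the best-scoring phoneme and all log p(target | phoneme)."""
    scores = {
        phoneme: log_likelihood(target, mean, std)
        for phoneme, (mean, std) in PHONEME_MODELS.items()
    }
    return max(scores, key=scores.get), scores


# One token's estimated vowel target (hypothetical values, in Hz).
best, scores = classify(np.array([310.0, 2200.0]))
print(best, scores)
```

A full GMM would replace `log_likelihood` with a log-sum over weighted components; the per-phoneme scoring loop would stay the same.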

13 / 27


Mean and standard deviation of target values

14 / 27

6. Corpus

Corpus:

1. Male and female speaker

2. Sentences contain a neutral carrier phrase (5 total) followed by a target word (242 total)

3. Target words are common English CVC words with 23 initial and final consonants and 8 vowels.

4. All sentences spoken in both clear and conversational styles.

5. Two recordings per style of each sentence.

6. Formants and phoneme boundaries automatically estimated, manually corrected with verification.

15 / 27

7. Model

Formant trajectory model:

\[
\hat{f}(t) = d(t; s_1, p_1)\,(T_{C1} - T_V) + T_V + d(t; s_2, p_2)\,(T_{C2} - T_V)
\]

\[
d(t; s, p) = \frac{1}{1 + e^{s(t - p)}}
\]

• $\hat{f}(t)$ is the estimated formant trajectory over time t.

• $T_{C1}$, $T_V$, $T_{C2}$ are target formant values for C1, V, C2.

• $d(t; s, p)$ is the degree of articulation of C1 or C2.

• $d(t; s, p)$ is a sigmoid function over time t.

• s sets the maximum slope of $d(t; s, p)$, and p is the position in time of that slope.
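
The trajectory model can be sketched in a few lines of Python. The sketch assumes the form f̂(t) = d(t; s1, p1)(TC1 − TV) + TV + d(t; s2, p2)(TC2 − TV) with d(t; s, p) = 1 / (1 + exp(s(t − p))). Under that sign convention, which is an assumption here, s1 is positive so the influence of C1 decays over time, and s2 is negative so the influence of C2 grows. All target values, slopes, and positions below are hypothetical.

```python
import numpy as np


def sigmoid(t, s, p):
    """d(t; s, p): sigmoid over time with slope parameter s and position p."""
    return 1.0 / (1.0 + np.exp(s * (t - p)))


def trajectory(t, T_c1, T_v, T_c2, s1, p1, s2, p2):
    """Estimated formant trajectory for one CVC token."""
    d1 = sigmoid(t, s1, p1)  # contribution of C1 (decays when s1 > 0)
    d2 = sigmoid(t, s2, p2)  # contribution of C2 (grows when s2 < 0)
    return d1 * (T_c1 - T_v) + T_v + d2 * (T_c2 - T_v)


# Hypothetical second-formant targets in Hz: consonants at 1800, vowel at 1000.
t_long = np.linspace(0.0, 0.30, 61)   # 300 ms token
t_short = np.linspace(0.0, 0.10, 21)  # 100 ms token

long_v = trajectory(t_long, 1800.0, 1000.0, 1800.0, 60.0, 0.05, -60.0, 0.25)
short_v = trajectory(t_short, 1800.0, 1000.0, 1800.0, 60.0, 0.02, -60.0, 0.08)

# The long token comes close to the vowel target; the short one undershoots.
print(long_v.min(), short_v.min())
```

With the same targets and slopes, the shorter token never gets close to the 1000 Hz vowel target, which is the undershoot described in the locus theory summary.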

16 / 27


7. Model

17 / 27


7. Model

18 / 27

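8. Methods

Estimating Model Parameters:

1. Two sets of parameters to estimate:

   (a) s1, s2, p1, p2 estimated on a per-token basis

   (b) TC1, TV, TC2 estimated on a per-token basis (independent of speaking style) and then averaged

2. For one token, error is:

\[
\mathrm{Err} = \frac{1}{F\,(t_2 - t_1)} \sum_{i=1}^{F} \sum_{t=t_1}^{t_2} \left( f_i(t) - \hat{f}_i(t) \right)^2
\]

   where $f_i(t)$ is the observed value of formant i at time t, $\hat{f}_i(t)$ is the model estimate, F is the number of formants, and $t_1$, $t_2$ bound the token.

3. Genetic algorithm used to find best estimates. Fitness function is error summed over all tokens.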
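
A per-token error of this kind can be sketched as below. The exact normalisation on the slide is not recoverable from the extracted text, so averaging the squared difference over every formant and frame is an assumption, as are the function and argument names.

```python
import numpy as np


def token_error(observed, estimated):
    """Mean squared difference between observed and modelled formants.

    Both arguments have shape (F, N): F formants, N time frames.
    """
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    if observed.shape != estimated.shape:
        raise ValueError("observed and estimated must have the same shape")
    return float(np.mean((observed - estimated) ** 2))


def total_error(tokens):
    """Fitness over a data set: per-token error summed over all tokens."""
    return sum(token_error(obs, est) for obs, est in tokens)


# Tiny hypothetical example: two formants, three frames, one mismatched sample.
obs = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
est = [[1.0, 2.0, 3.0], [4.0, 5.0, 8.0]]
print(token_error(obs, est))
```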

19 / 27

8. Methods

Estimating Model Parameters:

4. Genetic algorithm employs mutation, crossover (exchange of group of parameters), and elitism (best solutions retained in next generation).

5. Data partitioned into 20 folds for n-fold validation.

6. Performed 60 randomly-initialized starts in parameter estimation to get 60 points per phoneme.
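
The three genetic-algorithm ingredients named above (mutation, crossover as an exchange of a group of parameters, and elitism) can be sketched generically. This is not the authors' implementation: the population size, number of generations, selection rule, mutation scale, and the toy quadratic fitness are all assumptions made for illustration.

```python
import numpy as np


def genetic_minimize(fitness, bounds, pop_size=40, generations=60,
                     n_elite=4, mutation_scale=0.05, seed=0):
    """Minimise fitness(params) inside box bounds with a simple GA."""
    rng = np.random.default_rng(seed)
    low, high = np.array(bounds, dtype=float).T
    dim = len(low)
    pop = rng.uniform(low, high, size=(pop_size, dim))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        pop = pop[np.argsort(scores)]          # best individuals first
        elite = pop[:n_elite].copy()           # elitism: keep the best as-is
        parents = pop[: pop_size // 2]         # select from the better half
        children = []
        while len(children) < pop_size - n_elite:
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, dim) if dim > 1 else 0
            child = np.concatenate([a[:cut], b[cut:]])  # crossover
            child += rng.normal(0.0, mutation_scale * (high - low))  # mutation
            children.append(np.clip(child, low, high))
        pop = np.vstack([elite, np.array(children)])
    scores = np.array([fitness(ind) for ind in pop])
    best = int(np.argmin(scores))
    return pop[best], float(scores[best])


# Toy fitness with a known minimum at (1, -2); a stand-in for the summed error.
def toy_fitness(x):
    return (x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2


best_params, best_score = genetic_minimize(toy_fitness, [(-5.0, 5.0), (-5.0, 5.0)])
print(best_params, best_score)
```

Repeating the search with different `seed` values corresponds to the randomly initialized starts described in item 6, each of which yields one estimate per parameter.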

20 / 27

9. Results

Histograms of vowel contribution:

Maximum contribution of vowel for both speaking styles (enforced minimum value of 0.6)

[Figure panels: clear speech; conversational speech]

21 / 27

9. Results

Target Estimation Results:

Estimated targets for vowels (60 points per phoneme)

22 / 27

9. Results


Estimated targets for C1 (60 points per phoneme)

23 / 27

9. Results


Vowel classification accuracy on training data: xx.x% (xx.x% CLR, xx.x% CNV)

Vowel classification accuracy on test data: xx.x% (xx.x% CLR, xx.x% CNV)

24 / 27

9. Results

Coarticulation Parameter Results:

• Error surface between estimated model and observed data as a function of s and p:

• A low error can be obtained for many values of s.
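
An error surface over s and p can be computed by evaluating the model on a grid. The sketch below simplifies to a single consonant-to-vowel transition with fixed targets and noiseless synthetic "observed" data, and it assumes the sigmoid d(t; s, p) = 1 / (1 + exp(s(t − p))). The grid ranges, targets, and the 50 Hz tolerance are hypothetical choices, so the output illustrates the procedure and does not reproduce the slide's surface.

```python
import numpy as np


def sigmoid(t, s, p):
    """d(t; s, p): sigmoid over time with slope parameter s and position p."""
    return 1.0 / (1.0 + np.exp(s * (t - p)))


def cv_trajectory(t, T_c, T_v, s, p):
    """Single consonant-to-vowel transition between two fixed targets."""
    return sigmoid(t, s, p) * (T_c - T_v) + T_v


t = np.linspace(0.0, 0.10, 21)           # 100 ms of frames
s_grid = np.arange(10.0, 201.0, 10.0)    # candidate slope values
p_grid = np.linspace(0.0, 0.10, 11)      # candidate positions (seconds)

# Synthetic observation generated from one grid point (hypothetical targets).
s_true, p_true = s_grid[5], p_grid[3]
observed = cv_trajectory(t, 1800.0, 1000.0, s_true, p_true)

# surface[i, j] = mean squared error for slope s_grid[i], position p_grid[j].
surface = np.array([
    [np.mean((observed - cv_trajectory(t, 1800.0, 1000.0, s, p)) ** 2)
     for p in p_grid]
    for s in s_grid
])

i, j = np.unravel_index(np.argmin(surface), surface.shape)
print("best s, p:", s_grid[i], p_grid[j])

# Slope values whose best p still gives an RMS error under 50 Hz.
near_min = surface.min(axis=1) < 50.0 ** 2
print("s values with low error:", s_grid[near_min])
```

Listing the slope values that stay under a tolerance is one way to quantify how flat the surface is along s for a given token.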

25 / 27

9. Results

Coarticulation Parameter Results:

• As a result, s shows differences in mean for different consonants, but high variance:

Mean and standard deviation of second-formant s1 values for 10 phonemes

• Therefore, s values do not cluster well, and

• s values cannot be reliably extracted for a single CVC token.

26 / 27


Conclusions:

1. Estimation of consonant and vowel targets can be performed reliably when estimating over a large number of CVCs.

2. Estimation of coarticulation parameter s cannot be performed reliably (yet) for a single CVC.

3. Therefore, formant targets cannot (yet) be reliably estimated for a single token, which is necessary to apply this work to automatic speech recognition.

27 / 27

Future Work:

1. Determine and apply constraints on s, so that coarticulation parameters and formant targets can be reliably estimated for a single token.

2. Given estimated targets for a CVC, estimate the probability of these targets given each phoneme: p(target | phoneme)

3. Use these probabilities instead of the probabilities currently used in speech recognition, p(observed data at 10-msec frame | phoneme)

4. Expand to recognize arbitrary length phoneme sequence and use non-formant features.


28 / 27

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant IIS-091575.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
