Date post: 19-Jan-2016
John-Paul Hosom¹, Alexander Kain, Brian O. Bush

Towards the Recovery of Targets from Coarticulated Speech for Automatic Speech Recognition

Center for Spoken Language Understanding (CSLU)

Department of Biomedical Engineering (BME)

Oregon Health & Science University (OHSU)

¹ John-Paul Hosom is now with Sensory, Inc.

Outline

1. Introduction: Error Analysis of TIMIT ASR

2. Introduction: Hypothesis

3. Background: Characteristics of Clear Speech

4. Background: Formant Targets and Locus Theory

5. Objectives of Current Study

6. Corpus

7. Model

8. Methods

9. Results

10. Conclusions & Future Work

1. Introduction: Error Analysis of TIMIT

Phoneme ASR of TIMIT:

• HMM/ANN system trained on TIMIT, decoded with bigram language model [Hosom et al., 2010]

• Accuracy of 74% is high for HMM-based systems.

• Vowel substitutions account for 35% of errors, covering all distinctive-phonetic feature dimensions:

– 62% of vowel substitutions have front/back error

– 52% of vowel substitutions have tense/lax error

– 68% of vowel substitutions have height error

– 31% of vowel substitutions have vowel/cons. error

• Errors are not confined to a specific type.

2. Introduction: Hypothesis

Hypothesis:

• Our hypothesis is that a main cause of ASR errors is not the feature space or the classification technique (which might result in more distinct error patterns), but “noise” in the probability estimates.

• This noise is caused by variability in the features, which can be reduced by estimating phoneme targets instead of the observed values.

• For now, we work in the formant space, although targets need not be limited to this space.

• Also, to control for speaking style, we are interested in both “clear” and “conversational” speech.

3. Characteristics of Clear Speech

clear speech: observed midpoints of vowels, accuracy 90%

conversational speech: observed midpoints of vowels, accuracy 73%

• “Clear” speech has expanded vowel space and longer phoneme durations

3. Characteristics of Clear Speech

• Using a “hybridization” algorithm that combined features of clear (CLR) and conversational (CNV) speech, together with perceptual testing, we have shown over several experiments that the most relevant features for intelligibility are the combination of spectrum and duration. [Kain, Amano-Kusumoto, and Hosom (2008); Kusumoto, Kain, Hosom, and van Santen (2007); Amano-Kusumoto and Hosom (2009)]

• This has led us to study a model of coarticulation, to quantitatively model the change of formants over time.

4. Formant Targets and Locus Theory

[From Klatt 1987, p. 753]

4. Formant Targets and Locus Theory

[Figure: formant tracks for /d/ /u/; axes: time, frequency]

• Most consonants (all except /j/, /l/, /r/, /w/) do not have visible formants.

• They have “virtual” formants identified by coarticulation in the vowel.

4. Formant Targets and Locus Theory

[from Delattre et al., 1955 as reported in Johnson, 1997, p. 135]

4. Formant Targets and Locus Theory

Locus Theory Summary:

1. Vowels and consonants have formant targets; most consonants have “virtual” formants.

2. Coarticulation yields smooth change between targets when formants are visible.

3. If duration is too short, formants do not reach their targets, yielding undershoot.

4. Both the targets and the rate of change are important for intelligibility.

Outline

1. Introduction: Error Analysis of TIMIT ASR

2. Introduction: Hypothesis

3. Background: Characteristics of Clear Speech

4. Background: Formant Targets and Locus Theory

5. Objectives of Current Study

6. Corpus

7. Model

8. Methods

9. Results

10. Conclusions & Future Work

5. Objectives of Current Study

• Estimate each phoneme’s target instead of relying on observed data.

• Using targets will reduce the variance of features, yielding conversational (and clear) speech with feature overlap similar to that of clear speech.

• Given a GMM of target values (instead of observed values), compute the probability of each token’s estimated targets given each possible phoneme.

• Use semi-Markov model to address trajectory of features. In real system, move from formants to another feature domain.
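
The third objective, scoring a token's estimated targets against each phoneme, can be sketched as follows. This is a minimal illustration rather than the authors' system: it replaces the GMM with a single diagonal-covariance Gaussian per phoneme, and the phoneme labels, means, and standard deviations in `PHONEME_MODELS` are hypothetical values chosen only for the example.

```python
import math


def log_gaussian(x, mean, std):
    """Log density of a diagonal-covariance Gaussian evaluated at vector x."""
    total = 0.0
    for xi, mi, si in zip(x, mean, std):
        total += -0.5 * math.log(2.0 * math.pi * si * si)
        total += -0.5 * ((xi - mi) / si) ** 2
    return total


# Hypothetical (F1, F2) target means and standard deviations in Hz.
# A real system would estimate these (or a full GMM) from training tokens.
PHONEME_MODELS = {
    "iy": ((300.0, 2300.0), (40.0, 150.0)),
    "aa": ((750.0, 1200.0), (60.0, 120.0)),
    "uw": ((320.0, 900.0), (40.0, 130.0)),
}


def classify(target):
    """Return the best phoneme and log p(target | phoneme) for every phoneme."""
    scores = {
        phoneme: log_gaussian(target, mean, std)
        for phoneme, (mean, std) in PHONEME_MODELS.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


best, scores = classify((310.0, 2200.0))
print(best)  # → iy
```

Because the scores are computed on estimated targets rather than on observed frames, the per-phoneme variances are the quantity the objectives above aim to shrink.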

5. Objectives of Current Study

Mean and standard deviation of target values

6. Corpus

Corpus:

1. Male and female speaker

2. Sentences contain a neutral carrier phrase (5 total) followed by a target word (242 total)

3. Target words are common English CVC words with 23 initial and final consonants and 8 vowels.

4. All sentences spoken in both clear and conversational styles.

5. Two recordings per style of each sentence.

6. Formants and phoneme boundaries automatically estimated, manually corrected with verification.

7. Model

Formant trajectory model:

$$\hat{f}(t) = d(t; s_1, p_1)\,(T_{C1} - T_V) + T_V + d(t; s_2, p_2)\,(T_{C2} - T_V)$$

$$d(t; s, p) = \frac{1}{1 + e^{s(t - p)}}$$

• $\hat{f}(t)$ is the estimated formant trajectory over time t.

• $T_{C1}$, $T_V$, $T_{C2}$ are target formant values for C1, V, C2.

• $d(t; s, p)$ is the degree of articulation of C1 or C2.

• $d(t; s, p)$ is a sigmoid function over time t.

• s is maximum slope of $d(t; s, p)$, p is position of s.
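
The trajectory model on this slide can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the authors' code: the sign convention of the sigmoid (positive s decays, negative s rises), the normalization of time to [0, 1], and all numeric values are assumptions made for the example.

```python
import math


def d(t, s, p):
    """Sigmoid degree of articulation of a consonant at time t.

    Assumed sign convention: s > 0 decays from 1 to 0 (suits C1),
    s < 0 rises from 0 to 1 (suits C2); p is the sigmoid midpoint.
    """
    return 1.0 / (1.0 + math.exp(s * (t - p)))


def trajectory(t, T_C1, T_V, T_C2, s1, p1, s2, p2):
    """Estimated formant value f_hat(t) for one CVC token."""
    return d(t, s1, p1) * (T_C1 - T_V) + T_V + d(t, s2, p2) * (T_C2 - T_V)


# Hypothetical second-formant targets in Hz; time is normalized to [0, 1].
T_C1, T_V, T_C2 = 1800.0, 1000.0, 1600.0

# Steep, well-separated sigmoids: the vowel target is reached at the midpoint.
reached = trajectory(0.5, T_C1, T_V, T_C2, 40.0, 0.2, -40.0, 0.8)

# Shallow, overlapping sigmoids: the formant never gets down to the
# vowel target, which is the undershoot described by locus theory.
undershoot = trajectory(0.5, T_C1, T_V, T_C2, 10.0, 0.4, -10.0, 0.6)

print(reached, undershoot)
```

In the second case the observed midpoint lies well away from the 1000 Hz vowel target even though the target itself is unchanged, which is why the study estimates targets instead of using observed midpoints.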

8. Methods

Estimating Model Parameters:

1. Two sets of parameters to estimate:

(a) s1, s2, p1, p2 estimated on a per-token basis

(b) TC1, TV, TC2 estimated on a per-token basis (independent of speaking style) and then averaged

2. For one token, error is:

$$\mathrm{Err} = \sum_{i=1}^{F} \frac{1}{t_2 - t_1} \sum_{t=t_1}^{t_2} \left( f_i(t) - \hat{f}_i(t) \right)^2, \quad F = 2$$

3. Genetic algorithm used to find best estimates. Fitness function is error summed over all tokens.
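
The per-token error can be sketched as below. This is a hedged illustration, not the authors' implementation: it assumes the error is a squared difference summed over frames and formants and divided by the number of frame intervals, that time is normalized to [0, 1], and that each formant has its own seven parameters. The parameter values are hypothetical.

```python
import math


def d(t, s, p):
    # Sigmoid degree of articulation (sign convention assumed: s > 0 decays).
    return 1.0 / (1.0 + math.exp(s * (t - p)))


def trajectory(t, T_C1, T_V, T_C2, s1, p1, s2, p2):
    return d(t, s1, p1) * (T_C1 - T_V) + T_V + d(t, s2, p2) * (T_C2 - T_V)


def token_error(observed, params):
    """Squared error between observed and modeled formants for one token.

    observed[i][k] is formant i at frame k; params[i] holds the seven model
    parameters for formant i. Frames are mapped to t in [0, 1], and the sum
    over frames is divided by the number of frame intervals (t2 - t1).
    """
    total = 0.0
    for frames, prm in zip(observed, params):
        n = len(frames)
        sq = sum((f - trajectory(k / (n - 1), *prm)) ** 2 for k, f in enumerate(frames))
        total += sq / (n - 1)
    return total


# Hypothetical parameters for F1 and F2: (T_C1, T_V, T_C2, s1, p1, s2, p2).
params = [
    (400.0, 700.0, 350.0, 30.0, 0.25, -30.0, 0.75),
    (1800.0, 1000.0, 1600.0, 30.0, 0.25, -30.0, 0.75),
]
n_frames = 11
clean = [[trajectory(k / (n_frames - 1), *prm) for k in range(n_frames)] for prm in params]
shifted = [clean[0], [f + 10.0 for f in clean[1]]]  # F2 measured 10 Hz too high

print(token_error(clean, params), token_error(shifted, params))
```

A token generated by the model itself has zero error, while a constant 10 Hz offset on one formant over 11 frames contributes 11 × 100 / 10 = 110 under this normalization.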

8. Methods

Estimating Model Parameters:

4. Genetic algorithm employs mutation, crossover (exchange of group of parameters), and elitism (best solutions retained in next generation).

5. Data partitioned into 20 folds for n-fold validation.

6. Performed 60 randomly-initialized starts in parameter estimation to get 60 points per phoneme.
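
The three genetic-algorithm ingredients named above (mutation, crossover of parameter groups, elitism) can be sketched as follows. This is a toy version under stated assumptions, not the authors' configuration: it fits only s1, p1, s2, p2 for one synthetic formant track with known targets, whereas the slides sum the error over all tokens. The population size, mutation rate, parameter bounds, and all numeric values are hypothetical.

```python
import math
import random


def d(t, s, p):
    # Sigmoid degree of articulation (sign convention assumed: s > 0 decays).
    return 1.0 / (1.0 + math.exp(s * (t - p)))


def trajectory(t, T_C1, T_V, T_C2, s1, p1, s2, p2):
    return d(t, s1, p1) * (T_C1 - T_V) + T_V + d(t, s2, p2) * (T_C2 - T_V)


TARGETS = (1800.0, 1000.0, 1600.0)   # hypothetical T_C1, T_V, T_C2 in Hz
TRUE = (30.0, 0.25, -30.0, 0.75)     # hypothetical s1, p1, s2, p2
TIMES = [k / 20 for k in range(21)]  # normalized time, t in [0, 1]
OBSERVED = [trajectory(t, *TARGETS, *TRUE) for t in TIMES]
BOUNDS = [(1.0, 60.0), (0.0, 0.5), (-60.0, -1.0), (0.5, 1.0)]


def error(ind):
    """Mean squared error between the synthetic track and the model."""
    sq = sum((o - trajectory(t, *TARGETS, *ind)) ** 2 for t, o in zip(TIMES, OBSERVED))
    return sq / (len(TIMES) - 1)


def mutate(ind, rng, rate=0.3):
    """Perturb each parameter with probability `rate`, clipped to its bounds."""
    out = []
    for gene, (lo, hi) in zip(ind, BOUNDS):
        if rng.random() < rate:
            gene += rng.gauss(0.0, 0.1 * (hi - lo))
        out.append(min(hi, max(lo, gene)))
    return out


def crossover(a, b, rng):
    """Exchange a group of parameters: the C1 pair (s1, p1) or the C2 pair."""
    return a[:2] + b[2:] if rng.random() < 0.5 else b[:2] + a[2:]


def run_ga(seed=0, pop_size=40, generations=60, n_elite=4):
    rng = random.Random(seed)
    pop = [[rng.uniform(lo, hi) for lo, hi in BOUNDS] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=error)
        history.append(error(pop[0]))
        elite = pop[:n_elite]            # elitism: best solutions kept unchanged
        parents = pop[: pop_size // 2]   # fitter half supplies the parents
        children = [
            mutate(crossover(rng.choice(parents), rng.choice(parents), rng), rng)
            for _ in range(pop_size - n_elite)
        ]
        pop = elite + children
    pop.sort(key=error)
    history.append(error(pop[0]))
    return pop[0], history


best, history = run_ga()
print(best, history[-1])
```

Elitism guarantees that the best error never increases from one generation to the next. Repeating `run_ga` with different seeds corresponds to the randomly initialized starts in item 6.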

9. Results

Histograms of vowel contribution:

Maximum contribution of vowel for both speaking styles (enforced minimum value of 0.6)

[Figure panels: clear speech, conversational speech]

9. Results

Target Estimation Results:

Estimated targets for vowels (60 points per phoneme)

9. Results

Target Estimation Results:

Estimated targets for C1 (60 points per phoneme)

9. Results

Target Estimation Results:

Vowel classification accuracy on training data: xx.x% (xx.x% CLR, xx.x% CNV)

Vowel classification accuracy on test data: xx.x% (xx.x% CLR, xx.x% CNV)

9. Results

Coarticulation Parameter Results:

• Error surface between estimated model and observed data as a function of s and p:

• A low error can be obtained for many values of s.
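
An error surface over s and p can be tabulated with a simple grid evaluation, sketched below. This is an illustration only: it uses a noiseless synthetic C1-to-vowel transition with hypothetical targets and grid ranges, so it shows how such a surface is computed rather than reproducing the result on this slide.

```python
import math


def d(t, s, p):
    # Sigmoid degree of articulation (sign convention assumed: s > 0 decays).
    return 1.0 / (1.0 + math.exp(s * (t - p)))


T_C1, T_V = 1800.0, 1000.0           # hypothetical targets in Hz
TRUE_S, TRUE_P = 30.0, 0.25          # hypothetical "true" coarticulation values
TIMES = [k / 20 for k in range(11)]  # first half of a token, t in [0, 0.5]
OBSERVED = [T_V + d(t, TRUE_S, TRUE_P) * (T_C1 - T_V) for t in TIMES]


def error(s, p):
    """Mean squared error of the C1-to-vowel transition for given s and p."""
    sq = sum((o - (T_V + d(t, s, p) * (T_C1 - T_V))) ** 2 for t, o in zip(TIMES, OBSERVED))
    return sq / (len(TIMES) - 1)


S_GRID = [float(s) for s in range(5, 65, 5)]
P_GRID = [k / 20 for k in range(1, 11)]
surface = {(s, p): error(s, p) for s in S_GRID for p in P_GRID}

# Grid point with the lowest error.
best = min(surface, key=surface.get)
# For each s, the lowest error reachable by adjusting p alone.
best_per_s = {s: min(surface[(s, p)] for p in P_GRID) for s in S_GRID}

print(best)  # → (30.0, 0.25)
for s in S_GRID:
    print(s, best_per_s[s])
```

Inspecting `best_per_s` shows how much the minimum error changes when s is fixed and only p is adjusted. On real, noisy formant tracks a flat profile here means s is poorly determined by a single token.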

9. Results

Coarticulation Parameter Results:

• As a result, s shows differences in mean for different consonants, but high variance:

[Figure: Mean and standard deviation of second-formant s1 values for 10 phonemes]

• Therefore, s values do not cluster well, and

• s values cannot be reliably extracted for a single CVC token.

10. Conclusions & Future Work

Conclusions:

1. Estimation of consonant and vowel targets can be performed reliably when estimating over a large number of CVCs.

2. Estimation of coarticulation parameter s cannot be performed reliably (yet) for a single CVC.

3. Therefore, formant targets cannot (yet) be reliably estimated for a single token, which is necessary to apply this work to automatic speech recognition.

10. Conclusions & Future Work

Future Work:

1. Determine and apply constraints on s, so that coarticulation parameters and formant targets can be reliably estimated for a single token.

2. Given estimated targets for a CVC, estimate the probability of these targets given each phoneme: p(target | phoneme)

3. Use these probabilities instead of the probabilities currently used in speech recognition, p(observed data at 10-msec frame | phoneme)

4. Expand to recognize arbitrary length phoneme sequence and use non-formant features.

Acknowledgements

This material is based upon work supported by the National Science Foundation under Grant IIS-091575.

Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

