Date post: | 19-Jan-2016 |
Towards the Recovery of Targets from Coarticulated Speech for Automatic Speech Recognition
John-Paul Hosom¹, Alexander Kain, Brian O. Bush
Center for Spoken Language Understanding (CSLU)
Department of Biomedical Engineering (BME)
Oregon Health & Science University (OHSU)
¹John-Paul Hosom is now with Sensory, Inc.
2 / 27
Outline
1. Introduction: Error Analysis of TIMIT ASR
2. Introduction: Hypothesis
3. Background: Characteristics of Clear Speech
4. Background: Formant Targets and Locus Theory
5. Objectives of Current Study
6. Corpus
7. Model
8. Methods
9. Results
10. Conclusions & Future Work
3 / 27
1. Introduction: Error Analysis of TIMIT
Phoneme ASR of TIMIT:
• HMM/ANN system trained on TIMIT, decoded with bigram language model [Hosom et al., 2010]
• Accuracy of 74% is high for HMM-based systems.
• Vowel substitutions account for 35% of errors, covering all distinctive-phonetic feature dimensions:
  – 62% of vowel substitutions have front/back error
  – 52% of vowel substitutions have tense/lax error
  – 68% of vowel substitutions have height error
  – 31% of vowel substitutions have vowel/cons. error
• Errors are not confined to specific type
4 / 27
2. Introduction: Hypothesis
Hypothesis:
• Our hypothesis is that a main cause of ASR errors is not the feature space or the classification technique (either of which might produce more distinct error patterns), but “noise” in the probability estimates.
• This noise is caused by variability in the features, which can be reduced by estimating phoneme targets instead of using the observed values.
• For now, we work in the formant space, although targets need not be limited to this space.
• Also, to control for speaking style, we are interested in both “clear” and “conversational” speech.
5 / 27
3. Characteristics of Clear Speech
Clear speech: observed midpoints of vowels, accuracy 90%
Conversational speech: observed midpoints of vowels, accuracy 73%
• “Clear” speech has expanded vowel space and longer phoneme durations
6 / 27
3. Characteristics of Clear Speech
• Using a “hybridization” algorithm that combines features of clear (CLR) and conversational (CNV) speech, together with perceptual testing, we have shown over several experiments that the most relevant features for intelligibility are the combination of spectrum and duration. [Kain, Amano-Kusumoto, and Hosom (2008); Kusumoto, Kain, Hosom, and van Santen (2007); Amano-Kusumoto and Hosom (2009)]
• This has led us to study a model of coarticulation, to quantitatively model the change of formants over time.
7 / 27
4. Formant Targets and Locus Theory
[From Klatt 1987, p. 753]
8 / 27
4. Formant Targets and Locus Theory
[Figure: /d/ /u/; axes: time, frequency]
• Most consonants (all except /j/, /l/, /r/, /w/) do not have visible formants
• They have “virtual” formants identified by coarticulation in the vowel.
9 / 27
4. Formant Targets and Locus Theory
[from Delattre et al., 1955 as reported in Johnson, 1997, p. 135]
10 / 27
4. Formant Targets and Locus Theory
Locus Theory Summary:
1. Vowels and consonants have formant targets; most consonants have “virtual” formants.
2. Coarticulation yields smooth change between targets when formants are visible.
3. If duration is too short, formants do not reach their targets, yielding undershoot.
4. Both the targets and the rate of change are important for intelligibility.
11 / 27
Outline
1. Introduction: Error Analysis of TIMIT ASR
2. Introduction: Hypothesis
3. Background: Characteristics of Clear Speech
4. Background: Formant Targets and Locus Theory
5. Objectives of Current Study
6. Corpus
7. Model
8. Methods
9. Results
10. Conclusions & Future Work
12 / 27
5. Objectives of Current Study
• Estimate each phoneme’s target instead of relying on observed data.
• Using targets will reduce the variance of features, yielding conversational (and clear) speech with feature overlap similar to that of clear speech.
• Given a GMM of target values (instead of observed values), compute the probability of each token’s estimated targets given each possible phoneme.
• Use a semi-Markov model to address the trajectory of features. In a real system, move from formants to another feature domain.
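The third objective, scoring a token's estimated targets against a GMM for each phoneme, can be sketched as below. This is a minimal illustration under stated assumptions, not the study's implementation: the diagonal-covariance form, the function names `gmm_log_likelihood` and `classify`, and the F1/F2 numbers in `models` are all hypothetical and are not taken from the corpus.

```python
import numpy as np

def gmm_log_likelihood(x, weights, means, variances):
    """Return log p(x | phoneme) for a diagonal-covariance GMM.

    x: (D,) target vector; weights: (K,); means, variances: (K, D).
    """
    x = np.asarray(x, dtype=float)
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    # Log density of each Gaussian component at x.
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    log_expo = -0.5 * np.sum((x - means) ** 2 / variances, axis=1)
    log_comp = np.log(weights) + log_norm + log_expo
    # Log-sum-exp over components for numerical stability.
    m = np.max(log_comp)
    return float(m + np.log(np.sum(np.exp(log_comp - m))))

def classify(x, models):
    """Pick the phoneme whose GMM gives the highest log-likelihood."""
    scores = {ph: gmm_log_likelihood(x, *params) for ph, params in models.items()}
    return max(scores, key=scores.get), scores

# Hypothetical single-component models over (F1, F2) targets in Hz.
models = {
    "i": ([1.0], [[300.0, 2300.0]], [[50.0 ** 2, 150.0 ** 2]]),
    "u": ([1.0], [[320.0, 900.0]], [[50.0 ** 2, 150.0 ** 2]]),
}
best_phoneme, log_scores = classify([310.0, 2200.0], models)
```

With targets in place of frame-level observations, `x` would be one vector per token rather than one per 10-msec frame.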
13 / 27
5. Objectives of Current Study
Mean and standard deviation of target values
14 / 27
6. Corpus
Corpus:
1. Male and female speaker
2. Sentences contain a neutral carrier phrase (5 total) followed by a target word (242 total)
3. Target words are common English CVC words with 23 initial and final consonants and 8 vowels.
4. All sentences spoken in both clear and conversational styles.
5. Two recordings per style of each sentence.
6. Formants and phoneme boundaries automatically estimated, manually corrected with verification.
15 / 27
7. Model
Formant trajectory model:
$$\hat{f}(t) = d(t; s_1, p_1)\,(T_{C1} - T_V) + T_V + d(t; s_2, p_2)\,(T_{C2} - T_V)$$
$$d(t; s, p) = \frac{1}{1 + e^{s(t - p)}}$$
• $\hat{f}(t)$ is the estimated formant trajectory over time t.
• $T_{C1}$, $T_V$, $T_{C2}$ are target formant values for C1, V, C2.
• $d(t; s, p)$ is the degree of articulation of C1 or C2
• $d(t; s, p)$ is a sigmoid function over time t.
• s is maximum slope of $d(t; s, p)$, p is position of s.
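A small numerical sketch of a trajectory model of this kind follows. It assumes the form f̂(t) = d(t; s1, p1)(T_C1 − T_V) + T_V + d(t; s2, p2)(T_C2 − T_V) with d(t; s, p) = 1 / (1 + e^{s(t − p)}); under that sign convention s1 is positive (C1 fades out) and s2 is negative (C2 fades in). The function names and all numeric values (targets in Hz, times in seconds) are illustrative assumptions, not values from the study.

```python
import numpy as np

def sigmoid_d(t, s, p):
    """Assumed sigmoid d(t; s, p) = 1 / (1 + exp(s * (t - p)))."""
    return 1.0 / (1.0 + np.exp(s * (t - p)))

def formant_trajectory(t, T_C1, T_V, T_C2, s1, p1, s2, p2):
    """Blend the three targets with two sigmoids, one per consonant."""
    d1 = sigmoid_d(t, s1, p1)  # contribution of C1, near 1 at the start
    d2 = sigmoid_d(t, s2, p2)  # contribution of C2, near 1 at the end
    return d1 * (T_C1 - T_V) + T_V + d2 * (T_C2 - T_V)

# Hypothetical second-formant CVC token, 300 ms long, sampled every 1 ms.
t = np.linspace(0.0, 0.3, 301)
f_hat = formant_trajectory(t, T_C1=1800.0, T_V=900.0, T_C2=1700.0,
                           s1=80.0, p1=0.05, s2=-80.0, p2=0.25)

# Same targets in a 100 ms token: the vowel target is not reached (undershoot).
t_short = np.linspace(0.0, 0.1, 101)
f_short = formant_trajectory(t_short, 1800.0, 900.0, 1700.0,
                             80.0, 0.03, -80.0, 0.07)
undershoot = f_short.min() - 900.0
```

In the longer token the mid-vowel value sits close to the vowel target, while the short token stays well above it, which is the undershoot described by locus theory.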
16 / 27
Formant trajectory model:
7. Model
17 / 27
Formant trajectory model:
7. Model
18 / 27
8. Methods
Estimating Model Parameters:
1. Two sets of parameters to estimate:
   (a) s1, s2, p1, p2 estimated on a per-token basis
   (b) TC1, TV, TC2 estimated on a per-token basis (independent of speaking style) and then averaged
2. For one token, error is:
$$\mathrm{Err} = \frac{1}{t_2 - t_1} \sum_{i=1}^{F} \sum_{t=t_1}^{t_2} \left( f_i(t) - \hat{f}_i(t) \right)^2, \quad F = 2$$
3. Genetic algorithm used to find best estimates. Fitness function is error summed over all tokens.
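A per-token error of this kind can be sketched as follows: squared differences between observed and modeled formant values, summed over formants and frames. The normalization by the number of frames, the `(F, n_frames)` array layout, and the name `token_error` are assumptions made for illustration.

```python
import numpy as np

def token_error(observed, estimated):
    """Squared error summed over formants and frames, divided by frame count.

    observed, estimated: arrays of shape (F, n_frames), one row per formant.
    """
    observed = np.asarray(observed, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    if observed.shape != estimated.shape:
        raise ValueError("observed and estimated must have the same shape")
    n_frames = observed.shape[1]
    return float(np.sum((observed - estimated) ** 2) / n_frames)

# Two formants, three frames; the model is off by 1 Hz everywhere.
obs = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])
err = token_error(obs, obs + 1.0)
```

Summing `token_error` over all tokens would give a fitness value of the kind described in item 3.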
19 / 27
8. Methods
Estimating Model Parameters:
4. Genetic algorithm employs mutation, crossover (exchange of a group of parameters), and elitism (best solutions retained in the next generation).
5. Data partitioned into 20 folds for cross-validation.
6. Performed 60 randomly initialized starts in parameter estimation to get 60 points per phoneme.
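The three ingredients named in item 4 (mutation, crossover, elitism) can be illustrated with a small generic genetic algorithm. This is a sketch, not the authors' optimizer: selecting parents from the better half, single-point crossover, Gaussian mutation, and every default value are assumptions, and the toy quadratic `sphere` stands in for the summed trajectory error.

```python
import numpy as np

def genetic_minimize(fitness, bounds, pop_size=40, generations=100,
                     n_elite=2, mutation_rate=0.2, mutation_scale=0.05, seed=0):
    """Minimize fitness(params) inside box bounds with a simple GA."""
    rng = np.random.default_rng(seed)
    low, high = np.asarray(bounds, dtype=float).T
    n_params = low.size
    pop = rng.uniform(low, high, size=(pop_size, n_params))
    for _ in range(generations):
        scores = np.array([fitness(ind) for ind in pop])
        pop = pop[np.argsort(scores)]        # best individuals first
        elite = pop[:n_elite].copy()         # elitism: carried over unchanged
        parents = pop[: pop_size // 2]       # assumed selection: better half
        children = []
        while len(children) < pop_size - n_elite:
            a, b = parents[rng.integers(len(parents), size=2)]
            child = a.copy()
            if n_params > 1:
                # Crossover: take a trailing group of parameters from b.
                cut = rng.integers(1, n_params)
                child[cut:] = b[cut:]
            # Mutation: perturb a random subset of parameters.
            mask = rng.random(n_params) < mutation_rate
            noise = rng.normal(0.0, mutation_scale, mask.sum())
            child[mask] += noise * (high - low)[mask]
            children.append(np.clip(child, low, high))
        pop = np.vstack([elite, np.array(children)])
    scores = np.array([fitness(ind) for ind in pop])
    best = int(np.argmin(scores))
    return pop[best], float(scores[best])

def sphere(x):
    """Toy objective with its minimum at (1, -2)."""
    return float((x[0] - 1.0) ** 2 + (x[1] + 2.0) ** 2)

best, best_score = genetic_minimize(sphere, bounds=[(-5, 5), (-5, 5)])
```

Running the function with different `seed` values is one way to obtain several randomly initialized starts, in the spirit of item 6.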
20 / 27
9. Results
Histograms of vowel contribution:
Maximum contribution of vowel for both speaking styles (enforced minimum value of 0.6)
[Panels: clear speech; conversational speech]
21 / 27
9. Results
Target Estimation Results:
Estimated targets for vowels (60 points per phoneme)
22 / 27
9. Results
Target Estimation Results:
Estimated targets for C1 (60 points per phoneme)
23 / 27
9. Results
Target Estimation Results:
Vowel classification accuracy on training data: xx.x% (xx.x% CLR xx.x% CNV)
Vowel classification accuracy on test data: xx.x% (xx.x% CLR xx.x% CNV)
24 / 27
9. Results
Coarticulation Parameter Results:
• Error surface between estimated model and observed data as a function of s and p:
[Figure: error surface over s and p]
• A low error can be obtained for many values of s.
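One way to tabulate an error surface over s and p is a plain grid evaluation, sketched below on a synthetic consonant-to-vowel transition. Everything here is assumed for illustration: the single-sigmoid transition, the targets of 1800 Hz and 900 Hz, the grid ranges, and the noise-free "observed" curve. It shows only how such a surface can be computed; how flat the valley is along s depends on real data.

```python
import numpy as np

def d(t, s, p):
    """Assumed sigmoid d(t; s, p) = 1 / (1 + exp(s * (t - p)))."""
    return 1.0 / (1.0 + np.exp(s * (t - p)))

def cv_transition(t, T_C, T_V, s, p):
    """Single consonant-to-vowel transition between two fixed targets."""
    return d(t, s, p) * (T_C - T_V) + T_V

# Synthetic observation generated from known parameters (s = 80, p = 0.05).
t = np.linspace(0.0, 0.15, 151)
observed = cv_transition(t, 1800.0, 900.0, 80.0, 0.05)

# Evaluate the mean squared error on a grid of candidate (s, p) pairs.
s_grid = np.linspace(20.0, 200.0, 37)
p_grid = np.linspace(0.0, 0.1, 21)
error = np.empty((s_grid.size, p_grid.size))
for i, s in enumerate(s_grid):
    for j, p in enumerate(p_grid):
        model = cv_transition(t, 1800.0, 900.0, s, p)
        error[i, j] = np.mean((observed - model) ** 2)

# Grid point with the lowest error.
i_best, j_best = np.unravel_index(np.argmin(error), error.shape)
s_best, p_best = s_grid[i_best], p_grid[j_best]
```

Plotting `error` as an image or contour map over `s_grid` and `p_grid` gives a surface of the kind the slide refers to.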
25 / 27
9. Results
Coarticulation Parameter Results:
• As a result, s shows differences in mean for different consonants, but high variance:
[Figure: mean and standard deviation of second-formant s1 values for 10 phonemes]
• Therefore, s values do not cluster well, and
• s values can not be reliably extracted for a single CVC token.
26 / 27
10. Conclusions & Future Work
Conclusions:
1. Estimation of consonant and vowel targets can be performed reliably when estimating over a large number of CVCs.
2. Estimation of coarticulation parameter s can not be performed reliably (yet) for a single CVC.
3. Therefore, formant targets can not (yet) be reliably estimated for a single token, which is necessary to apply this work to automatic speech recognition.
27 / 27
10. Conclusions & Future Work
Future Work:
1. Determine and apply constraints on s, so that coarticulation parameters and formant targets can be reliably estimated for a single token.
2. Given estimated targets for a CVC, estimate the probability of these targets given each phoneme: p(target | phoneme)
3. Use these probabilities instead of the probabilities currently used in speech recognition, p(observed data at 10-msec frame | phoneme)
4. Expand to recognize arbitrary-length phoneme sequences and use non-formant features.
28 / 27
Acknowledgements
This material is based upon work supported by the National Science Foundation under Grant IIS-091575.
Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.