University of California
Los Angeles
Rapid Speaker Normalization and Adaptation
with Applications to Automatic Evaluation
of Children’s Language Learning Skills
A dissertation submitted in partial satisfaction
of the requirements for the degree
Doctor of Philosophy in Electrical Engineering
by
Shizhen Wang
2010
© Copyright by
Shizhen Wang
2010
To my family
Table of Contents
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.1 Automatic Speech Recognition . . . . . . . . . . . . . . . . . . . . 1
1.1.1 Hidden Markov Models . . . . . . . . . . . . . . . . . . . . 1
1.1.2 Feature Extraction . . . . . . . . . . . . . . . . . . . . . . 6
1.1.3 Acoustic Modeling . . . . . . . . . . . . . . . . . . . . . . 7
1.1.4 Robustness Issues . . . . . . . . . . . . . . . . . . . . . . . 9
1.2 Speaker Normalization and Speaker Adaptation . . . . . . . . . . 10
1.2.1 Linear Frequency Warping . . . . . . . . . . . . . . . . . . 13
1.2.2 Maximum Likelihood Linear Regression . . . . . . . . . . . 14
1.3 Children’s Speech Recognition . . . . . . . . . . . . . . . . . . . . 15
1.4 Organization of the Dissertation . . . . . . . . . . . . . . . . . . . 17
2 Regression-Tree based Spectral Peak Alignment . . . . . . . . . 18
2.1 Frequency Warping as Linear Transformation . . . . . . . . . . . 18
2.1.1 Frequency Warping for MFCC . . . . . . . . . . . . . . . . 18
2.1.2 Approximated Linearization of Frequency Warping . . . . 20
2.1.3 Definition of the Frequency Warping Matrix . . . . . . . . 22
2.1.4 Linear Frequency Warping Functions . . . . . . . . . . . . 23
2.2 Alignment of Spectral Peaks . . . . . . . . . . . . . . . . . . . . . 24
2.2.1 Choice of the Reference Speaker . . . . . . . . . . . . . . . 24
2.2.2 Levels of Mismatch in Formant Structure . . . . . . . . . . 25
2.3 Speaker Adaptation Using Spectral Peak Alignment . . . . . . . . 28
2.4 Regression-tree based Speaker Adaptation . . . . . . . . . . . . . 30
2.4.1 Global vs. Regression-tree based Peak Alignment . . . . . 30
2.4.2 Phoneme based Regression Tree . . . . . . . . . . . . . . . 31
2.4.3 Gaussian Mixture based Regression Tree . . . . . . . . . . 31
2.5 Integration of Peak Alignment with MLLR . . . . . . . . . . . . 32
2.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 34
2.6.1 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 34
2.6.2 Comparison of Global and Regression-tree based PAA ver-
sus MLLR and VTLN . . . . . . . . . . . . . . . . . . . . 36
2.6.3 Discussion on Comparison of RM1 and TIDIGITS . . . . . 40
2.6.4 Performance of the Linearization Approximation . . . . . 41
2.6.5 Comparison of PAA, PSAT and MLLR-SAT . . . . . . . . 42
2.6.6 Comparison of PSAT and MLLR-SAT . . . . . . . . . . . 44
2.6.7 Comparison of Supervised and Unsupervised Adaptation . 45
2.6.8 Significance Analysis . . . . . . . . . . . . . . . . . . . . . 48
2.7 Summary and Conclusion . . . . . . . . . . . . . . . . . . . . . . 50
3 Speaker Normalization based on Subglottal Resonances . . . . 52
3.1 Subglottal Acoustic System and Its Coupling to Vocal Tract . . . 52
3.1.1 Subglottal Acoustic System . . . . . . . . . . . . . . . . . 52
3.1.2 Coupling between Subglottal and Supraglottal Systems . . 53
3.1.3 Effects of Coupling to Subglottal System . . . . . . . . . . 54
3.1.4 Subglottal Resonances and Phonological Distinctive Features 55
3.2 Estimating the Second Subglottal Resonance . . . . . . . . . . . . 57
3.2.1 Estimation based on Frequency Discontinuity . . . . . . . 57
3.2.2 Estimation based on Joint Frequency and Energy Measure-
ment . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
3.3 Calibration of the Sg2 Estimation Algorithm . . . . . . . . . . . . 62
3.4 Variability of Subglottal Resonance Sg2 . . . . . . . . . . . . . . . 67
3.4.1 The Bilingual Database . . . . . . . . . . . . . . . . . . . 67
3.4.2 Cross-content and Cross-language Variability . . . . . . . . 69
3.4.3 Implications of Sg2 Invariability . . . . . . . . . . . . . . . 70
3.5 Experiments with Linear Frequency Warping . . . . . . . . . . . . 71
3.5.1 Comparison of VTLN and Sg2 Frequency Warping . . . . 72
3.5.2 Effectiveness of Sg2 Normalization . . . . . . . . . . . . . 72
3.5.3 Comparison of Vowel Content Dependency . . . . . . . . . 75
3.5.4 Performance on RM1 Database . . . . . . . . . . . . . . . 76
3.5.5 Cross-language Speaker Normalization . . . . . . . . . . . 78
3.6 Nonlinear Frequency Warping . . . . . . . . . . . . . . . . . . . . 80
3.6.1 Mel-shift based Frequency Warping . . . . . . . . . . . . . 80
3.6.2 Bark-shift based Frequency Warping . . . . . . . . . . . . 81
3.7 Experiments with Nonlinear Frequency Warping . . . . . . . . . . 83
3.7.1 Sg2 based Nonlinear Frequency Warping . . . . . . . . . . 83
3.7.2 Experimental Setup . . . . . . . . . . . . . . . . . . . . . . 84
3.7.3 Experimental Results . . . . . . . . . . . . . . . . . . . . . 86
3.8 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . 88
4 Automatic Evaluation of Children’s Language Learning Skills . . . . 90
4.1 Technology based Assessment of Language and Literacy . . . . . . 90
4.2 Blending Tasks and Database Collections . . . . . . . . . . . . . . 91
4.2.1 Blending Tasks for Phonemic Awareness . . . . . . . . . . 91
4.2.2 Database Collections . . . . . . . . . . . . . . . . . . . . . 92
4.3 Human Evaluations and Discussions . . . . . . . . . . . . . . . . . 94
4.3.1 Web-based Teacher’s Assessment . . . . . . . . . . . . . . 94
4.3.2 Inter-correlation of the Assessment . . . . . . . . . . . . . 94
4.3.3 Discussions on the Blending Target Words . . . . . . . . . 95
4.4 Automatic Evaluation System . . . . . . . . . . . . . . . . . . . . 96
4.4.1 Overall System Flowchart . . . . . . . . . . . . . . . . . . 96
4.4.2 Disfluency Detection . . . . . . . . . . . . . . . . . . . . . 97
4.4.3 Accent Detection . . . . . . . . . . . . . . . . . . . . . . . 99
4.4.4 Pronunciation Dictionary . . . . . . . . . . . . . . . . . . . 103
4.4.5 Accuracy and Smoothness Measurements . . . . . . . . . . 104
4.4.6 Overall quality measurement . . . . . . . . . . . . . . . . . 105
4.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . 106
4.6 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . 107
5 Summary and Future Work . . . . . . . . . . . . . . . . . . . . . . 108
5.1 Summary and Discussion . . . . . . . . . . . . . . . . . . . . . . . 108
5.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
List of Figures
1.1 Diagram of MFCC feature extraction. . . . . . . . . . . . . . . . 8
1.2 Spectrograms of clean speech utterance from a male speaker (top)
saying two digits “eight two” and the same utterance corrupted
with additive white noise at 5 dB (bottom). . . . . . . . . . . . . 10
1.3 Spectrograms of an utterance from an adult speaker (top) say-
ing one digit “zero”, and the same sentence from a child speaker
(bottom). . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.1 Diagram of MFCC feature extraction without (X^c) and with (Y^c) frequency warping. . . . . . . . . . . . . . . . . . . . . . . . . . 19
2.2 Mel-frequency filter banks and the approximation made in the lin-
earization of frequency warping in [30]: each triangular filter is
represented only with its central peak value (the circle point) . . . 22
2.3 Illustration of three levels of formant estimation (global, phoneme, and state). Boundaries are obtained through forced alignment: dashed lines mark the boundaries of phonemes and dotted lines mark the boundaries of states . . . . . . . . . . . . . . . . . . . . 26
2.4 F3 warping factors for /IY/, /AA/, /UW/, and the global average
for 10 test speakers (6 male and 4 female adults) from RM1 . . . 27
2.5 F3 warping factors for /IY/, /AH/, /UW/, and the global average
for 10 test speakers (5 boys and 5 girls) from TIDIGITS . . . . . 28
2.6 An example of regression tree using combined phonetic knowl-
edge and data-driven techniques for the phoneme-based approach.
Phonemes are firstly categorized based on phonetic knowledge, and
then further clustered according to their estimated F3 values. . . . 33
2.7 The speaker adaptation algorithm using regression-tree based spec-
tral peak alignment for both supervised and unsupervised adapta-
tions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.8 Performance of VTLN, MLLR, G-PA, GPAA, PPAA and MPAA
using RM1 for supervised adaptation. . . . . . . . . . . . . . . . . 38
2.9 Performance of VTLN, MLLR, G-PA, GPAA, PPAA and MPAA
using TIDIGITS for supervised adaptation. . . . . . . . . . . . . . 39
2.10 Performance of MPAA, MLLR, PSAT and MLLR-SAT using RM1
for supervised adaptation. . . . . . . . . . . . . . . . . . . . . . . 43
2.11 Performance of MPAA, MLLR, PSAT and MLLR-SAT using TIDIG-
ITS for supervised adaptation. . . . . . . . . . . . . . . . . . . . . 44
3.1 Schematic model of vocal tract with acoustic coupling to the tra-
chea through the glottis (a) and the equivalent circuit model (b).
Adapted from [62]. . . . . . . . . . . . . . . . . . . . . . . . . . . 54
3.2 Spectrogram for the word boy from an eight-year-old girl. The
second subglottal resonance Sg2 for this speaker is at 1920 Hz. . 55
3.3 Illustration of the F2 discontinuity caused by Sg2. The bold solid
line corresponds to the most prominent spectral peak (F2), which
has a jump in frequency and a decrease in amplitude when F2 is
crossing the subglottal resonance Sg2. The dotted line represents
the Sg2 pole, which varies somewhat in frequency and amplitude
when F2 is nearby. The horizontal thin solid line represents the
Sg2 zero, which is roughly constant. Adapted from [62]. . . . . . 56
3.4 Illustration of the relative positions of vowel formants F1 (·), F2
(+) and F3 (x) and the subglottal resonances (Sg1, Sg2 and Sg3)
for an adult male speaker. For the vowels /i, I, E, æ/ F2 > Sg2, and
they are therefore [-back]. For the vowels /a, 2, O, U, u/ F2 < Sg2,
and they are therefore [+back]. Adapted from [67]. . . . . . . . . 58
3.5 An example of the detection algorithm. . . . . . . . . . . . . . . . 59
3.6 Example of the joint estimation method where F2 discontinuity
and E2 attenuation correspond to the same location (frame 38).
Eq. (3.3) is used to estimate Sg2. . . . . . . . . . . . . . . . . . . 62
3.7 Example when there is a discrepancy between locations of F2 dis-
continuity (not detectable) and E2 attenuation (at frame 51). The
average F2 value within the dotted box is then used as the Sg2
estimate. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
3.8 Comparison of Sg2 estimates for the two speakers in Table 3.2, top
panel for speaker 1 and bottom panel for speaker 2. . . . . . . . . 67
3.9 Average within-speaker Sg2 standard deviations and the COVs
against contents and repetitions. . . . . . . . . . . . . . . . . . . . 70
3.10 Cross-language within-speaker COV of Sg2 for 10 boys and 10 girls. 71
3.11 Vowel formants F1 (·), F2 (+) and F3 (x) before and after VTLN
(in circles) and Sg2-based (in squares) warping for a nine-year-old
girl’s vowels. The lines ‘Sg1’, ‘Sg2’ and ‘Sg3’ are the reference
subglottal resonances from the same speaker as in Fig. 3.4. . . . . 73
3.12 Vowel formants F1 (·), F2 (+) and F3 (x) from the reference
speaker (Fig. 3.4) versus those from the test speaker (Fig. 3.11)
before and after warping (VTLN in circles, Sg2 in squares). The
dotted line is y = x which means perfect match between reference
and test speakers. . . . . . . . . . . . . . . . . . . . . . . . . . . . 74
3.13 Speaker normalization performance on TIDIGITS with various amounts of adaptation data. . . . . . . . . . . . . . . . . . . . . 75
3.14 Performance comparison of VTLN, F3 and Sg2D2 using one adap-
tation digit with various vowel content. . . . . . . . . . . . . . . 77
3.15 Piecewise bark shift warping function, where α > 0 shifts the Bark
scale upward, α < 0 shifts downward, and α = 0 means no warping. 83
4.1 Flowchart of the automatic evaluation system for the blending tasks. 97
4.2 An example of the disfluency detection network for a syllable
blending task word ‘peptic’, where START and END are the net-
work enter and exit points, respectively. . . . . . . . . . . . . . . . 99
List of Tables
2.1 Word recognition accuracy using RM1 for supervised adaptation. 47
2.2 Word recognition accuracy using RM1 for unsupervised adaptation. 47
2.3 Word recognition accuracy using TIDIGITS for supervised adap-
tation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.4 Word recognition accuracy using TIDIGITS for unsupervised adap-
tation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
2.5 Significance analysis of performance improvements of MPAA over MLLR using RM1 for supervised adaptation . . . . . . . . . . . . 49
2.6 Significance analysis of performance improvements of MPAA over MLLR using TIDIGITS for supervised adaptation . . . . . . . . . 49
3.1 Comparison of Sg2 estimates for two algorithms over various vowel
contents, where Sg2M is the manual measurement from speech
spectrum, and Sg2Acc is the ‘ground truth’ measurement from the
accelerometer signal. For each algorithm the average Sg2 estimates
(Hz) are shown (with standard deviations in parentheses). The two
speakers with a * are those used for calibration. . . . . . . . . . . 65
3.2 Detailed comparison of Sg2 estimates for the two algorithms on
two speakers. For vowels above the double line, there are no dis-
continuities in the F2 trajectory, and Sg2D1 uses the mean F2 as
Sg2 while Sg2D2 uses Eq. (3.2) ( ˜Sg2) to make an estimate; for
vowels below the double line, the F2 discontinuity is detectable,
and Sg2D1 uses Eq. (3.1) while Sg2D2 uses Eq. (3.3). The row
‘Avg.(std)’ shows the mean (and standard deviation) for each al-
gorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
3.3 Performance comparison (word recognition accuracy) on RM1 with
one adaptation utterance. . . . . . . . . . . . . . . . . . . . . . . 78
3.4 Performance comparison (word recognition accuracy) of VTLN
and Sg2 normalization using English (four words) and Spanish
(five words) adaptation data. The acoustic models were trained
and tested using English data. . . . . . . . . . . . . . . . . . . . . 80
3.5 WER on TIDIGITS using MFCC features with varying normal-
ization data from 1 to 15 digits. . . . . . . . . . . . . . . . . . . . 87
3.6 WER on TIDIGITS using PLPCC features with varying normal-
ization data from 1 to 15 digits. . . . . . . . . . . . . . . . . . . . 87
3.7 WER on TBall children’s data using MFCC and PLPCC features
with 3 normalization words. . . . . . . . . . . . . . . . . . . . . . 88
4.1 An example of the TBall blending tasks: audio prompts are pre-
sented and a child is asked to orally blend them into a whole word.
A one-second silence (SIL) is used within the prompts to separate
each sound. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
4.2 An example of the TBall segmentation tasks: audio prompts are
presented and a child is asked to orally segment them into parts. . 92
4.3 Target words for the blending tasks. . . . . . . . . . . . . . . . . . 93
4.4 Speaker distribution by native language and gender. . . . . . . . . 93
4.5 Average inter-evaluator correlation on pronunciation accuracy, smooth-
ness and overall evaluations on three blending tasks. . . . . . . . . 95
4.6 Pronunciation variants analysis for consonants and vowels on a
Spanish-accent English database, with the percentage of occur-
rence in the analysis database shown in the parentheses. Entries
with a trailing asterisk are those change patterns not predicted from
theories. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.7 Average correlation between ASR and teacher evaluations on pro-
nunciation accuracy, smoothness and overall qualities for three
blending tasks. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
Acknowledgments
I am deeply grateful to my advisor, Dr. Abeer Alwan, for her intellectual guidance, gracious support, and encouragement throughout my study at UCLA. My sincere gratitude also goes to Professors Richard D. Wesel, Paulo Tabuada, and Mark H. Hansen for serving on my doctoral committee and for their interest in my research.
I would also like to extend my gratitude to Dr. Xiaodong Cui and Dr. Steven M. Lulich for their insightful suggestions and valuable comments. I have greatly benefited from collaborations with Dr. Cui and Dr. Lulich, and long discussions with each of them directly led to some of the ideas presented in this dissertation.
I am very thankful to my family: my father, mother, wife, and daughter. They have always been there by my side and encouraged me to do my best. Without their love, support, and encouragement, this dissertation would not have been possible.
Thanks go to my SPAPL labmates, Dr. Markus Iseli, Dr. Sankaran Panchapagesan, Jonas, Yen, Wei, Gang, Harish, and many others, for their invaluable help and friendship. I would also like to thank all my friends for being there for me over the years.
This work was supported in part by NSF Grant No. 0326214, and parts of the dissertation have been published in the papers listed under Publications.
Vita
1998–2002 B.S. (Electrical Engineering), Shandong University, China
2002–2005 M.S. (Electrical Engineering), Tsinghua University, China
2005–2010 Ph.D. (Electrical Engineering), UCLA, Los Angeles, California
Publications
S. Wang, A. Alwan and S. M. Lulich, “Automatic detection of the second sub-
glottal resonance and its application to speaker normalization,” J. Acoust. Soc.
Am., 126(6): 3268-3277, 2009.
S. Wang, P. Price, Y.-H. Lee and A. Alwan, “Measuring children’s phonemic
awareness through blending tasks,” in Proc. of SLaTE workshop 2009.
S. Wang, Y.-H. Lee and A. Alwan, “Bark-shift based nonlinear speaker normal-
ization using the second subglottal resonance,” in Proc. of Interspeech 2009, pp.
1619-1622.
S. Wang, A. Alwan and S. M. Lulich, “A reliable technique for detecting the
second subglottal resonance and its use in cross-language speaker adaptation,”
in Proc. of Interspeech 2008, pp. 1717-1720.
S. Wang, A. Alwan and S. M. Lulich, “Speaker normalization based on subglottal
resonances,” in Proc. of ICASSP 2008, pp. 4277-4280, 2008.
S. Wang, X. Cui and A. Alwan, “Speaker Adaptation with Limited Data using
Regression-Tree based Spectral Peak Alignment,” IEEE Trans. on Audio, Speech
and Language Processing, Vol. 15, pp. 2454-2464, 2007.
S. Wang, P. Price, M. Heritage and A. Alwan, “Automatic Evaluation of Chil-
dren’s Performance on an English Syllable Blending Task,” in Proc. of SLaTE
workshop 2007.
S. Wang, X. Cui and A. Alwan, “Rapid Speaker Adaptation using Regression-
Tree based Spectral Peak Alignment,” in Proc. of ICSLP, pp. 1479-1482, 2006.
Abstract of the Dissertation
Rapid Speaker Normalization and Adaptation
with Applications to Automatic Evaluation
of Children’s Language Learning Skills
by
Shizhen Wang
Doctor of Philosophy in Electrical Engineering
University of California, Los Angeles, 2010
Professor Abeer Alwan, Chair
This dissertation investigates speaker variation issues in automatic speech recog-
nition (ASR), with a focus on rapid speaker normalization and adaptation meth-
ods using limited enrollment data from the speaker. Investigations are carried
out in the direction of reducing spectral variations through frequency warping.
Two methods are developed, one based on the supraglottal (vocal tract) res-
onances (formants), and the other on resonances from subglottal airways. The
first method attempts to reshape (warp) the spectrum by aligning corresponding
formant peaks. Since there are various levels of variations in formant struc-
tures, regression-tree based phoneme- and state-level spectral peak alignment is
studied for rapid speaker adaptation using linearization of the vocal tract length
normalization (VTLN) technique. This method is investigated in a maximum
likelihood linear regression (MLLR)-like framework, taking advantage of both
the efficiency of frequency warping (VTLN) and the reliability of statistical es-
timations (MLLR). Two different regression classes are investigated: one based
on phonetic classes (using combined knowledge and data-driven techniques) and
the other based on Gaussian mixture classes.
The second approach utilizes subglottal resonances, which have been shown to
affect spectral properties of speech sounds. A reliable algorithm is developed to
automatically estimate the second subglottal resonance (Sg2) from speech sig-
nals. The algorithm is calibrated on children’s speech data with simultaneous
accelerometer recordings from which Sg2 frequencies can be directly measured.
A cross-language study with bilingual Spanish-English children is performed to
investigate whether Sg2 frequencies are independent of speech content and lan-
guage. The study verifies that Sg2 is approximately constant for a given speaker
and thus can be a good candidate for limited data speaker normalization and
cross-language adaptation. A speaker normalization method is then presented
using Sg2.
As an application, ASR techniques are applied to automatically evaluate
children’s phonemic awareness through three blending tasks (phoneme blending,
onset-rhyme blending and syllable blending). The system incorporates speaker
normalization, disfluency detection and Spanish accent detection, together with
speech recognition to assess the overall quality of children’s speech productions.
CHAPTER 1
Introduction
1.1 Automatic Speech Recognition
Automatic Speech Recognition (ASR) aims to decode an acoustic signal X = X_1 X_2 ⋯ X_T into a word sequence Ŵ = w_1 w_2 ⋯ w_m, with the goal of making Ŵ close to the original word sequence W. The most common criterion is to maximize the posterior probability P(W|X) for the given observation X, that is:

Ŵ = argmax_W P(W|X)   (1.1)

Using Bayes' theorem, the above equation can be written as:

Ŵ = argmax_W P(W|X) = argmax_W [ P(W) P(X|W) / P(X) ] = argmax_W P(W) P(X|W)   (1.2)
where P(X) is omitted because its value is fixed for all word sequences. P(W) is the language model probability of the word sequence W, while P(X|W) is the acoustic model probability of generating the observation sequence X given the word sequence W.
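As a toy numerical illustration of Eq. (1.2) (the candidate word sequences and all scores below are hypothetical, not from the dissertation), a decoder ranks candidates by the sum of the language-model and acoustic-model log-probabilities, dropping the constant P(X):

```python
import math

# Hypothetical language-model and acoustic-model log-probabilities
# for two candidate word sequences (illustrative numbers only).
candidates = {
    "eight two": {"log_p_w": math.log(0.02), "log_p_x_given_w": -41.0},
    "ate too":   {"log_p_w": math.log(0.001), "log_p_x_given_w": -40.5},
}

def decode(cands):
    # argmax_W  log P(W) + log P(X|W); P(X) is fixed and therefore omitted
    return max(cands, key=lambda w: cands[w]["log_p_w"] + cands[w]["log_p_x_given_w"])

print(decode(candidates))
```

Here the language-model prior outweighs the small acoustic advantage of the homophone, so "eight two" is selected.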
1.1.1 Hidden Markov Models
The most widely used ASR acoustic models are hidden Markov models (HMMs). An HMM [1] is a very powerful statistical method for characterizing observed
data samples, with the assumption that the data samples can be well character-
ized as a parametric first-order Markov random process. An HMM is basically
a Markov chain, except that the output observation is probabilistically, instead
of deterministically, generated according to an output probability function asso-
ciated with each state, and thus there is no one-to-one correspondence between
the observation sequence and the state sequence.
Formally speaking, an HMM (λ) is defined by:
λ = (A,B, π) (1.3)
where:
• A = {a_ij} is a transition probability matrix, where a_ij is the probability of transitioning from state i at time t − 1 to state j at time t, i.e.,

a_ij = P(s_t = j | s_{t−1} = i)   (1.4)

• B = {b_j(X_t)} is an output probability matrix, where b_j(X_t) is the probability of emitting observation X_t given that the state is j, i.e.,

b_j(X_t) = P(X_t | s_t = j)   (1.5)

The probability function b_j(X_t) can be either a discrete probability mass function (PMF) or a continuous probability density function (PDF).

• π = {π_j} is an initial (at time t = 0) state distribution, where

π_j = P(s_0 = j)   (1.6)
There are three fundamental problems related to HMMs [2]:
• Probability evaluation problem: Given a model λ and an observation se-
quence X, what is the probability P (X|λ)?
• State sequence decoding problem: Given a model λ and an observation
sequence X, what is the most likely state sequence S that generates the
observations?
• Parameter estimation problem: Given a model λ and a set of observations, how can one modify the parameters to maximize the joint probability ∏_X P(X|λ)?
The first problem, the probability evaluation problem, can be efficiently solved
using the forward algorithm [3], which recursively calculates the forward probability, defined as:

α_t(i) = P(X_1^t, s_t = i | λ)   (1.7)

which is the probability of being in state i at time t and generating the partial observation X_1^t = X_1 X_2 ⋯ X_t. α_t(i) can be calculated recursively as:

α_t(j) = [ Σ_i α_{t−1}(i) a_ij ] b_j(X_t)   (1.8)

with initialization

α_1(i) = π_i b_i(X_1)   (1.9)

and termination

P(X|λ) = Σ_i α_T(i)   (1.10)
The second problem, the state sequence decoding problem, can be solved with
the Viterbi algorithm [4]. The Viterbi algorithm can be viewed as a modified for-
ward algorithm, where instead of summing up probabilities from different paths,
only the path with the highest probability is selected. Define φ_t(i) as the probability of the most likely state sequence that generates the partial observation X_1^t and ends in state i at time t, that is:

φ_t(i) = max_{S_1^{t−1}} P(X_1^t, S_1^{t−1}, s_t = i | λ)   (1.11)

which can be recursively calculated as:

φ_t(j) = max_i [ φ_{t−1}(i) a_ij ] b_j(X_t)   (1.12)

with initialization

φ_1(i) = π_i b_i(X_1)   (1.13)

and termination

V = max_i φ_T(i)   (1.14)

where V is the likelihood score of the best state sequence, which can then be obtained through backtracking.
The third problem, the parameter estimation problem, can be solved using the
Baum-Welch algorithm [3], also known as the forward-backward algorithm. Similarly to the forward probability, the backward probability is defined as:

β_t(i) = P(X_{t+1}^T | s_t = i, λ)   (1.15)

That is, β_t(i) is the probability of generating the partial observation from time t+1 to the end (X_{t+1}^T), given that the HMM is in state i at time t. The recursion for β_t(i) is:

β_t(i) = Σ_j a_ij b_j(X_{t+1}) β_{t+1}(j)   (1.16)

with initialization

β_T(i) = 1   (1.17)
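The backward recursion of Eqs. (1.15)–(1.17) can be sketched the same way (the two-state HMM is again made up); as a sanity check, Σ_i π_i b_i(X_1) β_1(i) must equal the forward probability P(X|λ):

```python
def backward(A, B, obs):
    """Backward algorithm: returns beta_t(i) for all t for a discrete-output HMM."""
    n = len(A)
    # Initialization: beta_T(i) = 1                         (Eq. 1.17)
    beta = [[1.0] * n]
    # Recursion (backwards in time):
    # beta_t(i) = sum_j a_ij * b_j(X_{t+1}) * beta_{t+1}(j) (Eq. 1.16)
    for x in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, [sum(A[i][j] * B[j][x] * nxt[j] for j in range(n))
                        for i in range(n)])
    return beta  # beta[t][i], t = 0 .. T-1

# Made-up two-state HMM with two discrete symbols
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.9, 0.1], [0.2, 0.8]]
obs = [0, 1, 0]
beta = backward(A, B, obs)
# Consistency check: sum_i pi_i * b_i(X_1) * beta_1(i) = P(X|lambda)
p = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(2))
print(p)
```

The forward and backward probabilities computed this way are exactly the quantities combined in the occupancy statistics γ_t(i, j) and ζ_t(j, k) used for re-estimation below.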
The HMM parameters can be iteratively refined by maximizing the likelihood
P (X|λ). According to the expectation maximization (EM) algorithm [5], the
maximization process is equivalent to maximizing the following auxiliary function:
Q(λ, λ′) = E_{S|X,λ} [ log P(S, X | λ′) ]   (1.18)
         = Σ_S P(S|X, λ) log P(S, X | λ′)   (1.19)
The maximization of Q(λ, λ′) can be done by setting its derivative with respect to λ′ to zero, which has closed-form solutions for both discrete output probability functions and continuous output probability functions in which each PDF is represented as a mixture of Gaussians. For example, in the Gaussian mixture case, we have:

b_j(X_t) = Σ_{k=1}^{M} w_jk N(X_t; μ_jk, Σ_jk)   (1.20)

where N(X_t; μ_jk, Σ_jk) denotes a single Gaussian density function with mean μ_jk and covariance matrix Σ_jk for state j, and M is the number of Gaussian mixtures. The weights w_jk satisfy Σ_k w_jk = 1. The parameter re-estimation formulae are as follows:

π′_i = Σ_j γ_1(i, j) / [ Σ_i Σ_j γ_1(i, j) ]   (1.21)

a′_ij = Σ_t γ_t(i, j) / [ Σ_t Σ_j γ_t(i, j) ]   (1.22)

μ′_jk = Σ_t ζ_t(j, k) X_t / Σ_t ζ_t(j, k)   (1.23)

Σ′_jk = Σ_t ζ_t(j, k) (X_t − μ′_jk)(X_t − μ′_jk)^T / Σ_t ζ_t(j, k)   (1.24)

w′_jk = Σ_t ζ_t(j, k) / [ Σ_t Σ_k ζ_t(j, k) ]   (1.25)
where γ_t(i, j) is the probability of taking the transition from state i to state j at time t, given the observation X:

γ_t(i, j) = P(s_{t−1} = i, s_t = j | X, λ)   (1.26)
          = P(s_{t−1} = i, s_t = j, X | λ) / P(X|λ)   (1.27)
          = α_{t−1}(i) a_ij b_j(X_t) β_t(j) / Σ_i α_T(i)   (1.28)

and ζ_t(j, k) is the probability of being in state j and Gaussian mixture component k at time t, given the observation X:

ζ_t(j, k) = P(s_t = j, ξ_t = k | X, λ)   (1.29)
          = [ Σ_i α_{t−1}(i) a_ij ] w_jk b_jk(X_t) β_t(j) / P(X|λ)   (1.30)

where b_jk(X_t) = N(X_t; μ_jk, Σ_jk).
1.1.2 Feature Extraction
According to the linear source-filter theory of speech production, a speech signal can be viewed as a source signal passed through a linear filter. The source represents the air flow at the vocal cords, which is a periodic signal for voiced sounds (with the inverse of the period known as the fundamental frequency F0) and aperiodic noise for unvoiced sounds. The filter represents the resonances (poles, also known as formants) and anti-resonances (zeros) of the vocal tract. The characteristics of the filter provide more discriminative information about phonemes, and are thus heavily relied upon for sound classification, whether by humans or by machines.
Feature extraction decomposes the source and filter functions and parameterizes the raw speech signal into a sequence of feature vectors. Commonly used features include linear predictive cepstral coefficients (LPCC) [6], perceptual linear prediction (PLP) [7], and Mel-frequency cepstral coefficients (MFCC) [8]. MFCCs are the most widely used because of their robust performance across various conditions.
Some common pre-processing operations are typically applied before feature extraction. First, pre-emphasis is applied to boost high frequencies through the first-order difference equation

s′_n = s_n − k s_{n−1}   (1.31)

where s_n denotes the speech samples and k is the pre-emphasis coefficient, in the range 0 ≤ k < 1. Since speech signals are non-stationary, the signal is then segmented into short frames (about 20 ms long) and processed frame by frame. To attenuate discontinuities at the window boundaries, a Hamming window is usually applied to taper the samples in each window.
MFCC is defined as the real cepstrum of a windowed signal derived from
its Fourier transform through a Mel-frequency filter bank, which approximates
the frequency resolution of the auditory system. Fig. 1.1 shows a diagram for
computing MFCC features. A discrete Fourier transform (DFT) is applied to each frame of the windowed signal to transform it from the time domain into the frequency domain. A triangular Mel-frequency filter bank is then applied to the DFT magnitude, followed by a logarithm to compress the dynamic range. A discrete cosine transform (DCT) is then performed to decorrelate the log filter-bank outputs. First- and second-order derivatives are usually appended to the raw MFCC features to account for spectral dynamics.
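The pipeline of Fig. 1.1 can be sketched with NumPy as follows. This is a simplified illustration: the frame size, FFT length, filter count, and coefficient count below are typical choices rather than the exact configuration used in this dissertation, and the derivative features are omitted:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc(signal, fs, n_filters=26, n_ceps=13, frame_len=0.02, frame_step=0.01, k=0.97):
    # Pre-emphasis: s'_n = s_n - k * s_{n-1}   (Eq. 1.31)
    s = np.append(signal[0], signal[1:] - k * signal[:-1])
    # Framing (about 20 ms frames) and Hamming windowing
    flen, fstep = int(frame_len * fs), int(frame_step * fs)
    n_frames = 1 + max(0, (len(s) - flen) // fstep)
    frames = np.stack([s[i * fstep:i * fstep + flen] for i in range(n_frames)])
    frames *= np.hamming(flen)
    # DFT magnitude
    nfft = 512
    mag = np.abs(np.fft.rfft(frames, nfft))
    # Triangular Mel-frequency filter bank, equally spaced on the Mel scale
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(fs / 2.0), n_filters + 2)
    bins = np.floor((nfft + 1) * mel_to_hz(mels) / fs).astype(int)
    fbank = np.zeros((n_filters, nfft // 2 + 1))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fbank[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    # Logarithm of filter-bank energies (floored to avoid log(0))
    logfb = np.log(np.maximum(mag @ fbank.T, 1e-10))
    # DCT (type II) to decorrelate, keeping the first n_ceps coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return logfb @ dct.T

fs = 8000
t = np.arange(fs) / fs
feats = mfcc(np.sin(2 * np.pi * 440 * t), fs)
print(feats.shape)  # (number of frames, n_ceps)
```

With a one-second signal at 8 kHz, 20 ms frames, and a 10 ms step, this yields 99 frames of 13 cepstral coefficients each.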
1.1.3 Acoustic Modeling
Acoustic modeling is critical to ASR performance and is arguably the central part of an ASR system. According to Eq. (1.2), the challenge of acoustic modeling is to build an accurate model of P(X|W), which should take into account all possible variabilities such as speaker variations, pronunciation variations, environmental variations, etc.
After converting raw speech signals into feature vectors, a training process is
performed to learn the acoustic characteristics of the training data, resulting in
a set of acoustic models. The most successful models are HMMs, as described in
Section 1.1.1. Depending on the amount of training data, an ASR system can
use discrete, continuous, or semi-continuous HMMs. When the training data is
sufficient, a continuous model offers the best recognition accuracy, while discrete
Figure 1.1: Diagram of MFCC feature extraction (speech signal → pre-emphasis → Hamming windowing → DFT magnitude → Mel-frequency filter bank → logarithm → DCT → MFCC features).
models are more effective for small amounts of data.
The model unit can be either a whole-word model or a sub-word model (e.g., a phonetic model), depending on the size of the recognition task. For small-vocabulary, isolated-word tasks, whole-word models provide better performance, while sub-word models are more flexible and robust for large-vocabulary continuous tasks.
Depending on the source of the training data, an ASR system can be either speaker-dependent (SD) or speaker-independent (SI) [9]. An SD system can achieve high recognition accuracy, but requires a large amount of training data from the target speaker, and is therefore difficult to generalize to new speakers. On the other hand, an SI system is more flexible and easier to adapt to a new speaker, though its performance is usually not as good as that of an SD system trained for a specific speaker.
This dissertation focuses on speaker-independent continuous speech recognition using continuous HMMs.
1.1.4 Robustness Issues
High accuracy and robustness are the ultimate goals of ASR systems. Due to the
great variability of speech signals, today's state-of-the-art ASR systems are still far
from matching human performance. It remains a challenge to build an ASR
system that can accurately recognize anyone's speech, on any topic, and in any
speaking environment. Variations in environmental conditions, especially noise
and microphone variations, and speaker variabilities such as gender, age, speaking
rate, and accent can greatly degrade ASR performance. For example, an
ASR system trained on clean speech degrades significantly on real-world noisy
speech; an ASR system trained on adult males' speech performs about 10%
(relative) worse on females' speech; and the performance on children's speech is
even worse.
The performance degradation is caused by mismatch between training and
test data due to environmental and/or speaker variations. Such variations can
dramatically change the characteristics of speech signals, as illustrated in Figs.
1.2 and 1.3, which show the spectrogram of speech under clean versus noisy
conditions, and speech from an adult male versus a child, respectively. These
mismatches cause ASR’s performance to fluctuate in various noise conditions, or
from speaker to speaker. This dissertation addresses speaker variation issues.
Figure 1.2: Spectrograms of a clean speech utterance from a male speaker saying
two digits "eight two" (top) and the same utterance corrupted with additive
white noise at 5 dB (bottom).
1.2 Speaker Normalization and Speaker Adaptation
Inter-speaker acoustic variations are mostly caused by differences in the vocal
tract and vocal fold apparatus. Typically, adult females have shorter vocal tract
lengths (VTL) and smaller vocal cords than adult males, while children have
shorter VTLs and smaller vocal cords than adults [10]. This implies, according
to the linear speech production theory [11], that children have higher formant
and fundamental (F0) frequencies than adults, and female adults have higher
formants and F0 than male adult speakers. Consequently, the performance of
speech recognition systems may be significantly different from speaker to speaker.
Figure 1.3: Spectrograms of an utterance from an adult speaker saying the digit
"zero" (top) and the same sentence from a child speaker (bottom).

To maintain robust recognition accuracy, speaker adaptation and speaker normalization
techniques are usually applied to reduce spectral mismatch between
training and testing utterances [12–20]. Speaker adaptation attempts to compen-
sate for spectral mismatch in the back-end acoustic model domain by statistically
tuning the acoustic models to a specific speaker [12–15]. Speaker normalization,
or vocal tract length normalization (VTLN), on the other hand, aims at reducing
the effects of vocal tract variability in the front-end feature domain via linear,
piece-wise linear or bilinear frequency warping [16,17]. Other frequency warping
functions have also been studied [18–20]. A class of transforms, known as all-
pass transforms (APTs), was proposed to perform VTLN in [19] and studied in
detail in [20] for two classes of conformal maps, namely rational all-pass trans-
forms (RAPTs) and sine-log all-pass transforms (SLAPTs). It was demonstrated
that using multiple-parameter warping functions is more effective than single-parameter
ones [20]. In speaker adaptation, parameters are speaker-specific trans-
formation matrices and biases estimated using the maximum likelihood (ML) or
maximum a posteriori (MAP) criterion [14, 21]. In VTLN, the parameters to be
estimated are the frequency warping factors. Hence, to make reliable statisti-
cal estimation of adaptation parameters, speaker adaptation methods generally
require more adaptation data than VTLN.
Considerable research efforts have been devoted to the relationship between
frequency warping in the feature domain and the corresponding transformations
in the model domain [22–30]. For computational efficiency, several studies have
proposed the possibility of directly performing VTLN in the back-end model do-
main. In [22], vocal tract length normalization was implemented in an MLLR
framework. Claes et al. in [23] proposed a linear approximation of VTLN for
reasonably small warping factors using Taylor expansion. In [24] and [25], Mc-
Donough et al. derived the linearity of VTLN in cepstral space for two all-pass
transforms (rational all-pass transforms and sine-log all-pass transforms) and con-
ducted in [26] a detailed performance comparison with MLLR on a large vocab-
ulary database. In [27], Pitz and Ney showed that, in the continuous frequency
space, VTLN is equivalent to a linear transformation in the cepstral domain
for MFCCs with Mel-frequency warping (instead of Mel-frequency filter banks).
Umesh et al. in [28] showed that this VTLN linearization also holds in the dis-
crete frequency space under the assumption of strictly limited quefrency range in
the cepstral domain. Cui and Alwan in [29] and [30] discussed in detail the lin-
earization of frequency warping for several different feature extraction schemes.
Under certain approximations, they showed that frequency warping of MFCC
features with Mel-frequency filter banks equals a linear transformation in the
model domain.
1.2.1 Linear Frequency Warping
VTLN is one of the most popular methods for reducing the effects of speaker-
dependent vocal tract variability through a speaker-specific frequency warping
function (linear, bilinear, or piece-wise linear) [16, 17, 20, 25, 27, 28]. Warping
factors are typically estimated based on the maximum likelihood (ML) criterion
over the adaptation data through an exhaustive grid search or warping-factor
specific models [16,17]. Linear frequency warping can be implemented directly in
the power spectrum domain or in the cepstral domain through the linearization
of VTLN [25, 27, 28]. Along with the linearization of VTLN, the warping factor
can be estimated using the Expectation Maximization (EM) algorithm with an
auxiliary function [25]. Other frequency warping functions have also been studied
in [20].
Another way to reduce spectral variability is to explicitly align spectral for-
mant positions or formant-like spectral peaks, especially the third formant (F3),
and to define the warping factors as formant frequency ratios [18, 23, 30–34]. In
formant-based frequency warping methods, formant positions of different speakers
are transformed into a normalized frequency space. The authors in [18] proposed
a nonlinear warping function based on a parameter estimated using F3 frequency,
while [32] extended this formant-based algorithm and compared the performance
with ML-based methods. In [31], the performance of frequency warping using the
first three formant frequencies was explored. A linear approximation of VTLN
was proposed in [23] for reasonably small warping factors estimated based on
average F3 values. Cui and Alwan proposed a novel spectral formant-like peak
alignment method, with a focus on F3, to reduce spectral mismatch between
adults and children’s speech [30, 33]. Based on the idea of frequency transfor-
mation for digital filters, the authors in [34] treated formant structures as filters
and developed a bilinear transform with parameters estimated using average F3
frequency and bandwidth values.
However, due to coarticulation, clarity, speed, and other factors, formant frequencies
vary considerably within an utterance, making the performance of
formant normalization content dependent.
1.2.2 Maximum Likelihood Linear Regression
Maximum likelihood linear regression (MLLR) [14, 15] estimates a set of linear
transformations for the mean and variance parameters of a Gaussian mixture to
reduce the mismatch between the initial model set and the adaptation data. The
linear transformation of the mean is defined as
    μ̂ = Wξ    (1.32)

where W is the transformation matrix, and ξ is the extended mean vector,

    ξ = [w μ1 μ2 · · · μn]^T    (1.33)

where w = 1 represents a bias offset. W can be decomposed into

    W = [b A]    (1.34)

and hence

    μ̂ = Aμ + b    (1.35)

where A is the transformation matrix and b is a bias vector.
The variance adaptation is given in the form:

    Σ̂ = B^T H B    (1.36)

where H is the transformation matrix to be estimated, and B is the inverse of
the Cholesky factor of Σ^{-1},

    Σ^{-1} = C C^T    (1.37)

and

    B = C^{-1}    (1.38)

The transformation matrices can be obtained by maximizing an auxiliary function
with the Expectation Maximization (EM) technique,

    Q(λ, λ̂) = \sum_{j,k} \sum_{t} ζt(j,k) log N(o(t); μ̂jk, Σ̂jk)    (1.39)

where ζt(j,k) is defined in Eq. (1.30), and N(o(t); μjk, Σjk) is the kth Gaussian
mixture component of state j.
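As a small numerical check of Eqs. (1.32)–(1.35), the decomposition W = [b A] acting on the extended mean vector can be sketched as follows (toy dimensions and randomly chosen parameters, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4                                              # feature dimension (toy)
mu = rng.standard_normal(n)                        # original Gaussian mean
A = np.eye(n) + 0.1 * rng.standard_normal((n, n))  # transformation matrix
b = rng.standard_normal(n)                         # bias vector

# W = [b A] acts on the extended mean vector xi = [1, mu]^T  (Eqs. 1.32-1.34).
W = np.hstack([b[:, None], A])                     # n x (n+1)
xi = np.concatenate([[1.0], mu])                   # extended mean vector, w = 1
mu_adapted = W @ xi

# Equivalent form of Eq. (1.35): mu_hat = A mu + b.
assert np.allclose(mu_adapted, A @ mu + b)
```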
1.3 Children’s Speech Recognition
Children’s speech analysis and recognition have drawn increasing attention for
educational purposes, and more efforts have been devoted to ASR’s applications
using children’s speech [35–40]. Speech technology has been applied in auto-
mated language and literacy tutoring to measure children’s language learning
skills, and to assess their reading and listening comprehension. Children’s speech
recognition, however, still remains challenging. Due to developmental changes in
vocal tract and vocal fold apparatus, children's speech demonstrates high acoustic
variability, which makes children's ASR more challenging compared to adults'
ASR [41, 42]. The performance of an ASR system trained using adult speech
degrades drastically when employed to recognize children’s speech. Furthermore,
recognition performance for children is usually lower than that achieved for adults
even when using a recognition system trained on children’s speech [43].
Disfluency is also an important issue in children’s speech recognition. As
part of the learning process, disfluencies such as repetitions, false starts, and
self-repairs often occur in young children's speech. It was found that mispronunciations
and partial-word repetitions account for more than 30% of the word
errors made by the speech recognizer [44]. Many approaches have been proposed
to detect disfluency in spontaneous speech. For example, a decision model was
applied in [45] using prosodic features, while [46] studied the combination of
multiple knowledge sources including acoustic-prosodic features, language mod-
els and rule-based knowledge. A disfluency-specialized grammar structure was
applied in [47] to detect disfluent reading miscues; while [48] proposed an efficient
hybrid word/subword unit recognition system which was shown to work well on
children’s speech.
In addition, accents present another challenge for ASR. If a child speaks more
than one language, his/her speech can present various levels of pronunciation
variation, e.g., different pronunciations of a phoneme, or phoneme insertion,
deletion, or substitution. Such accented speech can also degrade ASR performance.
Accent detection/classification provides a way to improve performance on accented
speakers, since, with knowledge of a speaker’s accent, specific modeling strate-
gies can be applied to better target his/her individual acoustic characteristics.
Studies on accent detection employ either feature-based or model-based meth-
ods, or the combination of them. In [49], GMM with MFCC features was used to
classify four Chinese accents (dialects) of Mandarin speech. Decision trees were
built in [50] to detect accent levels of Japanese-accented English using prosodic
features like duration and pitch (F0). The authors in [51] proposed to use parallel
phoneme recognizers followed by phoneme language models (phoneme transition
probabilities). It has been shown that foreign accents can be successfully de-
tected using GMM classifiers, neural networks or phone recognizers with acoustic
and/or prosodic features.
1.4 Organization of the Dissertation
The dissertation is organized as follows. Chapter 2 presents a rapid speaker
adaptation method using regression-tree based spectral peak alignment. Chap-
ter 3 analyzes the variabilities of subglottal resonances, and proposes an efficient
speaker normalization and cross-language adaptation algorithm based on the sec-
ond subglottal resonance. Chapter 4 applies ASR to children's speech to evaluate
their language learning skills, and addresses pronunciation, accent, and disfluency
issues. Chapter 5 summarizes the dissertation and discusses future work.
CHAPTER 2
Regression-Tree based Spectral Peak Alignment
Spectral mismatch between training and testing utterances can cause significant
performance degradation in ASR systems. One way to reduce spectral mismatch
is to reshape the spectrum by aligning corresponding formant peaks. In this
chapter, regression-tree based phoneme- and state-level spectral peak alignment is
proposed for rapid speaker adaptation using linearization of the VTLN technique
in an MLLR-like framework, taking advantage of both the efficiency of frequency
warping (VTLN) and the reliability of statistical estimation (MLLR).
2.1 Frequency Warping as Linear Transformation
Frequency warping is usually applied in the linear spectral domain, which is
computationally expensive. To make it more efficient, this section derives an
approximate linear transformation in the cepstral domain for frequency warping
applied to MFCC features.
2.1.1 Frequency Warping for MFCC
Most state-of-the-art ASR systems utilize MFCC features. Fig. 2.1 shows the
extraction of MFCC features with and without frequency warping. To derive the
relationship between warped and unwarped MFCC features, matrix operations
will be used to represent the feature extraction.

Figure 2.1: Diagram of MFCC feature extraction without (Xc) and with (Yc)
frequency warping: power spectrum → (frequency warping) → Mel-scaled filter
bank → log → DCT.

Let Sl, an l × 1 column vector,
denote the magnitude spectrum of length l (usually l = 256 for 8 kHz speech
signals, and l = 512 for 16 kHz signals), which is calculated from the Fourier
transform of a frame of the input speech signal; FB, an n × l matrix, be a Mel-
frequency filter bank, where n is the number of filter bank channels (usually
26), and each row of the FB matrix represents one Mel-frequency filter; log(·) be an
element-wise logarithm operation; C, an m × n matrix, be the DCT matrix, where
m is the number of cepstral coefficients (the number of static MFCC features,
usually 13); W, an l× l matrix, denote the discretized frequency warping matrix;
Xc, an m × 1 column vector, denote the cepstral coefficients (MFCC features)
of the original unwarped speech signal, and Yc be the cepstral coefficients after
frequency warping. We have:
    Xc = C · log(FB · Sl)    (2.1)

    Yc = C · log(FB · W · Sl)    (2.2)
To relate the warped cepstrum Yc with the unwarped one Xc, we need to
explicitly derive Sl in terms of Xc, i.e., to recover the magnitude spectrum from
the cepstrum. Unfortunately, exact recovery of Sl from Xc is not possible, because
the Mel-frequency filter bank FB, which is usually a wide matrix with
n ≪ l, is not invertible. Therefore, strictly speaking, there is no simple linear
relationship between Yc and Xc due to the non-invertibility introduced by the
Mel-frequency filter banks; however, some reasonable approximations exist [23, 28, 29].
These approximations produce acceptable ASR performance at a lower computational
cost than performing the warping directly in the spectral domain as in Eq. (2.2).
2.1.2 Approximated Linearization of Frequency Warping
The approximation proposed in [30] is applied to derive the linearization of fre-
quency warping for MFCC features. This approach is based on the concept of
an index mapping matrix. An index mapping matrix M has only one non-zero
element (with value equal to 1) in each row. Multiplying an index mapping
matrix M by any matrix X results in a row-permuted (index re-mapped)
version of the original matrix X; hence the name "index mapping".
Obviously, the product of two index mapping matrices is still an index
mapping matrix. Another property of index mapping matrices, which will be
used later in the derivation of the linearization of frequency warping, is that
multiplication by an index mapping matrix commutes with any element-wise
operation, e.g.,
log(M · X) = M · log(X) (2.3)
As will be shown in Section 2.1.3, the warping matrix for a monotonic frequency
warping function is an index mapping matrix.
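The commutation property of Eq. (2.3) is easy to verify numerically; the matrices below are toy examples chosen for illustration:

```python
import numpy as np

# An index mapping matrix: exactly one 1 per row (a row permutation/repetition).
M = np.array([[0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]], dtype=float)

X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Multiplication by M commutes with element-wise operations (Eq. 2.3).
lhs = np.log(M @ X)
rhs = M @ np.log(X)
assert np.allclose(lhs, rhs)

# The product of two index mapping matrices is again an index mapping matrix.
P = M @ M
assert np.all(P.sum(axis=1) == 1) and np.all((P == 0) | (P == 1))
```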
The approximation in [30] is to simplify the Mel-frequency filter bank by using
only the central peak value to represent each triangular Mel-frequency filter, as
shown in Fig. 2.2. That is, to retain only the nonzero central value for each row
in the Mel-frequency filter-bank matrix, and to set all other entries of the row to
zero. Under this approximation, the original Mel-frequency filter bank matrix
FB becomes an index mapping matrix referred to as F̄B. To recover the linear
spectrum Sl from the Mel spectrum, interpolation is applied to build an index
mapping matrix F*B, where unseen samples are generated by repeating neighbor
samples, such that F*B · F̄B = I. Thus, from Eq. (2.1), we have

    Sl ≈ F*B · exp(C^{-1} · Xc)    (2.4)

where exp(·) is the element-wise exponential operation, and C^{-1} is the inverse
DCT matrix. Together with Eq. (2.2), we have:

    Yc = C · log(FB · W · Sl)    (2.5)
       ≈ C · log(F̄B · W · F*B · exp(C^{-1} · Xc))    (2.6)
       = C · F̄B · W · F*B · log(exp(C^{-1} · Xc))    from Eq. (2.3)    (2.7)
       = C · F̄B · W · F*B · C^{-1} · Xc    (2.8)

Therefore, under these approximations, frequency warping for MFCC features
can be implemented as a linear transformation in the cepstral domain, i.e.,

    Yc ≈ A · Xc    (2.9)

where

    A = C · F̄B · W · F*B · C^{-1}    (2.10)

A more detailed derivation can be found in [30].
In ASR systems, both static and dynamic features are used. From Eq. (2.9),
it is straightforward to show that dynamic features also hold this linearity, i.e.,
ΔYc ≈ A · ΔXc (2.11)
Δ2Yc ≈ A · Δ2Xc (2.12)
where Δ and Δ2 represent the first and second order derivatives, respectively.
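The derivation can be checked numerically on a toy example. The sketch below uses small sizes (l = 8 linear bins, n = m = 4 Mel channels), a square orthonormal DCT so that C^{-1} = C^T, and realizes the warping matrix row-wise so that it is a proper index mapping matrix; these are simplifying assumptions for illustration, not the actual experimental configuration. In this setting the chain of Eqs. (2.6)–(2.9) holds exactly:

```python
import numpy as np

l, n = 8, 4                       # linear-spectrum length, Mel channels (toy sizes)
centers = [1, 3, 5, 7]            # toy "central peak" bins of the triangular filters

# Peak-only filter bank (index mapping): keep only each filter's center bin.
FB = np.zeros((n, l)); FB[np.arange(n), centers] = 1.0
# Interpolation matrix F*B: each linear bin repeats its nearest channel's value.
FBstar = np.zeros((l, n))
for i in range(l):
    FBstar[i, int(np.argmin([abs(i - c) for c in centers]))] = 1.0
# Warping matrix for g_alpha(f) = alpha * f, realized row-wise so that output
# bin i takes source bin round(i / alpha) -- one way to make Eq. (2.13) a
# proper index mapping matrix (an assumption of this sketch).
alpha = 1.1
W = np.zeros((l, l))
for i in range(l):
    W[i, min(int(round(i / alpha)), l - 1)] = 1.0
# Orthonormal DCT-II (square here, m = n, so the inverse is the transpose).
k, m_idx = np.meshgrid(np.arange(n), np.arange(n))
C = np.sqrt(2.0 / n) * np.cos(np.pi * m_idx * (2 * k + 1) / (2 * n))
C[0, :] *= np.sqrt(0.5)
Cinv = C.T

Sl = np.abs(np.random.default_rng(2).standard_normal(l)) + 0.1  # toy spectrum
Xc = C @ np.log(FB @ Sl)                                        # unwarped cepstrum
Yc = C @ np.log(FB @ W @ FBstar @ np.exp(Cinv @ Xc))            # warped, Eq. (2.6)
A = C @ FB @ W @ FBstar @ Cinv                                  # Eq. (2.10)
assert np.allclose(Yc, A @ Xc)                                  # linearity, Eq. (2.9)
```

With the real rectangular DCT (m < n) the relation is only approximate, which is why Eq. (2.9) is stated with "≈".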
Figure 2.2: Mel-frequency filter banks and the approximation made in the
linearization of frequency warping in [30]: each triangular filter is represented
only by its central peak value.
2.1.3 Definition of the Frequency Warping Matrix
For a given frequency warping function gα(f), the discretized frequency warping
matrix W in Eq. (2.10) is defined as
    w_ij = 1 if i = round(gα(j)), and w_ij = 0 otherwise    (2.13)
where i and j are the frequency sample indices. In real applications, the warping
function gα(f) is a monotonic function, and thus the warping matrix W is an
index mapping matrix.
Note that for mathematical tractability, simple warping functions such as
linear, piece-wise linear, bilinear or quadratic functions are generally applied to
perform frequency scaling in speaker normalization. In the derivation of the
transformation matrix A (Eq. (2.10)), however, no assumptions are made on the
warping function, i.e., gα(·) can be any reasonable monotonic mapping function.
2.1.4 Linear Frequency Warping Functions
A popular and effective warping function used in VTLN is a simple linear warping
function as:
gα(f) = α · f (2.14)
where α is the warping factor estimated from enrollment data.
With a linear warping function as in Eq. (2.14), the frequency range of the
resulting spectrum differs from the original one. To preserve the entire band-
width after warping, piece-wise linear or even nonlinear warping functions can be
applied, such that the boundary frequencies are always mapped into themselves.
An example of a piece-wise linear warping function is
    gα(f) = α · f,                                          0 ≤ f ≤ fu
    gα(f) = ((fmax − α·fu)/(fmax − fu)) · (f − fu) + α·fu,  fu < f ≤ fmax    (2.15)
where fmax is the maximum frequency in the spectrum, and fu is an empirically
chosen upper "cutoff" frequency above which the warping function deviates from
the slope α. Preliminary experimental results did not show significant improvement
of piece-wise linear over linear functions, so only the linear warping function
is used in this work.
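The two warping functions of Eqs. (2.14) and (2.15) can be sketched as follows; the cutoff and maximum frequencies here are illustrative values, not the settings used in the experiments:

```python
def linear_warp(f, alpha):
    """Linear VTLN warping, Eq. (2.14)."""
    return alpha * f

def piecewise_linear_warp(f, alpha, f_u, f_max):
    """Piece-wise linear warping, Eq. (2.15): slope alpha up to the cutoff f_u,
    then a segment chosen so that f_max maps onto itself."""
    if f <= f_u:
        return alpha * f
    return (f_max - alpha * f_u) / (f_max - f_u) * (f - f_u) + alpha * f_u

# The bandwidth is preserved: the boundary frequency maps onto itself.
assert abs(piecewise_linear_warp(8000.0, 1.1, 6000.0, 8000.0) - 8000.0) < 1e-6
# Below the cutoff the two functions agree.
assert piecewise_linear_warp(3000.0, 1.1, 6000.0, 8000.0) == linear_warp(3000.0, 1.1)
```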
2.2 Alignment of Spectral Peaks
The warping factor α is usually estimated using the maximum likelihood criterion.
Another approach is to explicitly align spectral formant positions or formant-like
spectral peaks, and to define the warping factor as a formant frequency ratio. It
was shown in [29] and [30] that aligning only the third formant (F3) offers the best
ASR performance. In other words, gα(f) is a linear warping function that aligns
formant-like peaks in the spectral domain:
    gα(f) = α · f    (2.16)

where

    α = F3(new speaker) / F3(standard speaker)    (2.17)
2.2.1 Choice of the Reference Speaker
The reference standard speaker, chosen to represent the acoustic characteristics
of the entire training set, is one of the training speakers who yields the highest
likelihood in the training stage. Since formant frequencies are gradually changing
from frame to frame, the median values of F3 over all voiced segments are used
for each speaker in Eq. (2.17). Since F3 has been shown to correlate highly with a
speaker's vocal tract length [11], this F3 peak alignment is related to vocal tract
length normalization.
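A minimal sketch of Eq. (2.17) with median-based F3 estimation; the per-frame F3 values below are hypothetical, and the formant tracking itself is not shown:

```python
import numpy as np

def f3_warping_factor(f3_track_new, f3_track_ref):
    """Warping factor alpha = F3(new) / F3(reference), Eq. (2.17).
    Each argument is a sequence of per-frame F3 estimates (Hz) from voiced
    segments; medians are used since formants drift from frame to frame."""
    return float(np.median(f3_track_new) / np.median(f3_track_ref))

# Hypothetical per-frame F3 values for a child and an adult reference speaker.
child_f3 = [3150.0, 3220.0, 3080.0, 3190.0]
adult_f3 = [2500.0, 2540.0, 2480.0, 2560.0]
alpha = f3_warping_factor(child_f3, adult_f3)
assert alpha > 1.0   # children's formants are higher than adults'
```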
Another choice for the reference standard speaker is to choose the speaker
who has a neutral warping factor (α closest to 1). That is, to define the reference
standard speaker as the one with F3 closest to the mean F3 value over the training
set, or the one with the median F3 value. Experiments with such a choice for
the standard speaker resulted in slightly worse performance, partially due to the
fact that by using the speaker with the highest training likelihood, we explicitly
transform the acoustic parameters of each speaker toward a higher likelihood
space.
2.2.2 Levels of Mismatch in Formant Structure
As mentioned before, spectral mismatch is a major reason for performance degra-
dation. There are various levels of mismatch in the formant structures, e.g.
global average, phoneme level and state level. Fig. 2.3 illustrates formant esti-
mation at these three levels. Global average formants are estimated using all the
voiced segments (including vowels and voiced consonants) from the speech data;
phoneme-level formants are estimated using speech segments for each phoneme;
and state-level formants are estimated using segments in that state.
To illustrate different mismatch levels, we calculated the global average and
phoneme-level F3 warping factors for each test speaker from the RM1 and TIDIGITS
databases (see Section 2.6.1 for detailed experimental settings). F3 values
were estimated using 10 adaptation utterances (digits) for each speaker. For the
phoneme-level F3 warping factors, we compared the three vowels in the classic
vowel triangle: front vowel /IY/, mid vowel /AA/ (/AH/ for TIDIGITS, since
there was no /AA/ in the data) and back vowel /UW/. The reference standard
speaker for RM1 was a male adult with an average F3 of 2524 Hz, and the F3 values
for /IY/, /AA/ and /UW/ were 2951 Hz, 2354 Hz and 2143 Hz, respectively. For
TIDIGITS, the reference speaker was a male with an average F3 of 2537 Hz and
phoneme-level F3 values of 2968 Hz, 2457 Hz and 2268 Hz for /IY/, /AH/ and /UW/,
respectively. The global average F3 warping factor was calculated according to
Eq. (2.17), and the phoneme-level F3 warping factor was defined in a similar
way:

    α = F3,/ph/ (new speaker) / F3,/ph/ (standard speaker)    (2.18)
where F3, /ph/ is the phoneme-level F3 value for phoneme /ph/.
Figure 2.3: Illustration of the three levels of formant estimation (global, phoneme,
and state), from sentence down to frames, phonemes, HMM states, and Gaussian
mixtures. Boundaries are obtained through forced alignment: dashed lines mark
phoneme boundaries and dotted lines mark state boundaries.
Figures 2.4 and 2.5 show the global average and phoneme-level F3 warping
factors. From these figures, it can be seen that the adult-to-adult warping factors
(as in RM1, Fig. 2.4) are in the range of [0.96, 1.08], while the child-to-adult
warping factors (as in TIDIGITS, Fig. 2.5) are in the range of [1.12, 1.26].
This is consistent with the fact that children’s formant frequencies are higher
than adults’ [11]. Compared to Fig. 2.4, the warping factors in Fig. 2.5 show
more dramatic changes from speaker to speaker. This agrees with the observation
in [41] that children's speech demonstrates larger inter- and intra-speaker spectral
variations than adults' speech.
More importantly, these figures illustrate that phoneme-level warping factors
may be very different from the global average and different phonemes may have
different warping factors. For example, warping factors for /UW/ are around
1.0 for the adult-to-adult case and around 1.15 for the child-to-adult case, while
the warping factors for /IY/ have a larger dynamic range.

Figure 2.4: F3 warping factors for /IY/, /AA/, /UW/, and the global average
for 10 test speakers (6 male and 4 female adults) from RM1.

Thus, if phoneme-
level or even lower state-level (instead of global average) warping factors are
used to reduce spectral mismatch, we can expect better performance. This is
the motivation for our proposed regression-tree based phoneme- and state-level
spectral peak alignment methods in Section 2.4.¹

¹The amount of available adaptation data is another issue and is addressed in Sections 2.4 and 2.5.
Figure 2.5: F3 warping factors for /IY/, /AH/, /UW/, and the global average
for 10 test speakers (5 boys and 5 girls) from TIDIGITS.
2.3 Speaker Adaptation Using Spectral Peak Alignment
The linearity in Eqs. (2.9), (2.11) and (2.12) bridges the gap between the front-
end feature domain and the back-end model domain techniques and thus provides
an efficient way of frequency warping. It can be used to perform rapid speaker
adaptations on HMM Gaussian mixtures with mean μ and diagonal covariance
Σ in an MLLR-like manner [14]:
    μ̂ = Aμ + b    (2.19)

    Σ̂ = B H B^T    (2.20)

where μ̂ and Σ̂ are the transformed mean vector and covariance matrix, and B is
the inverse of the Cholesky factor of the original inverse covariance Σ^{-1}. The bias
vector b in Eq. (2.19) and the covariance transformation matrix H in Eq. (2.20) are
statistically estimated from the adaptation data under the maximum likelihood
criterion,
    b = \Big\{ \sum_{j,k} \sum_{t=1}^{T} \zeta_t(j,k)\, \Sigma_{jk}^{-1} \Big\}^{-1} \Big\{ \sum_{j,k} \sum_{t=1}^{T} \zeta_t(j,k)\, \Sigma_{jk}^{-1} \big( o(t) - A\mu_{jk} \big) \Big\}    (2.21)

    H = \frac{ \sum_{j,k} (B_{jk}^{-1})^T \Big[ \sum_{t=1}^{T} \zeta_t(j,k) \big( o(t) - \hat{\mu}_{jk} \big) \big( o(t) - \hat{\mu}_{jk} \big)^T \Big] B_{jk}^{-1} }{ \sum_{j,k} \sum_{t=1}^{T} \zeta_t(j,k) }    (2.22)
where T is the number of frames of the adaptation data, and j and k are the
indices of state and mixture sets, respectively. ζt(j, k) is the posterior probability
of being at state j mixture k at time t given the observation o(t). By setting the
off-diagonal terms of H to zero, the adapted covariance Σ̂ is also diagonal.
Unlike the statistical estimation in MLLR, the transformation matrix A here
is generated deterministically based on Eq. (2.10), which depends only on the
warping factors; while in MLLR a full or block-diagonal A needs to be statistically
estimated. This would result in many more parameters than the deterministically
generated A in Eq. (2.10). Though more parameters are better able to capture
subtle differences among speakers, they may also lead to unreliable estimates
(and thus unsatisfactory performance) with limited adaptation data. The A
matrix generated using Eq. (2.10), however, can be more reliable than in MLLR
when the amount of adaptation data is small; while the statistically estimated
bias b and covariance transformation matrix H can benefit from increasing the
amount of adaptation data. Hence, this peak alignment adaptation method can
perform well for varying amounts of adaptation data.
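For a single Gaussian with diagonal covariance and unit occupancy, the bias estimate of Eq. (2.21) reduces to a small linear system. The toy sketch below (synthetic data, hypothetical parameter values) recovers a known bias given a fixed warping-derived A:

```python
import numpy as np

rng = np.random.default_rng(3)
d, T = 3, 50                           # feature dimension, adaptation frames (toy)
A = np.diag([1.05, 0.95, 1.02])        # warping-derived transform (fixed, Eq. (2.10))
mu = np.array([1.0, -2.0, 0.5])        # single Gaussian mean (one state, one mixture)
Sigma_inv = np.diag([2.0, 1.0, 4.0])   # inverse of the diagonal covariance
true_b = np.array([0.3, -0.1, 0.2])    # bias we hope to recover

obs = (A @ mu) + true_b + 0.01 * rng.standard_normal((T, d))  # adaptation frames
zeta = np.ones(T)                      # posterior occupancy (all mass on this Gaussian)

# Eq. (2.21), specialized to one Gaussian:
# b = { sum_t zeta_t Sigma^{-1} }^{-1} { sum_t zeta_t Sigma^{-1} (o(t) - A mu) }
lhs = zeta.sum() * Sigma_inv
rhs = Sigma_inv @ (zeta[:, None] * (obs - A @ mu)).sum(axis=0)
b_hat = np.linalg.solve(lhs, rhs)
assert np.allclose(b_hat, true_b, atol=0.05)
```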
2.4 Regression-tree based Speaker Adaptation
2.4.1 Global vs. Regression-tree based Peak Alignment
Several different approaches can be applied to perform formant-like peak align-
ment adaptation. In [30], speaker adaptation was employed as a global peak
alignment, i.e. to estimate the average F3 over all the adaptation data and gen-
erate the transformation matrix A according to Eq. (2.10) with the same warping
factor (Eq. (2.17)) for all model units. When performing adaptation, all means
of the HMM parameters share the same transformation A.
On the one hand, since there is only one parameter (α) to be estimated, this
global method has the potential for good performance with limited adaptation data.
On the other hand, a simple global warping factor cannot take advantage
of increasing adaptation data, for which fine and detailed modeling abilities are
required.
Another argument is that, as shown in Section 2.2.2, there are various lev-
els of mismatch in formant structures. Using only one global average warping
factor may not reduce the spectral mismatch uniformly for all phonemes. Since
different phonemes may have different warping factors, we can use phoneme- or
state-level warping factors to perform adaptation. This is the basic idea for the
regression-tree based spectral peak alignment adaptation, i.e. to align similar
(close in acoustic space) components in a similar way. This extension from global
to regression-tree based peak alignment is similar to the expansion of MLLR
from a global transform to many transforms especially when the adaptation data
increase.
Two methods are investigated to define regression classes: phoneme-based
(using phoneme-level formants) and Gaussian mixture-based (using state-level
formants). The following sections will discuss these two methods in detail.
2.4.2 Phoneme based Regression Tree
In this phoneme-based method, regression classes (units) are classified based on
phonetic knowledge and/or data-driven methods. For example, according to pho-
netic knowledge, phonemes can first be categorized into vowels and consonants,
and then consonants can be further classified as voiced or unvoiced; vowels can
further be clustered according to their phoneme-level F3 values using data-driven
methods. All model parameters for phoneme units with similar acoustic char-
acteristics (phoneme-level formants) are placed together in the same regression
class. Preliminary experiments showed that phonetic knowledge offers better per-
formance when adaptation data are limited to fewer than 5 utterances, while the
data-driven approach is superior when more data are available. Therefore, we
chose to combine the two techniques.
Figure 2.6 shows an example of a regression tree based on combined phonetic
knowledge and data-driven methods with eight base classes (terminal nodes) denoted
as {2, 3, 4, 6, 7, 8, 9, 10}. Each phoneme belongs to one specific base class. During
adaptation, the number of base classes is dynamically determined by the
amount of adaptation data. Since unvoiced consonants have no clear formant
structure in their spectra, the transformation matrix A for unvoiced consonants
is determined by the average F3 over all voiced consonants in the adaptation
data.
2.4.3 Gaussian Mixture based Regression Tree
Since formant frequencies change gradually from frame to frame, it may
be helpful to use formants at an even lower level in adaptation. In HMMs,
each phoneme unit has several states, and states are represented with Gaussian
mixtures. Hence, we can consider state-level formants, and define the regression
tree based on Gaussian mixtures of the states. In this method, Gaussian mixture
components (means and covariances) are clustered based on a measure of simi-
larity. In each class, the state-level F3 is estimated and averaged, and spectral
peaks are then aligned with the same warping factor. Similar to global average
and phoneme-level F3 warping factors (Eq. (2.17) and (2.18)), state-level warping
factor is defined as
α =F3, state m of /ph/ from new speaker
F3, state m of /ph/ from standard speaker
(2.23)
where F3, state m of /ph/ is the state-level F3 value of state m in phoneme /ph/.
For both the phoneme-based and Gaussian mixture-based methods, regression
trees are constructed from the speaker-independent training data and are
independent of new speakers. The tree is built with a centroid-splitting
algorithm using a Euclidean distance measure. Each terminal node (base class)
of the tree specifies a particular component grouping: phonemes for the
phoneme-based regression tree and states for the Gaussian mixture-based
regression tree. The following sections evaluate and compare the performance
of these different approaches to peak alignment adaptation (PAA).
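The centroid-splitting construction can be sketched as follows, here on one-dimensional toy vectors; in practice the vectors would be Gaussian mean vectors, and the perturb-and-refine schedule below is a common recipe rather than the exact one used in this work:

```python
import numpy as np

def centroid_split(vectors, n_classes, n_iter=10, eps=1e-3):
    """Grow base classes by repeatedly splitting the node whose members are
    most spread out, using a perturb-and-refine 2-means with Euclidean
    distance (a common centroid-splitting recipe)."""
    vectors = np.asarray(vectors, dtype=float)

    def distortion(idx):
        c = vectors[idx].mean(axis=0)
        return float(np.sum((vectors[idx] - c) ** 2))

    def two_means(idx):
        c = vectors[idx].mean(axis=0)
        c0, c1 = c - eps, c + eps               # perturbed centroid seeds
        left = right = None
        for _ in range(n_iter):
            d0 = np.linalg.norm(vectors[idx] - c0, axis=1)
            d1 = np.linalg.norm(vectors[idx] - c1, axis=1)
            left, right = idx[d0 <= d1], idx[d0 > d1]
            if len(left) == 0 or len(right) == 0:
                return idx, None                # node cannot be split
            c0, c1 = vectors[left].mean(axis=0), vectors[right].mean(axis=0)
        return left, right

    nodes = [np.arange(len(vectors))]
    while len(nodes) < n_classes:
        worst = max(range(len(nodes)), key=lambda i: distortion(nodes[i]))
        left, right = two_means(nodes.pop(worst))
        if right is None:
            nodes.append(left)                  # put the unsplittable node back
            break
        nodes.extend([left, right])
    return nodes
```

Each returned node is an index set over the input vectors, i.e., one base class of the regression tree.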
2.5 Integration of Peak Alignment with MLLR
As will be shown in the next section, when adaptation data are limited, both ap-
proaches of PAA, namely phoneme-class and Gaussian mixture-class based, work
well. With few parameters to estimate, PAA can handle one of the limitations
of MLLR: unreliable parameter estimation for limited data. The performance of
PAA, however, tends to saturate when more adaptation data become available,
[Figure 2.6 diagram: the root node Phonemes splits, by phonetic knowledge,
into Vowels (then Monophthongs and Diphthongs) and Consonants (then Voiced and
Unvoiced); data-driven splits below these yield the numbered base classes.]
Figure 2.6: An example of regression tree using combined phonetic knowledge and
data-driven techniques for the phoneme-based approach. Phonemes are firstly
categorized based on phonetic knowledge, and then further clustered according
to their estimated F3 values.
which is most obvious for global PAA. To some extent, this problem can be al-
leviated by increasing the number of regression classes. Since MLLR is able to
offer better performance when more data are available, we attempt to integrate
peak alignment with MLLR, i.e. to perform peak alignment first, followed by
standard MLLR.
Given the peak alignment matrix A and the additive bias vector b, the Gaussian
mixture components of the speaker-specific models are re-estimated using the
EM algorithm [5]. The auxiliary function is defined as

Q_N(λ, λ̄) = Σ_{j,k} Σ_{t=1}^{T} ζ_t(j, k) log N(o(t); Aμ_jk + b, Σ_jk)    (2.24)

where N(o(t); Aμ_jk + b, Σ_jk) is the kth Gaussian mixture of state j. The
maximum likelihood estimates of μ_jk and Σ_jk can be derived from

∂Q_N(λ, λ̄)/∂μ_jk = 0    (2.25)

∂Q_N(λ, λ̄)/∂Σ_jk = 0    (2.26)

respectively, which give

μ_jk = { Σ_{t=1}^{T} ζ_t(j, k) A^T Σ_jk^{-1} A }^{-1} { Σ_{t=1}^{T} ζ_t(j, k) A^T Σ_jk^{-1} (o(t) − b) }    (2.27)

Σ_jk = [ Σ_{t=1}^{T} ζ_t(j, k) (o(t) − μ̄_jk)(o(t) − μ̄_jk)^T ] / [ Σ_{t=1}^{T} ζ_t(j, k) ]    (2.28)

where

μ̄_jk = Aμ_jk + b    (2.29)

represents the adapted speaker-specific Gaussian mean.
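A minimal numpy sketch of the closed-form updates in Eqs. (2.27)-(2.29) for a single mixture component follows; the matrices A, b, Σ and the synthetic frames below are toy values chosen only to exercise the formulas:

```python
import numpy as np

rng = np.random.default_rng(0)
D, T = 3, 200
A = np.diag([1.1, 0.95, 1.02])      # hypothetical peak-alignment matrix (Eq. 2.10)
b = np.array([0.5, -0.2, 0.1])      # additive bias
mu_true = np.array([1.0, -1.0, 0.5])
Sigma = np.diag([1.0, 0.8, 1.2])    # current covariance of mixture (j, k)

# Toy adaptation frames o(t) drawn near the adapted mean A mu + b.
obs = A @ mu_true + b + 0.01 * rng.normal(size=(T, D))
zeta = rng.random(T)                # occupancies zeta_t(j, k)

# Eq. (2.27): closed-form re-estimation of the canonical mean mu_jk.
Si = np.linalg.inv(Sigma)
G = zeta.sum() * (A.T @ Si @ A)                      # sum_t zeta_t A^T Sigma^-1 A
kvec = A.T @ Si @ (zeta[:, None] * (obs - b)).sum(axis=0)
mu = np.linalg.solve(G, kvec)

# Eq. (2.29): adapted mean, then Eq. (2.28): covariance update.
mu_bar = A @ mu + b
diff = obs - mu_bar
Sigma_new = np.einsum("t,ti,tj->ij", zeta, diff, diff) / zeta.sum()
```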
This now can be viewed as a special case of standard speaker adaptive training
(SAT) with only one speaker-dependent model [14, 52]. However, unlike the
statistical estimation of the transforms as in SAT, which requires more adaptation
data, the transformation matrix A is generated deterministically. Therefore,
peak alignment has the potential for better performance than SAT with limited
adaptation data. The integration with MLLR, denoted as PSAT in the following
experiments, can be applied to global or regression-tree based peak alignment.
2.6 Experimental Results
2.6.1 Experimental Setup
Two different recognition tasks were carried out to evaluate the performance of
the proposed algorithm. One was a medium vocabulary recognition task using the
DARPA Resource Management RM1 continuous speech database [53], and an-
other was a connected digits recognition task using the TIDIGITS database [54].
For both databases, speech signals were first downsampled to 8 kHz and
then segmented into 25 ms frames with a 10 ms shift. Each frame was
parameterized with a 39-dimensional feature vector consisting of 12 static
MFCCs plus log energy, together with their first-order and second-order
derivatives.
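The 39-dimensional feature assembly can be sketched as follows; the delta regression window width is a common choice and may differ from the front end actually used, and edge frames wrap around for brevity (real front ends typically pad by repetition):

```python
import numpy as np

def add_deltas(static, width=2):
    """Append first- and second-order time derivatives to a (frames x dims)
    matrix of static features, via the usual regression formula."""
    denom = 2 * sum(n * n for n in range(1, width + 1))
    def deltas(x):
        num = sum(n * (np.roll(x, -n, axis=0) - np.roll(x, n, axis=0))
                  for n in range(1, width + 1))
        return num / denom
    d1 = deltas(static)
    d2 = deltas(d1)
    return np.hstack([static, d1, d2])

# 100 frames of 12 MFCCs + log energy -> 39-dimensional vectors.
frames = np.random.default_rng(0).normal(size=(100, 13))
feats = add_deltas(frames)
```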
For the RM1 database, triphone acoustic models were trained on the speaker
independent (SI) portion of the database (72 speakers, 40 utterances from each
speaker). Each triphone model had 3 states with 6 Gaussian mixtures per state.
This set of SI models produced a baseline performance of 89.2% word recognition
accuracy on the test set (10 speakers, 300 utterances from each speaker). Since
the focus here is on rapid adaptation, for each speaker the adaptation data were
limited to no more than 30 utterances for RM1 (or 35 digits for TIDIGITS),
which corresponds to less than 2 minutes for RM1 (or 30 seconds for TIDIGITS).
Adaptation data consisted of 1, 4, 7, 10, 15, 20, 25 or 30 utterances for each
speaker, and they were randomly chosen from the speaker dependent portion of
the database.
For the TIDIGITS task, acoustic models were trained on 55 adult male speak-
ers and then tested on 10 children (5 boys and 5 girls) with 77 utterances consist-
ing of 1, 2, 3, 4, 5, or 7 digits for each speaker. Acoustic HMMs were monophone-
based with 4 states for vowels and 2 states for consonants, and 6 Gaussian mix-
tures per state. The baseline word recognition accuracy was 38.9%. For each
child, the adaptation data, which consisted of 1, 5, 10, 15, 20, 25, 30 or 35 digits,
were randomly chosen from the test set and not used in the test.
In all adaptation experiments, a forward-backward alignment of the adaptation
data was first performed to assign each frame to a regression class (global
adaptation can be considered as a special case with only one regression
class). For each class, formant-like peaks were then estimated. Depending on
the amount of the adaptation data, different numbers of regression classes were
experimentally tested, and the best performances were selected for comparison.
Fig. 2.7 describes the steps for both supervised (steps 2-4) and unsupervised
(steps 1-4) peak alignment adaptation.
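A skeleton of the outer loop in Fig. 2.7 might look like the following; the class-count schedule and the per-class helpers are placeholders (the number of classes is chosen experimentally in this work, and building A_i, b_i, H_i per Eqs. (2.10), (2.21), (2.22) is omitted):

```python
def choose_num_classes(n_utts, max_classes=8):
    """Step 2: pick N from the amount of adaptation data. The schedule here
    is illustrative; the thesis selects N experimentally."""
    return 1 if n_utts < 4 else min(max_classes, 2 * (n_utts // 4))

def class_warping_factors(frame_to_class, f3_new, f3_ref):
    """Steps 3-4 in outline: given frame-to-class assignments (from the
    forced alignment) and per-class F3 estimates, return alpha_i per class.
    Building A_i, b_i, H_i (Eqs. 2.10, 2.21, 2.22) is omitted here."""
    classes = sorted(set(frame_to_class))
    return {c: f3_new[c] / f3_ref[c] for c in classes}
```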
Gaussian mixture models were used to estimate formant-like peaks [55]. In
the 4 kHz frequency range, adult speakers were observed to typically have four
formants, while children had only three. Therefore, in the peak alignment
procedure, four Gaussian mixtures were used for adults and three for children.
For comparison, speaker-specific VTLN was implemented based on a grid
search over [0.8, 1.2] with a step size of 0.02. The scaling factor producing
the maximal average likelihood was used to warp the frequency axis [16]. Since
VTLN is usually applied by warping the power spectrum, the Jacobian
determinant is difficult to compute due to the non-invertible Mel filter-bank
operations. The Jacobian compensation was therefore approximated using the
determinant of the transformation matrix A (|det A|).
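The grid search itself is straightforward; in the sketch below the per-speaker average log-likelihood is replaced by a toy function, since the real score would come from decoding warped features against the acoustic model:

```python
import numpy as np

# Stand-in for the average log-likelihood as a function of the warping
# factor; in the real system this comes from scoring the warped features
# against the acoustic model.
def avg_log_likelihood(alpha):
    return -((alpha - 1.08) ** 2)   # toy score peaking at alpha = 1.08

# Grid search over [0.8, 1.2] with a step size of 0.02, as in the setup.
grid = np.round(np.arange(0.80, 1.20 + 1e-9, 0.02), 2)
best_alpha = float(max(grid, key=avg_log_likelihood))
```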
2.6.2 Comparison of Global and Regression-tree based PAA versus
MLLR and VTLN
Experiments were first conducted to compare the performance of global (GPAA),
phoneme-class (PPAA) and Gaussian mixture-class (MPAA) based PAA with
different numbers of adaptation utterances (or digits). In all experiments,
unless otherwise specified, bias and diagonal covariance adaptation were
performed for
PAA. Block-diagonal MLLR adaptation with the optimal number of transforms was
also performed for comparison. Figs. 2.8 and 2.9 illustrate the performance of
GPAA, PPAA, MPAA, VTLN and MLLR.

For unsupervised adaptation, perform steps 1-4; for supervised adaptation,
perform steps 2-4.

1. For unsupervised adaptation only: generate transcriptions

• Locate voiced segments using cepstral peak analysis

• Estimate formant-like peaks in the spectrum

• Calculate the scaling factor α (Eq. (2.17))

• Generate the transformation matrix A (Eq. (2.10))

• Perform spectral peak alignment for each Gaussian mixture mean vector
(without adaptation of bias and covariance) (Eq. (2.30))

• Generate recognition hypotheses (with the partially adapted means) as
transcriptions of the adaptation data

2. Dynamically determine the number of regression classes N based on the
amount of adaptation data, and cluster the model parameters into N classes
C1, C2, ..., CN

3. Align with the transcriptions to assign speech frames to regression classes

4. For each regression class Ci, i ∈ {1, 2, ..., N}

• Estimate formant-like peaks in the spectrum

• Calculate the scaling factor αi (Eq. (2.18) or (2.23))

• Generate the transformation matrix Ai (Eq. (2.10))

• Estimate the bias vector bi and the covariance transformation matrix Hi
(Eqs. (2.21), (2.22))

• Adapt the mean and covariance (Eqs. (2.19), (2.20))

Figure 2.7: The speaker adaptation algorithm using regression-tree based
spectral peak alignment, for both supervised and unsupervised adaptation.

Not shown in the figures
are recognition accuracies of MLLR with one adaptation utterance using RM1
(88.2%), and with one and five adaptation digits using TIDIGITS (40.5% and
57.0%, respectively.)
Figure 2.8: Performance of VTLN, MLLR, G-PA, GPAA, PPAA and MPAA
using RM1 for supervised adaptation.
Fig. 2.8 shows that all three PAA methods can greatly improve the perfor-
mance over the baseline (with no adaptation) in all cases; VTLN and GPAA
provide the best performance with only one adaptation utterance, while PAA
methods outperform VTLN in all other cases. MLLR, however, may produce
worse performance than the baseline when only a small amount of adaptation
data is available. For example, with one adaptation utterance, MLLR produces
recognition accuracy of 88.2%, about one percent lower than the baseline. Com-
pared to MLLR, PAA performs significantly better for limited adaptation data,
Figure 2.9: Performance of VTLN, MLLR, G-PA, GPAA, PPAA and MPAA
using TIDIGITS for supervised adaptation.
with on average about 13.0% reduction of word error rate (WER) over MLLR
for one and four adaptation utterances. With increasing adaptation data, MLLR
offers better results than GPAA when the adaptation data are more than 15
utterances, while MPAA can outperform MLLR for 1-25 adaptation utterances.
Since the covariance adaptation of PAA and MLLR is the same, the main dif-
ferences between speakers seem to be characterized by the means of Gaussian
components.
Among the three PAA methods, MPAA performs the best, and significant im-
provements can be achieved by using regression-tree based PAA over global PAA.
On average, more than 11% WER reduction is obtained with MPAA over GPAA.
The advantage of MPAA over GPAA becomes greater with increasing adaptation
data. For the two regression-tree based PAAs, MPAA performs slightly better
than PPAA in all cases. This is because these three PAA methods work at dif-
ferent levels to reduce spectral mismatch: MPAA at the state level, PPAA at
the phoneme level and GPAA at the global level. As discussed in Section 2.2.2
and 2.4, lower-level (phoneme- or state-level) alignment is expected to be more
powerful than the global average to capture subtle differences between phonemes
or even states, provided that the parameters are reliably estimated. Compared
to global average formants used in GPAA, phoneme-level (PPAA) and state-
level (MPAA) formants need to be estimated with more parameters and thus
require more adaptation data. This explains the performance curves of GPAA,
PPAA and MPAA in Fig. 2.8. Experimental results for TIDIGITS (Fig. 2.9)
demonstrate similar trends to Fig. 2.8. This similarity shows that performance
improvements achieved by PAA are consistent across different tasks.
2.6.3 Discussion on Comparison of RM1 and TIDIGITS
Comparing Figs. 2.8 and 2.9, it can be noticed that the improvements for
TIDIGITS are more significant than those for the RM1 database: with only one
adaptation digit (or utterance), more than 80.0% WER reduction over the
baseline was obtained for TIDIGITS, while for RM1 the WER reduction over the
baseline was about 10.5%.
The more significant improvements with the TIDIGITS database can be ex-
plained as follows. The basic idea for PAA is to reduce spectral mismatch by
aligning formant-like peaks using estimated F3 values. The performance im-
provement will be more obvious if the F3 difference between the new speaker and
the standard speaker is significant, which is the case for TIDIGITS: for adult
males the typical F3 is about 2500 Hz, and for children it is about 3100 Hz. On the
other hand, if the F3 of the new speaker is very close to that of the standard
speaker as with the RM1 database which has only adult speakers, the effect of
peak alignment will be less pronounced. An extreme case is when the new speaker
has exactly the same global average F3 value as the standard speaker. In this
case, the global average warping factor α will be 1 (Eq. (2.17)), and the warping
matrix W will be an identity matrix (Eq. (2.13)), which will result in an identity
transformation matrix A (Eq. (2.10)) for global peak alignment (GPAA).² Thus,
theoretically, in this case global peak alignment will have little effect on reducing
spectral mismatch, resulting in marginal, if any, performance improvement. This
is also supported by experimental results with the RM1 database using global
peak alignment with only A: the speaker with α closest to 1 shows only 1.5%
average improvement, while the speaker with the largest α achieves over 10%
improvement.
Regression tree based peak alignment may still perform well even in the case
where global peak alignment fails to provide satisfactory improvement, since re-
gression tree based peak alignment utilizes phoneme or state level formant infor-
mation (instead of global average as in global peak alignment), and all phoneme-
or state-level formant values from two different speakers may not be identical.
This is another advantage of regression tree based peak alignment over global
peak alignment.
2.6.4 Performance of the Linearization Approximation
Since PAA is based on an approximate linearization of VTLN, it is also of
interest to study how good this approximation is. We compared the performance
of VTLN and GPAA using only A (without bias and covariance adaptation),³
denoted as G-PA in Figs. 2.8 and 2.9. G-PA performs a little worse than VTLN;
the differences, however, are small. This means that the linearization is a
good approximation to VTLN, and that the adaptation of bias and covariance
contributes to the better performance of GPAA.

²Strictly speaking, A will not be identity due to the approximation made in
the linearization of VTLN. However, A will be very close to identity, with
diagonal entries very close to 1 and off-diagonal entries close to 0.

³This configuration of GPAA can be viewed as a direct linear approximation of
VTLN.
The peak alignment technique was also compared in [56] with VTLN based
on parameters estimated directly under the maximum likelihood (ML) criterion,
i.e., α was statistically estimated rather than being defined as a formant
frequency ratio (Eq. (2.17), (2.18) or (2.23)) or being determined using a grid
search. Experimental results showed that GPAA achieves similar performance
to the ML-based VTLN. Peak alignment is, however, more efficient from the
computational point of view. In addition, MPAA outperforms ML-based VTLN
when the adaptation data are more than 5 utterances.
2.6.5 Comparison of PAA, PSAT and MLLR-SAT
In this section, PAA is compared with PSAT which combines peak alignment
followed by MLLR. Gaussian mixture-class based peak alignment (MPAA) is
considered as the reference which performs the best among the three PAA meth-
ods. PSAT is applied in two ways: based on GPAA (PSAT-GPAA) and on MPAA
(PSAT-MPAA).
The performance of MLLR, MPAA and PSAT is shown in Figs. 2.10 and
2.11. Compared to MPAA, PSAT (both PSAT-GPAA and PSAT-MPAA) shows
better performance with improvements, on average, of about 6% with RM1 and
20% with TIDIGITS; compared to GPAA (Figs. 2.8 and 2.9), the improvements
are even more significant (16% with RM1 and 24% with TIDIGITS). Improve-
ment trends are consistent in all cases especially with more adaptation data. As
to the two PSAT methods, PSAT-GPAA is a little better with a small amount of
adaptation data, while PSAT-MPAA outperforms PSAT-GPAA when the adap-
tation data are more than 10 utterances.
Figure 2.10: Performance of MPAA, MLLR, PSAT and MLLR-SAT using RM1
for supervised adaptation.
Compared to MLLR, the performance of PSAT-MPAA is superior in all ex-
periments with on average 14% improvement for RM1 and 23% for TIDIGITS,
though the difference becomes small as adaptation data increase. Significance
analysis shows that, at the p < 0.05 level, the improvement of PSAT-MPAA
over MLLR is statistically significant. This indicates that PSAT can take
advantage of PAA for reliable parameter estimations with limited adaptation
data, and of MLLR for statistical parameter estimations with sufficient adapta-
tion data. Another advantage of PSAT is that it can still perform well even when
there is no difference in global average F3 values between the new speaker and
the standard speaker, in which case PSAT-GPAA becomes equivalent to MLLR.4
4PSAT can be considered as the combination of PAA and MLLR. As discussed in Section
Figure 2.11: Performance of MPAA, MLLR, PSAT and MLLR-SAT using TIDIG-
ITS for supervised adaptation.
2.6.6 Comparison of PSAT and MLLR-SAT
Since PSAT can be viewed as a special case of MLLR-SAT, which is an alter-
native implementation of SAT through constrained MLLR transformations [14],
it is interesting to compare their performance. The experiments follow the steps
described in [14] and use block diagonal transforms in MLLR-SAT.
The performance of MLLR-SAT is shown in Figs. 2.10 and 2.11. MLLR-SAT
provides better performance than MLLR, decreasing WER by about 10% on
average. However, MLLR-SAT behaves similarly to MLLR in that both require a
certain amount of adaptation data (more than 20 utterances) for robust and
satisfactory performance. In contrast, PSAT is more robust for limited data,
especially
2.6.2, in this case GPAA has little effect on reducing spectral mismatch, and MLLR is credited with the performance improvements.
PSAT-GPAA, which achieves more than 17% WER reduction over MLLR-SAT
for one adaptation utterance. With more adaptation data, PSAT-MPAA
performs better than PSAT-GPAA and provides performance comparable to
MLLR-SAT. From the computational point of view, PSAT is more efficient than
MLLR-SAT, since only several warping factors, instead of a full or
block-diagonal matrix A, need to be estimated. PSAT is therefore more suitable
for rapid adaptation, where the available enrollment data for a new speaker
are limited to only a few utterances.
Another rapid adaptation method is Maximum A Posteriori Linear Regression
(MAPLR) [57–59]. MAPLR incorporates prior knowledge into the linear regression
adaptation of means and covariances by using the MAP criterion. The
hyperparameters (parameters needed to describe the prior distribution) are
estimated based on an empirical Bayes (EB) approach [57] and/or the structural
information of the models [58]. Provided that appropriate priors are chosen,
MAPLR may significantly outperform MLLR. The performance of MAPLR, however, is
highly dependent on the choice of prior distributions [59]. As in MAPLR, prior
knowledge can also be integrated into PAA through MAP estimation of the bias b
and the covariance transforms H. This work, however, focuses on PAA in the
MLLR framework and leaves the exploration of PAA in the MAPLR framework to
future work.
2.6.7 Comparison of Supervised and Unsupervised Adaptation
The previous adaptation experiments were implemented in a supervised way,
where the true transcription is known. Unsupervised adaptation can be
performed by
first generating the transcription through an initial recognition pass. Before this
initial recognition, global peak alignment (without adaptation of bias and
covariance) is conducted to reduce spectral mismatch. According to Eqs. (2.10), (2.13),
(2.14) and (2.17), the generation of matrix A is only dependent on the warping
factor α which can be estimated from voiced segments and thus requires no tran-
scription knowledge. For each test speaker, formant-like peaks are estimated
from the voiced segments of the adaptation utterance; voicing is detected using
the cepstral analysis technique [60]. Spectral peaks are then aligned with the
average F3, i.e., the Gaussian mixture means are adapted according to

μ̄ = Aμ    (2.30)
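The voicing detector for this initial pass can be sketched via cepstral peak analysis, in the spirit of [60]: a strong real-cepstrum peak in the pitch quefrency range indicates voicing. The quefrency bounds and the synthetic test frames below are illustrative choices:

```python
import numpy as np

def cepstral_peak(frame, sr=8000, fmin=60.0, fmax=400.0):
    """Return the largest real-cepstrum value in the pitch quefrency range
    (1/fmax to 1/fmin seconds); voiced frames show a prominent peak there.
    A frame is declared voiced when this value exceeds a tuned threshold."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    cep = np.fft.irfft(np.log(spec + 1e-10))
    lo, hi = int(sr / fmax), int(sr / fmin)
    return float(np.max(cep[lo:hi]))

sr = 8000
t = np.arange(int(0.025 * sr)) / sr                  # one 25 ms frame at 8 kHz
voiced_frame = np.sign(np.sin(2 * np.pi * 120 * t))  # pulse-train-like, F0 = 120 Hz
unvoiced_frame = np.random.default_rng(0).normal(size=t.size)  # noise-like
```

The periodic frame yields a much larger cepstral peak than the noise frame, which is the separation the detector relies on.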
The performance of supervised and unsupervised adaptation is shown in Ta-
bles 2.1, 2.2, 2.3 and 2.4. It should be noted that the performance listed here
for supervised and unsupervised adaptation was based on different numbers of
regression classes: in all cases, the number of classes for unsupervised adapta-
tion was smaller than that of the corresponding supervised case. For example,
for the RM1 database, when the adaptation data consist of 20 utterances, 105
Gaussian mixture classes were found to give the best results for unsupervised
adaptation, while 150 classes were optimal for supervised adaptation. The num-
ber of regression-tree base classes used in MPAA and PSAT for each testing case
is given in the tables in the row labeled “# of classes”. The optimal number of
base classes for MLLR can be different.
From these tables, compared to supervised adaptation, unsupervised peak
alignment adaptation performs slightly worse in all experimental cases, but the
difference is not large: 0.5% and 0.8% absolute WER increase for PSAT-MPAA
using RM1 with 10 and 30 adaptation utterances, respectively; 0.2% and 0.7%
absolute WER increase for PSAT-MPAA using TIDIGITS with 10 and 35 adaptation
digits.

Table 2.1: Word recognition accuracy using RM1 for supervised adaptation.

Number of adaptation utterances
                1     4     7    10    15    20    25    30
MLLR         88.2  90.8  92.0  93.0  94.0  94.8  95.3  95.9
GPAA         90.8  91.5  92.4  93.3  93.7  94.0  94.3  94.4
MPAA         90.3  91.9  92.8  94.0  94.7  95.1  95.4  95.6
PSAT-MPAA    90.5  92.0  92.9  94.2  95.0  95.6  95.9  96.4
# of classes   10    40    50    75   100   150   175   225

Table 2.2: Word recognition accuracy using RM1 for unsupervised adaptation.

Number of adaptation utterances
                1     4     7    10    15    20    25    30
MLLR         86.5  89.3  90.5  91.5  92.3  93.3  94.4  94.6
GPAA         90.7  91.3  92.2  93.1  93.6  93.9  94.0  94.2
MPAA         88.7  90.2  91.3  93.4  94.0  94.2  94.6  94.9
PSAT-MPAA    89.0  90.7  91.8  93.7  94.6  94.9  95.2  95.6
# of classes    5    20    35    60    80   105   135   195

There are two possible reasons for this small difference. One is that, after
the global peak alignment, the partially adapted models produce a high
recognition accuracy and thus an acceptable labeling of the adaptation data.
The other is that with a smaller number of classes, it is more likely for unsuper-
vised adaptation to reduce the effect of misclassified frames (due to the initial
recognition errors) and thus to generate robust estimation for the adaptation pa-
rameters. This explains why the unsupervised GPAA performs almost the same
as the supervised case, especially for the highly mismatched TIDIGITS database
with the differences being less than 0.2% in all cases. Compared to GPAA, unsu-
pervised PSAT-MPAA achieves on average 6.8% and 12.7% WER reduction for
RM1 and TIDIGITS, respectively.
Table 2.3: Word recognition accuracy using TIDIGITS for supervised adaptation.

Number of adaptation digits
                1     5    10    15    20    25    30    35
MLLR         40.5  57.0  88.9  92.8  94.6  95.7  96.6  96.9
GPAA         87.9  93.3  93.5  93.9  94.2  94.4  94.2  94.4
MPAA         88.5  93.9  94.1  94.7  94.9  95.8  95.9  96.3
PSAT-MPAA    88.5  94.0  94.3  94.7  95.0  96.0  96.8  97.4
# of classes    5    25    30    40    55    80   100   125

Table 2.4: Word recognition accuracy using TIDIGITS for unsupervised
adaptation.

Number of adaptation digits
                1     5    10    15    20    25    30    35
MLLR         38.9  55.3  88.2  92.3  94.5  95.1  95.9  96.1
GPAA         87.7  93.2  93.4  93.8  94.1  94.3  94.2  94.4
MPAA         86.4  92.3  94.0  94.1  94.5  95.3  95.1  95.2
PSAT-MPAA    86.4  92.3  94.1  94.2  94.7  95.6  96.2  96.7
# of classes    3    20    25    25    30    50    75    95
2.6.8 Significance Analysis
We use the matched-pair test proposed in [61] to analyze whether the performance
differences between MLLR and regression-tree based peak alignment (MPAA) are
statistically significant for both the supervised and unsupervised adaptations.
Tables 2.5 and 2.6 show the significance levels (p-value) of MPAA compared to
MLLR for supervised speaker adaptation with various amounts of adaptation
data.
Table 2.5: Significance analysis of performance improvements of MPAA over
MLLR using RM1 for supervised adaptation

# of utterances     1      4      7     10     15     20     25     30
p-value         0.007  0.009  0.013  0.018  0.026  0.031  0.039  0.043

Table 2.6: Significance analysis of performance improvements of MPAA over
MLLR using TIDIGITS for supervised adaptation

# of digits         1      5     10     15     20     25     30     35
p-value         0.001  0.003  0.008  0.012  0.027  0.043  0.025  0.034

These tables show that, for a given significance level β = 0.05, the average
using both RM1 and TIDIGITS for supervised adaptation. Examining the sig-
nificance levels for different amounts of adaptation data, we can see that the
performance improvements of MPAA over MLLR are more significant for limited
adaptation data (less than 20 utterances). This is because the transforms A
are generated deterministically in MPAA, whereas in MLLR they are estimated
statistically and become unreliable when adaptation data are insufficient.
Similar conclusions also
hold for the unsupervised adaptations using both the RM1 database and the
TIDIGITS database.
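In the spirit of the matched-pair test of [61], a normal-approximation version over per-segment error counts of two systems might look like this (the exact statistic in [61] differs in detail; this is a sketch):

```python
import numpy as np
from math import erf, sqrt

def matched_pair_p(errors_a, errors_b):
    """Two-sided matched-pairs test on per-segment error counts of two
    systems, using a normal approximation to the paired statistic."""
    d = np.asarray(errors_a, dtype=float) - np.asarray(errors_b, dtype=float)
    z = d.mean() / (d.std(ddof=1) / np.sqrt(len(d)))
    return 1.0 - erf(abs(z) / sqrt(2.0))   # equals 2 * P(Z > |z|)
```

A small p-value means the per-segment differences are too consistent to be chance; a value near 1 means the two systems are indistinguishable.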
Analysis of PSAT-MPAA versus MPAA does not show significant differences
between the two algorithms. The performance improvements of PSAT-MPAA
over MLLR, however, are statistically significant in all the testing cases, at sig-
nificance levels less than 0.05.
2.7 Summary and Conclusion
Various levels of spectral mismatch in formant structures cause ASR systems to
perform unsatisfactorily. Regression tree based spectral peak alignment is pro-
posed as a rapid speaker adaptation to reduce phoneme- and state-level spectral
mismatch. This method is investigated in an MLLR-like framework based on
the linearization of VTLN. In the proposed approach, the transformation matrix
for Gaussian mixture means is generated deterministically by aligning phoneme-
and state-level formant-like peaks in the spectrum; adaptation of the bias and
covariance is estimated using the EM algorithm. This method can be viewed as a
combination of VTLN and MLLR. On the one hand, like VTLN, the transforma-
tion matrix for means has fewer parameters than MLLR to be estimated, which is
advantageous for limited adaptation data. On the other hand, like MLLR, biases
and covariances are adapted using the EM algorithm. Statistical estimation has
an advantage when large amounts of adaptation data are available. Hence, the
proposed approach has the potential of good performance for both limited and
large amounts of adaptation data.
The performance of the peak alignment approach was evaluated on both a medium
vocabulary task (the RM1 database) and a connected digits recognition task
(the TIDIGITS database). In both tasks, experimental results show that
significant performance improvements can be achieved through peak alignment
adaptation, even
for very limited adaptation data, with state-level peak alignment (MPAA) per-
forming the best. When sufficient adaptation data are available, peak alignment
adaptation offers results similar to or slightly worse than MLLR. The PSAT
method which integrates peak alignment with MLLR, however, shows better
performance than MLLR and comparable performance with MLLR-SAT in all
cases. Another merit of this regression-tree based spectral peak alignment is that
when implementing adaptation in an unsupervised way, only a slight performance
degradation is observed compared to supervised adaptation.
CHAPTER 3
Speaker Normalization based on Subglottal
Resonances
Speaker normalization typically focuses on variabilities of the supra-glottal (vocal
tract) resonances, which constitute a major cause of spectral mismatch. Recent
studies show that the subglottal airways also affect spectral properties of speech
sounds. This chapter presents a speaker normalization method based on estimat-
ing the second subglottal resonance. Since the subglottal airways do not
change shape during speech, the subglottal resonances are independent of the
sound type (i.e., vowel, consonant, etc.) and of the language spoken, and
remain constant for a given speaker. This context-free property makes the
proposed method suitable
for limited data speaker normalization and cross-language adaptation.
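If the warping factor is taken as a ratio of second subglottal resonances, analogous to the F3 ratios of Chapter 2, the context-free property means a single estimate per speaker suffices. The sketch below assumes that ratio form; the reference value is the typical adult-male Sg2 from Section 3.1.1, and the actual Sg2 estimator is developed later in this chapter:

```python
# Reference Sg2 (Hz): typical adult-male value from Section 3.1.1.
SG2_REF = 1450.0

def sg2_warp_factor(sg2_new_hz, sg2_ref_hz=SG2_REF):
    """One context-free warping factor per speaker: the ratio of the new
    speaker's Sg2 to the reference Sg2 (hypothetical formulation)."""
    return sg2_new_hz / sg2_ref_hz

# Because Sg2 is constant for a speaker, the same factor applies to every
# phoneme and every utterance, e.g. for a speaker with Sg2 near 1600 Hz:
alpha = sg2_warp_factor(1600.0)
```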
3.1 Subglottal Acoustic System and Its Coupling to Vocal
Tract
3.1.1 Subglottal Acoustic System
The subglottal acoustic system refers to the acoustic system below the glottis,
which consists of the trachea, bronchi and lungs. Similar to the vocal tract, the
acoustic input impedance of the subglottal system is characterized by a series
of poles (or resonances) and zeros. Unlike the supraglottal system, however,
the configuration of the subglottal system is essentially fixed and thus the poles
and zeros are expected to remain constant for a given speaker. Like formant
frequencies, subglottal resonances are generally higher for female speakers than
for male speakers, and there are substantial individual differences from speaker to
speaker. It has been shown that the lowest three subglottal resonances, Sg1,
Sg2, and Sg3, are around 600, 1450 and 2200 Hz, respectively, for adult
males, and around 700, 1600, and 2350 Hz for adult females [62].
3.1.2 Coupling between Subglottal and Supraglottal Systems
When the glottis is open, the subglottal system is coupled to the vocal tract and
can influence the speech sound output. Fig. 3.1 shows a schematic model of vocal
tract coupling to the trachea through the glottis and its equivalent circuit model,
where Zl is the impedance of the subglottal system, Zg is the glottal impedance,
Zv is the impedance looking into the vocal tract from the glottis, Ug is the volume
velocity through the glottis, and Uv is the airflow into the vocal tract. Coupling
between the subglottal and supraglottal airways is thought to occur primarily
when the glottis is open, such as during a voiceless consonant or the open phase
of glottal vibration in a voiced sound, although [63] and [64] suggest that coupling
may also occur when the vocal folds are closed, either by means of a posterior
glottal opening or the vocal fold tissue itself.
During coupling, each subglottal resonance contributes a pole-zero pair to the
speech spectrum, in addition to the vocal tract pole-zero pairs. The frequency of
the zero is the same as that of the subglottal resonance, while the pole is shifted
upward in frequency away from the resonance and depends somewhat on the
vocal tract configuration. This is because the zero is a function only of the part
of the entire system behind the source (that is, the subglottal airways), while the
Figure 3.1: Schematic model of vocal tract with acoustic coupling to the trachea
through the glottis (a) and the equivalent circuit model (b). Adapted from [62].
pole is a function of the entire system, including the subglottal and supraglottal
airways [63, 65].
3.1.3 Effects of Coupling to Subglottal System
The pole-zero pair introduced in the speech spectrum around Sg2 generally falls
within the range of 1300 to 1500 Hz for adult males, and between 1400 and 1700
Hz for adult females [62]. It is somewhat higher in frequency for children [66].
When F2 crosses the Sg2 pole-zero pair, F2 can jump in frequency or diminish
in amplitude, or both, resulting in a discontinuity in the F2 trajectory [65]. This
is illustrated in Fig. 3.2 for an eight-year-old girl speaking the word boy, and
it is schematically represented in Fig. 3.3. In both figures F2 rises from a low
frequency to a high frequency, crossing the Sg2 pole-zero pair along the way. The
F2 discontinuity in Fig. 3.2 is marked by a diminished amplitude in the vicinity
of the zero. The Sg2 pole has a very low amplitude except during the time when
F2 is nearby. In Fig. 3.2 the diffuse energy between F2 and the zero at 250 ms
is likely due to the Sg2 pole, its amplitude decreasing as F2 continues to rise.
Figure 3.2: Spectrogram for the word boy from an eight-year-old girl. The second
subglottal resonance Sg2 for this speaker is at 1920 Hz.
3.1.4 Subglottal Resonances and Phonological Distinctive Features
Recent studies [67–69] have shown that the acoustic contrasts for some phonologi-
cal distinctive features are dependent on the subglottal resonances. As illustrated
in Fig. 3.4, for example, the vowel feature [back]1 is dependent on the frequency
of Sg2, such that a vowel with F2 > Sg2 is [-back] and a vowel with F2 < Sg2 is [+back].

1 The place of articulation feature [+/-back] specifies the tongue position during speech production: [+back] segments are produced with the tongue dorsum bunched and retracted slightly to the back of the mouth, while [-back] segments are bunched and extended slightly forward.

Figure 3.3: Illustration of the F2 discontinuity caused by Sg2. The bold solid line corresponds to the most prominent spectral peak (F2), which has a jump in frequency and a decrease in amplitude when F2 is crossing the subglottal resonance Sg2. The dotted line represents the Sg2 pole, which varies somewhat in frequency and amplitude when F2 is nearby. The horizontal thin solid line represents the Sg2 zero, which is roughly constant. Adapted from [62].

The ability of Sg2 to underlie the distinctive feature [back] is likely
derived from the fact that Sg2 is roughly constant for a given speaker. Subglot-
tal resonances could potentially be affected by lung volume, larynx height, and
glottal configuration. Lung volume has been shown not to significantly affect
the subglottal resonances in one study [70], and reported accelerometer measure-
ments of subglottal resonances across utterances (in which phonetic content was
varied and voice quality was uncontrolled - both of which may affect laryngeal
height and glottal configuration) have had standard deviations on the order of 30
Hz or less [65]. Thus, although the influence of lung volume, larynx height, and
glottal configuration on subglottal resonances invites further research, the avail-
able evidence appears to indicate that subglottal resonances are roughly constant
under normal speaking conditions.
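Stated as code, the [back] decision above is a single comparison against the speaker's Sg2; a minimal sketch (the function name is illustrative):

```python
def vowel_backness(f2_hz, sg2_hz):
    """Classify the feature [back] from F2 relative to the speaker's Sg2:
    F2 > Sg2 implies [-back], F2 < Sg2 implies [+back] (cf. Fig. 3.4)."""
    return "-back" if f2_hz > sg2_hz else "+back"
```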
For this reason, Sg2 might be useful in speaker normalization, since it is
context independent but speaker dependent. Sg1 and Sg3 have also been claimed
to play a role in distinguishing different classes of speech sounds, but Sg2 has been
more thoroughly studied. In this chapter, therefore, we focus on Sg2 estimation and its application to speaker normalization.
3.2 Estimating the Second Subglottal Resonance
3.2.1 Estimation based on Frequency Discontinuity
3.2.1.1 A simple detection algorithm Sg2D1
As noted above, when F2 crosses Sg2, there is a discontinuity in the F2 trajectory.
An automatic Sg2 detector (Sg2D1) is developed based on the frequency discon-
tinuity. The Snack sound toolkit [71] is used to generate the F2 trajectory. The
tracking parameters are specifically tuned to provide reliable F2 tracking results
on children’s speech. Manual verification and/or correction are applied through
visually checking the tracking contours against the spectrogram. (Note that this
method is limited by the accuracy of the formant tracker, which is known to
encounter difficulties in high-pitched speech such as that produced by children.)
The F2 discontinuity is detected based on the smoothed first order difference of
the F2 trajectory, as shown in Fig. 3.5.

Figure 3.4: Illustration of the relative positions of vowel formants F1 (·), F2 (+) and F3 (x) and the subglottal resonances (Sg1, Sg2 and Sg3) for an adult male speaker. For the vowels /i, I, E, æ/ F2 > Sg2, and they are therefore [-back]. For the vowels /a, 2, O, U, u/ F2 < Sg2, and they are therefore [+back]. Adapted from [67].

If the F2 values on the high and low frequency side of the discontinuity are F2high and F2low, respectively, then the algorithm estimates Sg2 as:

Ŝg2 = (F2high + F2low) / 2    (3.1)
If no such discontinuity is detected, the algorithm uses the mean F2 over the
utterance. In many such cases, such as during a monophthong, F2 is consistently
above or below Sg2, and the mean F2 value is either too high or too low. Thus,
the estimated Sg2 values are dependent on the speech sound analyzed.
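A minimal sketch of the Sg2D1 logic, assuming a frame-by-frame F2 track in Hz; the jump threshold and function name are illustrative, not the values of the actual implementation:

```python
import numpy as np

def estimate_sg2_d1(f2_track, jump_threshold=150.0):
    """Illustrative sketch of Sg2D1: find the largest frame-to-frame jump in
    the F2 track; if it exceeds a (hypothetical) threshold, average the F2
    values on either side of it (Eq. 3.1), otherwise back off to the mean F2."""
    f2 = np.asarray(f2_track, dtype=float)
    jumps = np.abs(np.diff(f2))
    k = int(np.argmax(jumps))
    if jumps[k] < jump_threshold:          # no discontinuity detected
        return float(f2.mean())
    f2_low, f2_high = sorted((f2[k], f2[k + 1]))
    return 0.5 * (f2_high + f2_low)        # Eq. (3.1)
```

The back-off branch is exactly the content-dependence problem described above: for a monophthong the returned mean F2 sits entirely above or below the true Sg2.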
Furthermore, discontinuities in F2 may arise from factors other than the subglottal resonances, including pole-zero pairs from the interdental spaces [72].
Figure 3.5: An example of the detection algorithm: the frame-by-frame F2 track with the detected and manually measured Sg2 (top), and the smoothed first-order difference of the F2 track with the detection threshold (bottom).
These discontinuities occur a few hundred Hz higher than Sg2 discontinuities,
but are sometimes more prominent than Sg2 discontinuities and can therefore be
mistakenly detected as Sg2.
3.2.1.2 An improved detection algorithm Sg2D2
To address both issues, an improved Sg2 estimation algorithm (Sg2D2) is then
developed. There are two main differences between Sg2D2 and Sg2D1, namely:
a. Sg2D2 applies an empirical formula to guide the discontinuity search, which
serves as a starting point for the search, and also as a back-off point in cases
where no discontinuities are detected.
b. Sg2D2 uses a statistical method to estimate Sg2 from F2high and F2low, instead of simply using the average as in Sg2D1.
The Sg2D2 algorithm works as follows. It first detects F3 and obtains an
estimate of Sg2 using a formula derived in [63]:
˜Sg2 = 0.636 × F3 − 103 (3.2)
Note that this formula was derived by linear regression on children's speech data for which simultaneous accelerometer recordings were available, and its extension to adult speech may need further refinement.
The algorithm then searches for a discontinuity within ±100 Hz of this esti-
mate using the original algorithm. The range ±100 Hz is chosen based on calculated standard deviations of Sg2 on the calibration data. If no discontinuity in
this range is found, ˜Sg2 in Eq. (3.2) is used. If a discontinuity is found, Sg2 is
estimated using the following equation:
ˆSg2 = β × F2high + (1 − β) × F2low (3.3)
where β is a weight in the range (0, 1) that controls the closeness of the detected
Sg2 value to F2high. The optimal value of β is calibrated over the data described
below using the minimum mean square error criterion:
β = arg min E{(Ŝg2 − Sg2)²}    (3.4)
and is found to be 0.65 in our experiments.
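Putting Eqs. (3.2) and (3.3) together, the Sg2D2 decision reduces to a few lines; a simplified sketch in which the ±100 Hz discontinuity search is abstracted into the f2_high/f2_low arguments:

```python
def estimate_sg2_d2(f3, f2_high=None, f2_low=None, beta=0.65):
    """Sketch of Sg2D2: an F3-based prior (Eq. 3.2) anchors the search; if a
    discontinuity was found near the prior, combine its two sides with the
    calibrated weight beta (Eq. 3.3), else back off to the prior."""
    sg2_prior = 0.636 * f3 - 103.0                 # Eq. (3.2)
    if f2_high is None or f2_low is None:          # no discontinuity detected
        return sg2_prior
    return beta * f2_high + (1.0 - beta) * f2_low  # Eq. (3.3)
```

Here beta=0.65 is the value calibrated in the text; the function signature itself is our own rendering, not the original implementation.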
3.2.2 Estimation based on Joint Frequency and Energy Measurement
The estimation method in Section 3.2.1 is based solely on F2 frequency dis-
continuities. Though simple and efficient, this method may produce unreliable
estimates in cases where F2 discontinuities are not detectable. Speech analysis
studies have shown that discontinuities and attenuations of formant prominence
typically occur near resonances of the subglottal system [65]. Take Sg2 for exam-
ple, which has been more thoroughly studied than other subglottal resonances.
When F2 approaches Sg2, an attenuation of 5-12 dB in the F2 energy prominence (E2) is always observed, while an F2 frequency discontinuity in the range of 50-300 Hz often occurs.
Since E2 attenuation always occurs when F2 crosses Sg2, a joint F2 and E2
measurement (Sg2DJ) is developed to improve the reliability of Sg2 estimation.
The detection algorithm works as follows:
1. Track F2 and E2 frame by frame using LPC analysis and dynamic pro-
gramming. The F2 tracking algorithm is similar to that used in Snack [71],
with parameters specifically tuned to provide reliable F2 tracking results on
children’s speech. Manual verification and/or correction is applied through visually checking the tracking contours against the spectrogram.
2. Search within ±100 Hz around ˜Sg2 (Eq. (3.2)) for F2 discontinuities (F2d)
and E2 attenuation (E2a).
3. Check if F2d and E2a correspond to the same location. Apply decision
rules for Sg2 estimation.
The decision rules are biased toward E2 attenuations, since E2 attenuations are
more correlated with Sg2. If the time information of F2 discontinuity matches
that of E2 attenuation, as shown in Fig. 3.6, Eq. (3.3) is used for Sg2 estimation.
Otherwise, if F2 discontinuities are not detectable or F2 discontinuities and E2
attenuations disagree, as shown in Fig. 3.7, the estimation will only rely on E2
attenuation, and uses the average F2 value around E2a as Sg2. If in some extreme
cases E2 attenuation is not detectable, which rarely occurs in our experiments,
then Eq. (3.2) would be used for Sg2 estimation. In other words, in cases where
F2 discontinuities are detectable and agree with energy attenuations, Sg2DJ gives
exactly the same estimates as Sg2D2 does; while in other cases, Sg2DJ and Sg2D2
may provide different estimates.
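The decision rules above can be sketched as follows (the frame-matching tolerance and argument names are illustrative):

```python
def sg2dj_decision(f2d_frame, e2a_frame, sg2_from_eq33, sg2_from_eq32,
                   mean_f2_near_e2a, tol=1):
    """Sketch of the Sg2DJ decision rules, biased toward E2 attenuation:
    - F2 discontinuity and E2 attenuation at the same frame -> Eq. (3.3);
    - E2 attenuation only, or the two locations disagree -> mean F2 around E2a;
    - neither detectable (rare) -> the F3-based estimate of Eq. (3.2).
    Frame arguments are None when the corresponding event is not detected."""
    if (f2d_frame is not None and e2a_frame is not None
            and abs(f2d_frame - e2a_frame) <= tol):
        return sg2_from_eq33
    if e2a_frame is not None:
        return mean_f2_near_e2a
    return sg2_from_eq32
```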
Figure 3.6: Example of the joint estimation method where F2 discontinuity and
E2 attenuation correspond to the same location (frame 38). Eq. (3.3) is used to
estimate Sg2.
3.3 Calibration of the Sg2 Estimation Algorithm
To verify and calibrate our Sg2 estimation algorithm, acoustic data were collected
from six female children aged 2 to 17 years old (speakers G1-G6 in [63]). The
children were native speakers of American English and all of them except the
youngest were recorded repeating the phrase ‘hVd, say hVd again’ three times for
each of the vowels [i], [I], [E], [æ], [a], [2], [o], [U], and [u]. The subjects also recited
the alphabet, counted to 10, and recited a few short sentences. The recording list was presented in random order and verbally prompted by the experimenter.

Figure 3.7: Example when there is a discrepancy between the locations of the F2 discontinuity (not detectable) and the E2 attenuation (at frame 51). The average F2 value within the dotted box is then used as the Sg2 estimate.
The youngest speaker (G1) was recorded counting to 10, reciting the alphabet,
and answering questions of the sort ‘What is this?’, in which the experimenter
pointed to his hand or head, for instance. All utterances were recorded in a sound-
isolated chamber using a SHURE BG4.1 uni-directional condenser microphone,
and an accelerometer. Both the speech and accelerometer signals were digitized
at 16kHz. Microphone signals of each speaker were used to measure average F3
and the discontinuity in the F2 track. An independent direct measure of the
average Sg2 for each speaker was obtained from an accelerometer signal. The
accelerometer was attached to the skin of the neck below the larynx so that the
measured vibration of the neck skin is directly related to the acoustic pressure
variations in the air column at the top of the trachea [65,70]. The accelerometer
signal can therefore act as a stand-in for the subglottal input impedance, in which
the subglottal resonances appear as formants in the accelerometer spectrum.
The detection algorithms Sg2D1 and Sg2D2 were calibrated (to estimate dis-
continuity thresholds for both Sg2D1 and Sg2D2, and β for Sg2D2) on data from
two of the recorded children and tested on the remaining four. The values mea-
sured from the accelerometer data were used as the ‘ground truth’ Sg2 frequencies
(henceforth denoted by ‘Sg2Acc’). The average Sg2 estimates (with standard de-
viations) over various vowel contents are shown in Table 3.1. Compared to Sg2D1,
the algorithm Sg2D2 estimates Sg2 much better with less variance across vowels.
The observed standard deviation values of Sg2D2 are similar to those from man-
ually measured Sg2’s (Sg2M2) in this study and those found in other studies [66].
The performance of these two algorithms was investigated in more detail for
each vowel for two speakers and the results are shown in Table 3.2 and Fig.
3.8. As stated earlier, if no discontinuity in the F2 track is detected (as for the
vowels above the double line, Table 3.2), Sg2D1 uses the mean F2 as Sg2 and
thus is highly dependent on vowel content. Sg2D2, on the other hand, uses a
formula to estimate Sg2 from F3 which is less content-dependent than F2. In
such cases, it can be seen that the formula in Sg2D2 gives much closer estimates
to the ground truth, especially for mid and back vowels. For the case when there
is a discontinuity in the F2 trajectory (as for the diphthongs below the double
line), both algorithms work well when the F2 discontinuity is from Sg2, as for
speaker 1. In this case, Sg2D1 gave an estimate within about 70Hz of the true
2 The manual Sg2’s were estimated through visually examining the speech spectrogram, and then applying Eq. (3.2) or Eq. (3.3) depending on the existence of F2 discontinuities.
Table 3.1: Comparison of Sg2 estimates for the two algorithms over various vowel contents, where Sg2M is the manual measurement from the speech spectrum, and Sg2Acc is the ‘ground truth’ measurement from the accelerometer signal. For each algorithm the average Sg2 estimates (Hz) are shown (with standard deviations in parentheses). The two speakers marked with a * are those used for calibration.
Speaker Sg2D1 Sg2D2 Sg2M Sg2Acc
1 2135 (531) 2194 (95) 2193 (97) 2176
2 2115 (334) 1766 (137) 1719 (112) 1646
3 2586 (467) 2718 (143) 2634 (135) 2679
4 2098 (358) 1823 (151) 1781 (129) 1614
5* 2065 (267) 2021 (79) 2013 (76) 1970
6* 1612 (251) 1689 (72) 1681 (65) 1648
Sg2 value, while the Sg2D2 estimate was within less than 10Hz. For speaker 2,
where the most prominent F2 discontinuity was probably from the interdental
space, Sg2D1 gave an estimate hundreds of Hz above the Sg2 value, while Sg2D2
roughly located the correct Sg2 value using Eq. (3.2). Thus, Sg2D2 is less
prone to mistakenly detecting discontinuities not caused by Sg2. In addition to
diphthongs, discontinuities in F2 should also be detectable in certain consonant-
vowel transitions [63]. Since Sg2D2 performs consistently better than Sg2D1,
we’ll focus only on Sg2D2 in the following experiments. As shown in Tables 3.1
and 3.2, and Fig. 3.8, the proposed detector produces Sg2 estimates close to
the ground truth. Also, as will be shown in the experimental section (Section
3.5), the estimated Sg2 helps to improve ASR’s performance on children’s speech,
which is of primary interest to us.
The algorithm Sg2DJ was also evaluated and compared to the F2 discontinuity-based detection algorithm Sg2D2. Improved accuracy was achieved in cases where F2 discontinuities and E2 attenuations disagree; such F2 discontinuities may be caused by factors other than the subglottal resonances, e.g., pole-zero pairs from the interdental spaces [72].

Table 3.2: Detailed comparison of Sg2 estimates for the two algorithms on two speakers. For vowels above the double line, there are no discontinuities in the F2 trajectory, and Sg2D1 uses the mean F2 as Sg2 while Sg2D2 uses Eq. (3.2) (˜Sg2) to make an estimate; for vowels below the double line, the F2 discontinuity is detectable, and Sg2D1 uses Eq. (3.1) while Sg2D2 uses Eq. (3.3). The row ‘Avg.(std)’ shows the mean (and standard deviation) for each algorithm.

Vowel        Speaker 1 (age 6)         Speaker 2 (age 13)
             Sg2Acc: 2176 Hz           Sg2Acc: 1646 Hz
             Sg2D1       Sg2D2         Sg2D1       Sg2D2
[i]          2987        2312          2563        1971
[I]          2515        2306          2439        1909
[e]          2894        2115          2629        1998
[E]          2799        2291          2378        1867
[æ]          2382        2289          2350        1863
[a]          1599        2020          1796        1700
[2]          1687        2243          1948        1704
[o]          1512        2185          1497        1613
[U]          1578        2228          1964        1717
[u]          1739        2071          1825        1631
=============================================================
[au]         1841        2114          1974        1617
[aI]         2103        2170          2072        1709
[OI]         2115        2183          2063        1659
Avg.(std)    2135 (531)  2194 (95)     2115 (334)  1766 (137)
Since for most of the speakers there are no significant performance differences between the two algorithms Sg2DJ and Sg2D2, and Sg2D2 is more efficient than Sg2DJ, Sg2D2 is used for estimation in the following experiments unless otherwise stated.

Figure 3.8: Comparison of Sg2 estimates for the two speakers in Table 3.2, top panel for speaker 1 and bottom panel for speaker 2.
3.4 Variability of Subglottal Resonance Sg2
3.4.1 The Bilingual Database
The acoustic characteristics of children’s speech have been shown to be highly
different from those of adult speech, in terms of pitch and formant frequencies,
segmental durations, and temporal and spectral intra- and inter-speaker variabil-
ities [41, 42]. Studies of subglottal resonances [65, 67–69, 73, 74], however, have
mainly focused on adult speech in English with little effort devoted to children’s
speech or to other languages [66,75,76]. This section analyzes children’s speech in
English and Spanish, investigating the variabilities of Sg2 under different contents
and across languages.
To examine the cross-language variability of Sg2 frequencies, we recorded a
database (ChildSE) of 20 bilingual Spanish-English children (10 boys and 10
girls) in the 1st or 2nd grade (around 6 and 7 years old, respectively) from
a bilingual elementary school in Los Angeles. The recorded speech consisted of
words containing front, mid, back, and diphthong vowel. There were four English
words: beat, bet, boot, and bite, and five Spanish words (with English meanings
in parentheses): calle (street), casa (house), quitar (to take out), taquito (taco)
and cuchillo (knife). All the words were familiar to the children.
Prior to the recording, the children could practice as many times as they wanted. Both text and audio prompts were available for each target word, and the children decided which prompts they needed during recording and which language to record first. There were three repetitions
for each word, and children spoke all the words in one language in a row with
3 seconds pause between words, and then repeated them. After they finished
the recordings in one language, there was about a one-minute pause before they
began the recordings in the other language. Recordings were made with 16 kHz
sampling rate and 16-bit resolution.
Like the English word bite [baIt] with a diphthong [aI], the Spanish words
calle [kaje] and cuchillo [kutSijo] had obvious F2 discontinuities. We used these
words with diphthongs to estimate Sg2 frequencies. Therefore, for each speaker,
there were three English tokens and six Spanish tokens for the Sg2 estimation.
3.4.2 Cross-content and Cross-language Variability
The within-speaker standard deviations were calculated on the Sg2 values esti-
mated from the six Spanish tokens for each speaker. The within-speaker coefficient of variation (COV) was also calculated; it can be viewed as a measure of the dispersion of a distribution. The COV was computed as the ratio of
the standard deviation to the mean Sg2 value for each speaker. As shown in Fig.
3.9, the within-speaker Sg2 standard deviations are around 20Hz and the COV is
less than 0.01. No significant difference in the COVs is observed between genders.
A similar trend is observed for the within-speaker Sg2 standard deviations calcu-
lated from the English tokens. Compared to the COV of formant frequencies [41],
which are usually around 0.10, the COV of Sg2 is about one order of magnitude smaller. The within-speaker Sg2 variability is therefore negligible, since it is sufficiently small compared to formant variabilities. This means that for a given speaker, Sg2 is relatively constant across content and repetition.
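The COV used here is just the ratio of standard deviation to mean over a speaker's per-token Sg2 estimates; for instance:

```python
import numpy as np

def coefficient_of_variation(sg2_estimates):
    """COV of a speaker's per-token Sg2 estimates: std / mean."""
    v = np.asarray(sg2_estimates, dtype=float)
    return float(v.std() / v.mean())
```

For per-token Sg2 values spread over a few tens of Hz around 1900 Hz, the COV comes out below 0.01, an order of magnitude below typical formant COVs.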
Since Sg2 frequency for a given speaker does not depend on context, it makes
sense to calculate the Sg2 COV for each speaker over the three English tokens
and six Spanish tokens and view that as the Sg2 cross-language variability, which
is plotted in Fig. 3.10. The cross-language Sg2 COVs are less than 0.01, and
there is no significant difference between genders. The cross-language COVs are
similar to the within-speaker COVs, indicating that the cross-language effects are
not significant for Sg2 frequencies and the Sg2 frequency for a given speaker is
independent of language.
Figure 3.9: Average within-speaker Sg2 standard deviations and the COVs
against contents and repetitions.
3.4.3 Implications of Sg2 Invariability
Because of its invariability across speech content and language, Sg2 is judged to
be applicable to speaker normalization. Since Sg2 is content-independent, it is
hypothesized that the performance of speaker normalization using Sg2 should be
robust and independent of the amount of adaptation data available. This would make the Sg2 normalization method well suited for adaptation with limited data, which is often the case in ASR applications.
On the other hand, the language-independent property of Sg2 makes cross-
language adaptation possible based on Sg2 normalization. Theoretically, with
Sg2 normalization acoustic models trained in one language could be adapted
with data in any other language, which may be useful in ASR applications for
Figure 3.10: Cross-language within-speaker COV of Sg2 for 10 boys and 10 girls.
second-language learning.
3.5 Experiments with Linear Frequency Warping
Similar to formant normalization, the warping ratio for Sg2 normalization is
defined as:
α = Sg2r/Sg2t (3.5)
where Sg2r is the reference Sg2 and Sg2t is the Sg2 of the test speaker. The
reference Sg2 is defined as the mean value of all the training speakers’ Sg2’s. The
Sg2 values are detected using the Sg2D2 algorithm. In this section, we evaluate
the content dependency of Sg2 normalization and also its use for cross-language
normalization. The simple linear frequency warping is applied in this section,
and nonlinear frequency warping will be addressed in the next section.
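Eq. (3.5) and the linear warp it induces amount to only a few lines; a sketch with illustrative function names:

```python
def sg2_warp_factor(sg2_ref, sg2_test):
    """Warping ratio of Eq. (3.5): reference Sg2 over the test speaker's Sg2."""
    return sg2_ref / sg2_test

def linear_warp(freqs_hz, alpha):
    """Apply the simple linear warp f -> alpha * f to a list of frequencies."""
    return [alpha * f for f in freqs_hz]
```

For example, a child with Sg2t = 1920 Hz normalized toward an adult-male reference of Sg2r = 1450 Hz gets alpha ≈ 0.755, compressing the child's spectrum toward the adult range.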
3.5.1 Comparison of VTLN and Sg2 Frequency Warping
Fig. 3.11 shows F1, F2 and F3 values from a nine-year-old girl before and after
warping using VTLN [16] and the Sg2 ratio. The line ‘Sg2’ is the reference second
subglottal resonance for an adult male speaker (as in Fig. 3.4). Compared to
Fig. 3.4, the unwarped data show a clearly different pattern in the relative positions of the formants with respect to the reference Sg2. For instance,
the back vowels [U] and [u] have higher F2 values than the reference Sg2, while
in Fig. 3.4 F2’s of all the back vowels lie below the Sg2 line. It is necessary
to apply frequency warping to achieve the reference formant position pattern.
Both VTLN (in circles) and Sg2 (in squares) warp the formants close to the
reference pattern, although Sg2 warping yields a formant pattern more similar to
the reference speaker’s.
To examine the effects of warping in more detail, Fig. 3.12 plots the reference
F1, F2 and F3 values versus the normalized values. It can be seen that Sg2
warping aligns the test speaker’s formants more closely to the reference speaker’s
formants (Fig. 3.4), as indicated by the proximity of the data points to the
diagonal line (with slope 1). In ASR such warping results in greatly reduced
spectral mismatch between test and reference speakers, and thus can lead to
better ASR performance.
3.5.2 Effectiveness of Sg2 Normalization
Since VTLN has been shown to provide significant performance improvement
on children’s speech recognition, the subglottal normalization method is first
evaluated on a connected digits recognition task of children’s speech using the
TIDIGITS database. Speech signals were segmented into 25ms frames, with a
10ms shift. Each frame was parameterized by a 39-dimensional feature vector
consisting of 12 static MFCCs plus log energy, and their first- and second-order derivatives.

Figure 3.11: Vowel formants F1 (·), F2 (+) and F3 (x) before and after VTLN (in circles) and Sg2-based (in squares) warping for a nine-year-old girl’s vowels. The lines ‘Sg1’, ‘Sg2’ and ‘Sg3’ are the reference subglottal resonances from the same speaker as in Fig. 3.4.

Acoustic HMMs were monophone-based with 3 states and 6 Gaussian
mixtures in each state. VTLN was implemented based on a grid search over
[0.7, 1.2] with a stepsize of 0.01. The scaling factor producing maximal average
likelihood was used to warp the frequency axis [16].
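The grid search just described can be sketched as follows, where loglik_fn is an assumed callable that scores the adaptation data warped with a given factor against the acoustic models:

```python
import numpy as np

def vtln_grid_search(loglik_fn, lo=0.7, hi=1.2, step=0.01):
    """Exhaustive VTLN search: return the warping factor in [lo, hi]
    (stepsize `step`) that maximizes the average likelihood."""
    alphas = np.arange(lo, hi + step / 2, step)
    return float(max(alphas, key=loglik_fn))
```

This exhaustive search is the computational cost VTLN pays relative to Sg2 normalization, whose main cost is a single formant-tracking pass.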
In this setup, acoustic models were trained on 55 adult male speakers and
tested on 50 children. The baseline word accuracy is 55.76%. For each child, the
adaptation data, which consisted of 1, 4, 7, 10, 13 or 16 digits, were randomly
chosen from the test subset to estimate the Sg2 and VTLN warping factors. For
comparison, the performance of manually measured Sg2 is also tested, which in
some sense can be viewed as the upper bound of this Sg2 normalization method.

Figure 3.12: Vowel formants F1 (·), F2 (+) and F3 (x) from the reference speaker (Fig. 3.4) versus those from the test speaker (Fig. 3.11) before and after warping (VTLN in circles, Sg2 in squares). The dotted line is y = x, which would mean a perfect match between reference and test speakers.
For each speaker, the manual Sg2 was measured from only diphthong words
containing obvious F2 discontinuities in the spectrum, and, independent of adap-
tation data, the same Sg2 value was applied for normalization. Fig. 3.13 shows
the recognition accuracy for VTLN, F3 and Sg2 warping with various amounts
of adaptation data, where Sg2M represents results using the manually measured
subglottal resonance.
When the amount of adaptation data is small, Sg2 normalization offers better
performance than VTLN. For instance, with only one digit for normalization, Sg2
normalization outperforms VTLN by more than 2%. VTLN outperforms Sg2D2 when more data is available, while Sg2M provides slightly better performance than VTLN even with 16 adaptation digits. The improvements of Sg2 normalization
over VTLN for up to 10 adaptation digits are statistically significant (p < 0.05).

Figure 3.13: Speaker normalization performance on TIDIGITS with various amounts of adaptation data.
Although automatic detection of Sg2 was fairly accurate, it was not exact and
there is thus a gap between the performance of the automatic detection method
and that of Sg2M. With more accurate Sg2 detection algorithms, we may expect performance closer to that of the manual Sg2.
3.5.3 Comparison of Vowel Content Dependency
As discussed in Section 3.3, Sg2 is not always detectable from acoustic signals, and Sg2 detectability in the adaptation data is important to the normalization process.
To investigate the content dependency of the detection algorithm Sg2D2, its normalization performance is evaluated on the TIDIGITS database with one adaptation
digit. For each child, the adaptation data were limited to only one digit but with
varying vowels from front vowel (e.g., [I] in six), central vowel (e.g., [2] in one),
back vowel (e.g., [u] in two) to diphthong (e.g., [aI] in five). The adaptation digits
were chosen such that F2 discontinuities, if any, come only from vowel contents
without any possible interferences from consonant-vowel transitions [63].
The performance comparison for VTLN, F3 and Sg2 normalization is shown
in Fig. 3.14. It can be seen that the choice of adaptation data can potentially
have an effect on the normalization performance for all three methods. Among
them, VTLN is least affected by the choice of adaptation data (the performance
standard deviation is 0.55), while F3 normalization is highly data dependent. The
performance of Sg2 normalization is less content sensitive compared to F3 nor-
malization, but more content dependent than VTLN. We expect that the content
dependency of Sg2 normalization will decrease with improved Sg2 detection algo-
rithms. In spite of its greater content dependency, on average Sg2 normalization
provides better performance than VTLN.
3.5.4 Performance on RM1 Database
Since the TIDIGITS setup is a highly mismatched case, the experiments above demonstrate the effectiveness of subglottal resonance-based speaker normalization under large mismatch. To further verify the effectiveness of this method, the performance is also tested on
a medium vocabulary recognition task using the DARPA Resource Management
RM1 continuous speech database. As a next step, the method is evaluated on
the RM1 database for both a medium-mismatched case and a matched case. Tri-
phone acoustic models were applied with 3 states and 4 Gaussian mixtures per
state using the same features as in the TIDIGITS experiments. For the mis-
matched case, HMM models were trained on 49 adult male speakers from the
speaker independent (SI) portion of the database, and tested on 23 adult female
speakers in the SI portion. The baseline word recognition accuracy was 59.10%.

Figure 3.14: Performance comparison of VTLN, F3 and Sg2D2 using one adaptation digit with various vowel contents.
For the regular test on RM1, the HMM models were trained on the SI training
portion of the database with 72 adult speakers, and tested on the SI testing set.
The baseline performance was 92.47% word recognition accuracy. In both cases,
the same utterance was used to estimate the Sg2 and VTLN warping factor for
all speakers. Table 3.3 shows the results.
For the mismatched case, Sg2 normalization provides better performance than
VTLN, with about 1.5% absolute improvement; this improvement is statistically
significant at p < 0.01. For the matched case, Sg2 normalization provides
performance comparable to that of VTLN. From a computational point of view,
Sg2 normalization is more efficient than VTLN, since VTLN relies on an
exhaustive grid search over the warping factors to maximize the likelihood of the
Table 3.3: Performance comparison (word recognition accuracy) on RM1 with
one adaptation utterance.

Method     mismatched   matched
Baseline     59.10       92.47
F3           79.01       92.58
VTLN         86.65       93.91
Sg2          88.37       94.05
adaptation data, while for Sg2 normalization the main computational cost comes
from formant tracking which can be done efficiently.
3.5.5 Cross-language Speaker Normalization
The language-independent property of Sg2 makes cross-language adaptation pos-
sible based on Sg2 normalization. In this experiment, training and test data
were in English, while the adaptation data were in either English or Spanish.
The warping factors were estimated from the adaptation data using Sg2D2 and
applied to the test data to warp the spectrum. English adaptation data were
collected for comparison.
The performance was evaluated on the Technology Based Assessment of Lan-
guage and Literacy (TBall) project database [77], and the English high frequency
words for 1st and 2nd grade students were used in the test. Monophone acoustic
models were trained on speech data from native English speakers. The test data
were from the same 20 speakers as in the ChildSE. The ChildSE utterances (only
one repetition) were used as adaptation data, and for each speaker there were
four English words and five Spanish words for adaptation.
The typical text-dependent VTLN method, which uses HMM recognizers to search
for the warping factor, is not well suited to this scenario, because decoding
Spanish speech with English phoneme models could itself introduce a systematic
error due to the different phonetic characteristics of the two languages. Instead,
for a fair and reasonable comparison, text-independent VTLN is applied, which
uses Gaussian mixture models (GMMs) for the warping factor search. A GMM
with 512 mixtures was trained on the English training set, and then applied to
calculate the likelihood for each warping factor in the range [0.8, 1.2] with a
step size of 0.01. The warping factor with the highest likelihood was chosen as
the VTLN warping factor. Compared to the text-dependent VTLN used in [78], this
text-independent method provides similar performance with English adaptation
data, but performs much better with Spanish adaptation data. The subglottal
resonance was estimated using Sg2D2 for each word, and the average was used as
the speaker's Sg2 frequency. The Sg2 warping factor was calculated using Eq. (3.5).
The normalization performance is shown in Table 3.4 for VTLN and Sg2 us-
ing English and Spanish adaptation data. When adaptation data are in English,
which is the same language as for the acoustic models, Sg2 normalization and
VTLN give comparably good results. For Spanish adaptation data, however,
the performance of VTLN degrades, while the performance of Sg2 normalization
remains similar as for English adaptation data. Sg2 normalization, therefore,
produces more robust results than VTLN when performing cross-language adap-
tation. The performance difference between using Sg2D2 and using VTLN is
statistically significant with Spanish adaptation data for p < 0.01.
Table 3.4: Performance comparison (word recognition accuracy) of VTLN and
Sg2 normalization using English (four words) and Spanish (five words) adaptation
data. The acoustic models were trained and tested using English data.
Method   English adaptation   Spanish adaptation
VTLN          86.61                82.35
Sg2           86.59                85.97
3.6 Nonlinear Frequency Warping
3.6.1 Mel-shift based Frequency Warping
Given a warping function W(f), the spectrum S(f) is transformed into
\[ S'(f) = S(W(f)) \tag{3.6} \]
where f is the frequency scale in Hz. For computational efficiency, W(f) usually
involves only one parameter, the warping factor α. A simple yet effective warping
function is a linear scaling function:
\[ W(f) = W_\alpha(f) = \alpha \cdot f \tag{3.7} \]
In conventional VTLN, the optimal warping factor is usually estimated using a
grid search to maximize the likelihood of warped observations given an acoustic
model λ:
\[ \alpha = \arg\max_{\alpha \in G} \sum_{r=1}^{R} \log p\left(O_r(W_\alpha(f)) \mid \lambda, s_r\right) \tag{3.8} \]
where \(s_r\) is the transcription of the r-th speech file \(O_r\), and G is the search grid.
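As a small numerical illustration of Eqs. (3.6)-(3.7), linear warping of a sampled spectrum can be implemented by interpolation. The function and signal below are illustrative, not part of the systems evaluated here.

```python
import numpy as np

def linear_warp_spectrum(spec, freqs, alpha):
    """Apply Eq. (3.6)-(3.7): S'(f) = S(alpha * f), by interpolation.

    spec  : magnitude spectrum sampled at `freqs` (Hz)
    alpha : linear warping factor
    Frequencies requested beyond the last bin are held at the edge value.
    """
    return np.interp(alpha * freqs, freqs, spec)

# Example: a spectral peak at 1000 Hz; alpha > 1 moves the peak
# down to 1000/alpha Hz in the warped spectrum.
freqs = np.linspace(0.0, 8000.0, 801)            # 10 Hz resolution
spec = np.exp(-((freqs - 1000.0) / 50.0) ** 2)   # toy peak at 1000 Hz
warped = linear_warp_spectrum(spec, freqs, 1.25)
peak_hz = freqs[np.argmax(warped)]               # near 1000 / 1.25 = 800 Hz
```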
Though widely used, the linear scaling model in Eq. (3.7) is known to be
a crude approximation of the way vocal tract variations affect the spectrum.
The warping factor between speakers has also been observed to be frequency
dependent [79].
Motivated by speech analysis, [79, 80] proposed a shift-based nonlinear frequency
warping, i.e., shifting the Mel scale upward or downward, which results in a
nonlinear warping in Hz. In contrast to the linear Wα(f), the warping function
is defined as:
\[ W_\alpha(z) = z + \alpha \tag{3.9} \]
where z is on the Mel scale³:
\[ z = \mathrm{Mel}(f) = 1127 \log\left(1 + \frac{f}{700}\right) \tag{3.10} \]
The Mel-shift function corresponds to a nonlinear relationship in Hz:
\[ f' = e^{\alpha/1127} \cdot f + 700\left(e^{\alpha/1127} - 1\right) \tag{3.11} \]
Similar to the linear warping method, the optimal warping factor α for shift-based
methods can be estimated using the ML criterion.
3.6.2 Bark-shift based Frequency Warping
In this section, a Bark-scale shift based warping function is investigated as defined
in Eq. (3.9), but where z is now on the Bark scale:
\[ z = \mathrm{Bark}(f) = 6 \log\left(\frac{f}{600} + \sqrt{\left(\frac{f}{600}\right)^2 + 1}\right) \tag{3.12} \]
Inserting Eq. (3.12) into Eq. (3.9), the frequency (Hz) domain relationship
corresponding to a Bark shift can be derived:
\[ 6 \log\left(\frac{f'}{600} + \sqrt{\left(\frac{f'}{600}\right)^2 + 1}\right) = 6 \log\left(\frac{f}{600} + \sqrt{\left(\frac{f}{600}\right)^2 + 1}\right) + \alpha \tag{3.13} \]
\[ f' = 300\,e^{\alpha/6}\left[\frac{f}{600} + \sqrt{\left(\frac{f}{600}\right)^2 + 1}\,\right] - \frac{300\,e^{-\alpha/6}}{\dfrac{f}{600} + \sqrt{\left(\dfrac{f}{600}\right)^2 + 1}} \tag{3.14} \]
³In [79], the coefficient 1127 is changed to 1. Throughout this paper, the standard Mel scale in Eq. (3.10) is used.
In general the relationship in Eq. (3.14) is nonlinear and complicated. However,
we can approximate Eq. (3.13) as:
\[ f' = \begin{cases} e^{\alpha/6} \cdot f, & \text{for } f \gg 600 \text{ Hz} \\ e^{\alpha/6} \cdot f + 600\left(e^{\alpha/6} - 1\right), & \text{for } f \ll 600 \text{ Hz} \end{cases} \tag{3.15} \]
For high frequencies (f ≫ 600 Hz), the Bark shift corresponds to a linear scaling
in Hz as in Eq. (3.7), while for low frequencies (f ≪ 600 Hz), the Bark shift
results in an affine relationship in Hz like the Mel shift (Eq. (3.11)). In general,
the Bark shift warping function stretches or compresses lower frequencies more
than higher frequencies.
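Noting that Eq. (3.12) is \(6\,\mathrm{arcsinh}(f/600)\), the Bark shift and its Hz-domain closed form (Eq. (3.14)) can be sketched and checked against the high-frequency approximation of Eq. (3.15). Function names are illustrative:

```python
import math

def bark(f):
    # Eq. (3.12); equivalent to 6 * asinh(f / 600)
    x = f / 600.0
    return 6.0 * math.log(x + math.sqrt(x * x + 1.0))

def bark_inv(z):
    # inverse of Eq. (3.12): f = 600 * sinh(z / 6)
    return 600.0 * math.sinh(z / 6.0)

def bark_shift_hz(f, alpha):
    """Hz-domain image of a Bark shift (Eq. (3.9) with z in Bark);
    algebraically identical to the closed form of Eq. (3.14)."""
    return bark_inv(bark(f) + alpha)

# For f >> 600 Hz the shift behaves like linear scaling by exp(alpha/6),
# the high-frequency branch of Eq. (3.15):
approx = math.exp(0.6 / 6.0) * 6000.0
exact = bark_shift_hz(6000.0, 0.6)
```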
To preserve the frequency bandwidth after warping, a piecewise nonlinear
warping function, shown in Fig. 3.15, is applied such that the lower boundary
frequency \(f_{\min}\) (or \(z_{\min}\)) and the upper boundary frequency \(f_{\max}\) (or \(z_{\max}\)) are
always mapped to themselves, i.e.,
\[ W_\alpha(z) = \begin{cases} \dfrac{z_l + \alpha - z_{\min}}{z_l - z_{\min}}\,(z - z_{\min}) + z_{\min}, & \text{if } z \le z_l \\[6pt] z + \alpha, & \text{if } z_l < z < z_u \\[6pt] \dfrac{z_{\max} - z_u - \alpha}{z_{\max} - z_u}\,(z - z_u) + z_u + \alpha, & \text{if } z \ge z_u \end{cases} \tag{3.16} \]
This Bark-shift based piecewise nonlinear warping function differs from previous
Bark-scale based approaches [81-83] in two respects. First, the previous
methods modify the Hz-Bark conversion formula directly, which makes them
difficult to implement in a uniform filter bank analysis framework. In contrast,
the proposed method can be easily implemented by modifying the filter bank
analysis, for computational efficiency. Second, and most important, the
piecewise function in Eq. (3.16) compensates for bandwidth mismatch, while
the warping functions in [81-83] change the frequency bandwidth, which results in
[plot: warped scale (Bark) vs. original scale (Bark) from z_min to z_max, with
breakpoints z_l and z_u; curves for α > 0, α = 0, and α < 0]
Figure 3.15: Piecewise Bark shift warping function, where α > 0 shifts the Bark
scale upward, α < 0 shifts downward, and α = 0 means no warping.
information loss at the boundaries. Preserving bandwidth is more important for
nonlinear frequency warping than for linear frequency warping, because a one-unit
shift on the Mel or Bark scale can correspond to a deviation of hundreds of Hz on
the linear scale, due to the Mel and Bark nonlinearities.
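The piecewise function of Eq. (3.16) is straightforward to implement; a minimal sketch on an arbitrary auditory scale (Mel or Bark), with illustrative parameter values:

```python
def piecewise_shift(z, alpha, z_min, z_max, z_l, z_u):
    """Bandwidth-preserving piecewise shift of Eq. (3.16): z_min and
    z_max map to themselves, the middle band [z_l, z_u] is shifted by
    alpha, and the two edge bands are linearly stretched or compressed
    to absorb the shift."""
    if z <= z_l:
        return (z_l + alpha - z_min) / (z_l - z_min) * (z - z_min) + z_min
    if z >= z_u:
        return (z_max - z_u - alpha) / (z_max - z_u) * (z - z_u) + z_u + alpha
    return z + alpha

# Endpoints stay fixed while the middle band shifts by alpha:
lo = piecewise_shift(0.0, 0.5, 0.0, 20.0, 2.0, 18.0)
hi = piecewise_shift(20.0, 0.5, 0.0, 20.0, 2.0, 18.0)
mid = piecewise_shift(10.0, 0.5, 0.0, 20.0, 2.0, 18.0)
```

Evaluating the three branches at the breakpoints confirms the function is continuous and maps \(z_{\min}\) and \(z_{\max}\) to themselves.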
3.7 Experiments with Nonlinear Frequency Warping
3.7.1 Sg2 based Nonlinear Frequency Warping
The automatically estimated Sg2 has been applied to linear frequency warping
and shown to be promising. Here, that work is extended to nonlinear speaker
normalization. Given the Sg2 value for a test speaker, Sg2tst, and a reference Sg2
value Sg2ref , which is the average Sg2 value over training speakers, the warping
factor α is calculated as:
\[ \alpha = \begin{cases} Sg2_{\mathrm{ref}} / Sg2_{\mathrm{tst}}, & \text{for linear scaling} \\ \mathrm{Mel}(Sg2_{\mathrm{ref}}) - \mathrm{Mel}(Sg2_{\mathrm{tst}}), & \text{for Mel shift} \\ \mathrm{Bark}(Sg2_{\mathrm{ref}}) - \mathrm{Bark}(Sg2_{\mathrm{tst}}), & \text{for Bark shift} \end{cases} \tag{3.17} \]
The ML-based speaker normalization method (Eq. (3.8)) involves an exhaus-
tive grid search to find an optimal warping factor in the ML sense, which is time
consuming and requires a certain amount of data to be effective. In contrast,
the main computational cost for Sg2-based normalization methods comes from
F2 and E2 tracking based on LPC analysis, which can be done efficiently. Since
Sg2 has been shown to be content independent and remains constant for a given
speaker [65,78], Sg2 estimation doesn’t require large amounts of data, and theo-
retically a few words, or even one word if carefully chosen4, would be sufficient.
Therefore, compared to ML-based normalization methods, Sg2-based normaliza-
tion methods are computationally more efficient and require less data, which is
desirable for rapid speaker normalization with limited enrollment data.
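Eq. (3.17) reduces warping-factor estimation to two scalar conversions; a sketch with hypothetical Sg2 values (the 1400/1600 Hz figures below are illustrative, not measurements from the databases used here):

```python
import math

def mel(f):
    return 1127.0 * math.log(1.0 + f / 700.0)

def bark(f):
    x = f / 600.0
    return 6.0 * math.log(x + math.sqrt(x * x + 1.0))

def sg2_warp_factors(sg2_ref, sg2_tst):
    """Eq. (3.17): warping factors for all three warping functions,
    computed from a single Sg2 estimate per speaker."""
    return {
        "linear": sg2_ref / sg2_tst,
        "mel":    mel(sg2_ref) - mel(sg2_tst),
        "bark":   bark(sg2_ref) - bark(sg2_tst),
    }

# Hypothetical adult-male reference Sg2 vs. a child's higher Sg2:
factors = sg2_warp_factors(1400.0, 1600.0)
```

No grid search is involved: once Sg2 is tracked, all three factors follow in closed form, which is the source of the method's efficiency.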
3.7.2 Experimental Setup
For computational efficiency, all normalization methods are implemented by mod-
ifying the Mel or Bark filter bank analysis, instead of warping the power spectrum.
MFCC features are used for Mel shift, and PLPCC features are used for Bark
shift. PLP features can also be computed from a Mel filter bank front end;
preliminary experiments showed that, for the baseline system, Mel-PLP performs
slightly better than Bark-scale PLP and standard MFCC. However, the improvement is
not significant, and since feature comparison is not the primary interest of this
⁴For the most reliable estimation, the Sg2 detector requires an F2 transition crossing Sg2, e.g., as in the diphthong /ai/.
work, standard MFCC and Bark-scale PLP were used in all experiments. Here
the focus is on the comparison of linear vs. nonlinear warping functions, and ML
vs. Sg2 based normalization. For fair comparisons, all experiments (both linear
and nonlinear) use piecewise warping functions with the same cut-off frequencies.
It is also important to use a consistent framework when conducting the com-
parison of ML-based linear vs. nonlinear normalization, i.e., the search grids
should be equivalent. This means the grid size should be the same and within an
appropriate range to ensure that the linear and nonlinear warped spectra cover
roughly the same frequency range. For the linear warping function, a grid of 21
search points is used with a step size of 0.01. According to Eq. (3.15) and Eq.
(3.11), a step size of 0.01 in linear scaling roughly corresponds to a shift of
0.07 on the Bark scale, or a shift of 10 on the Mel scale.
The performance of different normalization methods is evaluated on children’s
ASR, where speaker normalization has been shown to provide significant perfor-
mance improvement. Two databases are used: one is the TIDIGITS database on
connected digits, and the other is the TBall database on high frequency words
(HFW) and basic phonic skills test (BPST) words [77]. For the two databases,
speech signals were segmented into 25ms frames, with a 10ms shift. Each frame
was parameterized by a 39-dimensional feature vector consisting of 12 static
MFCC (or PLPCC) plus log energy, and their first- and second-order derivatives.
Cepstral mean subtraction (CMS) is applied in all cases. Throughout this
section, word error rate (WER) is used for performance evaluation.
Monophone-based acoustic models were used with 3 states and 6 Gaussian
mixtures in each state. In the TIDIGITS experiments, acoustic models were
trained on 55 adult male speakers and tested on 50 children. The baseline WER is
37.63% using MFCC features and 37.47% using PLPCC features. For each child,
the normalization data, which consisted of 1, 4, 7, 10 or 15 digits, were randomly
chosen from the test subset to estimate Sg2 and the ML-based warping factors.
The ML search grid is [0.8, 1.0] for linear scaling, [-1.4, 0.0] for Bark shifting, and
[-200, 0.0] for Mel shifting. In the TBall database, 55 HFW words and 55 BPST
words were collected from 189 children in grades 1 or 2. Around two-thirds of
the children (120) were used for training, and the remaining third for testing. The
baseline WER is 7.75% using MFCC features and 8.35% using PLPCC features.
Three randomly chosen words (including at least one diphthong word) were used
for normalization. The ML search grid is [0.9, 1.1] for linear scaling, [-0.7, 0.7]
for Bark shifting, and [-100, 100] for Mel shifting. For comparison, the Bark
offset method in [82] was also evaluated using PLPCC features. All experiments
were performed in an unsupervised way, and the recognition output from the
baseline models (without normalization) was used as transcription during ML
grid searching.
3.7.3 Experimental Results
Tables 3.5 and 3.6 show results on TIDIGITS with various amounts of normal-
ization data for MFCC and PLPCC features, respectively. LS-ML means linear
scaling with ML-based warping factor estimation; LS-Sg2 means linear scaling
with Sg2-based warping factor estimation; MS represents Mel-shift based non-
linear warping; BS is Bark-shift based nonlinear warping; BO-ML is the method
in [82] using ML grid search.
For ML-based warping methods, comparing LS vs. MS for MFCC (rows 1 and
2 in Table 3.5) and LS vs. BS for PLPCC features (rows 1 and 2 in Table 3.6), it
can be seen that nonlinear frequency warping provides better performance than
linear warping in all conditions, which is in agreement with the literature. Due to
the bandwidth compensation, the proposed piecewise Bark shift method (BS-ML)
outperforms BO-ML except for the case of one normalization digit.
Compared to ML-based methods, Sg2 normalization performs significantly
better for up to seven normalization digits with all three warping functions (LS,
MS, and BS). With more data, ML-based methods tend to produce comparable or
superior performance, though for the case of Bark shift (BS-ML vs. BS-Sg2,
rows 3 and 5 in Table 3.6), Sg2 outperforms ML in all testing conditions for up
to 15 digits. Similar performance trends are observed on TBall data in Table 3.7.
Table 3.5: WER on TIDIGITS using MFCC features with varying normalization
data from 1 to 15 digits.
Warping 1 4 7 10 15
LS-ML 7.48 6.34 5.42 4.99 4.91
MS-ML 6.33 5.47 4.48 4.11 4.08
LS-Sg2 6.11 5.57 5.05 5.07 5.03
MS-Sg2 5.29 4.81 4.05 4.13 3.99
Table 3.6: WER on TIDIGITS using PLPCC features with varying normalization
data from 1 to 15 digits.
Warping 1 4 7 10 15
LS-ML 7.62 6.90 5.78 5.64 5.25
BS-ML 6.21 5.63 4.56 4.30 4.13
BO-ML 6.00 5.94 5.33 4.96 4.65
LS-Sg2 6.15 5.71 5.51 5.47 5.39
BS-Sg2 5.17 4.76 4.09 4.11 4.05
Table 3.7: WER on TBall children’s data using MFCC and PLPCC features with
3 normalization words.
MFCC PLPCC
Warping WER Warping WER
LS-ML 6.86 LS-ML 6.99
MS-ML 5.91 BS-ML 5.82
- - BO-ML 6.08
LS-Sg2 6.10 LS-Sg2 6.33
MS-Sg2 4.89 BS-Sg2 4.71
3.8 Summary and Discussion
This chapter presents a reliable algorithm for estimating the second subglottal
resonance (Sg2) from acoustic signals. The algorithm provides Sg2 estimates
which are very close to actual Sg2 values as determined from direct measurements
using accelerometer data. With the proposed algorithm, the variation of Sg2
across content and language was investigated on children's data in English
and Spanish. The analysis shows that, for a given speaker, the second subglottal
resonance does not appear to vary across speech sounds, repetitions, or even
languages. Based on these observations, a speaker normalization method
is proposed using the second subglottal resonance. This normalization method
defines the warping factor as the ratio of the reference subglottal resonance over
that of the test speaker.
A variety of evaluations show that second subglottal resonance normalization
performs better than, or comparably to, VTLN, especially with limited adaptation
data. An obvious advantage of this method is that the subglottal resonances
remain roughly constant for a given speaker. The method is thus potentially
independent of the amount of available adaptation data, which makes it suitable
for adaptation with limited data.
Cross-language experimental results show that Sg2 normalization is more ro-
bust across languages than VTLN, and no significant performance variations are
observed for Sg2 when the adaptation data are changed from English to Span-
ish. The fact that Sg2 is independent of language should make it possible to
adapt acoustic models with available data from any language. The method is
also computationally more efficient than VTLN.
The Sg2 variations found in this work are similar to what has been reported
elsewhere. However, given the small number of subglottal resonance studies, more
data may need to be collected and analyzed in order to refine the characterization
of subglottal resonance variability. Future work is required to further improve
the accuracy of the Sg2 detector, evaluate the effectiveness of this method on a
large vocabulary database, and test the performance in noisy conditions.
CHAPTER 4
Automatic Evaluation of Children’s
Language Learning Skills
Increasing attention has been devoted to applying automatic speech recognition
techniques to children’s speech for educational purposes. Many automatic assess-
ment, tutoring, and computer aided language learning (CALL) systems have been
developed. This chapter describes an automated evaluation system developed in
response to the growing need for reliable and objective reading assessments in
schools. The system applies disfluency detection and Spanish accent detection
together with speech recognition to evaluate children's language learning skills.
4.1 Technology-based Assessment of Language and Literacy
The technology-based assessment of language and literacy (TBall) project [39]
was designed to automatically evaluate English language learning and literacy
skills of predominantly Mexican-American children in grades K-2 (ages 5-7 years).
The goal is to use classroom-based assessments to inform reading instruction, en-
abling teachers to gather data about a large number of discrete skills including
phonological awareness, alphabet knowledge, word decoding, fluency, vocabulary,
and comprehension skills. A system designed to robustly meet these broad de-
mands must make use of multiple information sources when eliciting responses
from the children, automatically processing these responses, and reporting as-
sessment scores to the teachers.
The TBall system provides teachers of grades K-2 with a tool that allows
them to efficiently gather data about their students’ language skills from reliable
classroom-based assessments in order to plan individualized instruction tailored
for each child’s needs. The developed system consists of three main parts:
1. A multimedia student interface to present stimuli in audio, text, and graph-
ics, and to collect data over various sources and modes.
2. An assessment module using ASR to analyze and score the students’ re-
sponses in a reliable, fair and efficient manner.
3. A teacher interface to monitor students' progress and to build a queryable
database of students, groups, and classes.
4.2 Blending Tasks and Database Collections
4.2.1 Blending Tasks for Phonemic Awareness
A critical component of the TBall project is assessment of phonemic awareness
because of its key role in reading and writing, especially for the targeted age
group. Phonemic awareness is related to developing reading and writing skills,
and is important for children to master to become proficient readers. It can be
assessed through oral segmenting and blending tasks at various linguistic levels.
Examples of blending and segmentation tasks are shown in Tables 4.1 and 4.2,
respectively. Here the primary focus is on phoneme blending, onset-rhyme
blending, and syllable blending. The blending tasks assess both the pronunciation
accuracy and the smoothness of the target words. In the tasks, audio prompts present
phonemes, onset-rhymes, or syllables separately, and a child is asked to orally
blend them into a whole word. A child is considered proficient in the tasks
provided that:
• The child reproduces all the sounds of the original prompts (phonemes,
onset-rhymes, or syllables) in the final word.
• The child can smoothly blend the prompts together to make one word.
Table 4.1: An example of the TBall blending tasks: audio prompts are presented
and a child is asked to orally blend them into a whole word. A one-second silence
(SIL) is used within the prompts to separate each sound.
Blending task Audio Prompt Target
Phonemes /hh/ SIL /ae/ SIL /ch/ hatch
Onset-rhyme /r/ SIL /ae m p/ (r+amp) ramp
Syllables /p eh p/ SIL /t I k/ (pep+tic) peptic
Table 4.2: An example of the TBall segmentation tasks: audio prompts are
presented and a child is asked to orally segment them into parts.
Segmentation task Audio Prompt Target
Phonemes chime ch + i + me
Onset-rhyme shake sh + ake
Syllables station sta + tion
4.2.2 Database Collections
The speech corpus was collected in five Kindergarten classrooms in Los Angeles.
The schools were carefully chosen to provide balanced data from children whose
native language was either English or Mexican Spanish. Each blending task has
eight words, most of which are unfamiliar to young children. Table 4.3 lists
the target words for each blending task. By choosing such unfamiliar words, the
intention is to reduce the likelihood that a child could guess the target answer
without focusing on blending the components.
Before the actual recording started, children first practiced on examples to
become familiar with the task, and decided themselves when they were ready to
start recording. During data collection, a three-second timer limited the pause
between the prompt and the answer: if a child did not respond within 3 s of the
prompt, the prompt for the next word was presented. A total of 193 children
were recorded, and Table 4.4 shows
the distribution of children by native language and gender. The database was
roughly gender-balanced and also language-balanced.
Table 4.3: Target words for the blending tasks.
Blending task Target words
Phonemes pick, fan, ship, cash, lack, fad, shin, hatch
Onset-rhyme pot, mat, gum, shine, ramp, nit, chad, shape
Syllables bamboo, napkin, nova, peptic, stable, table, wafer, window
Table 4.4: Speaker distribution by native language and gender.
Native language English Spanish Unknown
Boy 38 43 11
Girl 41 47 13
Total 79 90 24
4.3 Human Evaluations and Discussions
4.3.1 Web-based Teacher’s Assessment
In our previous work [40], it was found that evaluations based on several words
from a speaker are more reliable than those based on a single word, since the more
speech from a child the rater hears, the more familiar the rater becomes with the
system of contrasts used by the child. For example, hearing a child say wick
for rick may indicate an articulation issue and not a phonemic awareness issue.
Therefore, in the web-based teacher’s assessment, audio samples were grouped by
speaker to allow teachers to apply speaker-specific information (dialect or accent,
speaking-rate, etc.) for judgment adaptation. Such speaker-specific information,
however, may lead to biased evaluations, since the perception of dialect or
accent is highly subjective and may differ from rater to rater.
Teachers assessed both pronunciation accuracy and smoothness by responding
to the following questions:
• Are the sounds correctly pronounced? (accuracy)
• Are the sounds smoothly blended? (smoothness)
• Is the final word acceptable? (overall)
For each question, two choices were presented to classify the quality: acceptable
or unacceptable. Teachers also provided comments for their decisions.
4.3.2 Inter-correlation of the Assessment
Assessments from nine teachers were collected to calculate the inter-correlation
between evaluators. As shown in Table 4.5, teacher evaluations are reasonably
consistent for the three tasks. The inter-correlations in evaluating the overall
quality are similar for all the tasks: about 85%. The inter-correlations on
accuracy evaluations are significantly higher than those on smoothness. This is
because, compared to pronunciation accuracy, smoothness evaluation is more
subjective, especially for short utterances. However, smoothness may be more
important than accuracy in the blending task, since smooth blending is the goal
of a blending assessment. In any case, it is an orthogonal judgment, because a
word can be smooth and accurate, not smooth but accurate, smooth but
inaccurate, or neither smooth nor accurate. Of the three tasks, phoneme blending
is the most difficult for children and draws the most disagreement among
teachers, while syllable blending is relatively easy.
Table 4.5: Average inter-evaluator correlation on pronunciation accuracy,
smoothness and overall evaluations on three blending tasks.
Blending task Accuracy Smoothness Overall
Phonemes 87.6 80.8 83.3
Onset-rhyme 91.3 82.4 84.1
Syllables 97.5 85.3 86.7
4.3.3 Discussions on the Blending Target Words
From teachers’ comments, it is found that children’s background knowledge of
the task words greatly affects their performance. For unfamiliar target words, it
usually takes longer for a child to give the answer. For example, many children
are unfamiliar with peptic and with the unusual occurrence of /p/ and /t/
sounds together. In this case, there will typically be long pauses between the
end of a prompt and a child’s answer, and also between the two syllables to be
blended.
Another issue is for confusable target words: children tend to pronounce them
incorrectly but blend them smoothly, and thus show “strong blending” skills. For
the word stable many children pronounced it as staple because the two words
are very confusable especially when spoken in isolation without any context.
The confusion is particularly strong for Hispanic children learning English, since
Spanish /p/ can be acoustically similar to English /b/.
There are also some “language-driven” errors. That is, substitution or dele-
tion/insertion errors can occur when the syllables to be blended do not exist in
the child’s native language. For example, children from Spanish linguistic back-
grounds tended to pronounce the word stable as estable or estaple because no
words begin with the sound sp in Spanish and they always have a vowel preceding
the consonant cluster, such as the Spanish words Espana or esperanza.
To be consistent with the goals of the blending task, the final decision is based
on both pronunciation correctness and blending smoothness; i.e., a word is
acceptable only when both the pronunciation accuracy and the blending
smoothness are acceptable.
4.4 Automatic Evaluation System
4.4.1 Overall System Flowchart
The automatic evaluation system to measure children’s performance on the blend-
ing tasks consists of four core components: disfluency detection, accent detection,
accuracy assessment and smoothness assessment. The system flowchart is shown
in Fig. 4.1, with detailed descriptions in subsequent sections. Disfluency detec-
tion uses the partial-word recognizer to filter out disfluent phenomena such as
false-starts, sounding out, self-repair and repetitions, and to localize the target
answer. Accent detection is then applied to the target word to detect possible
non-native English pronunciations. The accent information is then used to up-
date the pronunciation dictionaries and duration ratio models. Normalized log
likelihood and duration ratio scores are used to measure accuracy and smooth-
ness, respectively. These two scores are combined together to get the final result.
Since the tasks are designed to evaluate a child’s language learning skills
based on responses to audio prompts, prior information of the expected answer is
available for use in ASR. Hence, the automatic system can work in a supervised
mode and exploit knowledge-based information derived by linguistic experts for
better and more reliable performance.
Figure 4.1: Flowchart of the automatic evaluation system for the blending tasks.
4.4.2 Disfluency Detection
Generally speaking, disfluencies include everything spoken by the child that
disrupts the natural flow of the target word pronunciation. Typical disfluencies
found in our data are: fillers such as uhhh or ummm; partial- and/or full-word
repetitions, where syllables, phonemes, or a whole word are repeated;
self-corrections; long pauses within a word; and elongations, where syllables
or phonemes (usually the first) are lengthened. These last two disfluencies
(pauses and elongations) are related to the smoothness measure, and will be
addressed using duration ratio models.
Disfluency detection, the first stage of the system, is used mainly to filter out
fillers and repetitions, and to obtain the approximate beginning and ending times
of the target answer. If the target word is repeated several times, only the
last repetition is used for further evaluation, in order to be consistent with
teachers' decision-making protocols, in which only the last answer counts.
A partial-word recognizer (PWR) [48] is used to detect disfluency with sub-
word units derived from the dictionary based on the task; sub-word units are
phonemes, onsets or rhymes, or syllables depending on the blending task. An
example of the detection network is shown in Fig. 4.2 for a syllable blending
word peptic. A background/garbage model is used to consume background
noises, fillers and out-of-vocabulary words. Long pauses are allowed between
sub-word units. The PWR can be bypassed to whole word recognition (WWR)
for disfluency-free speech. The WWR is a regular phoneme-based recognizer ex-
cept that it allows repetitions. WWR can also be bypassed for the case where
the child does not make an attempt to say the target whole word.
For computational efficiency, only one canonical dictionary pronunciation is
used to generate sub-word units, and no accented alternatives are taken into
account at this stage. This is reasonable because here the disfluency detector is
mainly used to localize the target answer of interest (not score it). Evaluated
on a subset of the blending tasks data, the disfluency detector is able to filter
out around 85% of the disfluent miscues. The subsequent process detects accent
and uses that information to choose the pronunciation dictionary and duration
models.
Figure 4.2: An example of the disfluency detection network for a syllable blending
task word ‘peptic’, where START and END are the network enter and exit points,
respectively.
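The token sequences such a network accepts can be sketched loosely as a pattern over recognized symbols. This is only an illustration of the network topology in Fig. 4.2, not the HMM-based PWR itself; the function name and the GAR/SIL symbols are illustrative, and the pattern deliberately over-accepts (e.g. it does not enforce sub-word order).

```python
import re

def blend_network_pattern(units, whole):
    """Accept sequences of garbage (GAR), pauses (SIL), possibly
    repeated sub-word attempts, and an optional whole-word attempt,
    mirroring the loops of the detection network."""
    unit = "(?:%s)" % "|".join(re.escape(u) for u in units)
    tok = "(?:GAR|SIL|%s|%s)" % (unit, re.escape(whole))
    return re.compile(r"^(?:%s )*%s$" % (tok, tok))

net = blend_network_pattern(["pep", "tic"], "peptic")

# A disfluent attempt: a filler, a repeated syllable, then the whole word.
hyp = "GAR pep SIL pep tic SIL peptic"
accepted = bool(net.match(hyp))
```

In the real system, the last whole-word attempt accepted by the network is what gets passed on to accent detection and scoring.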
4.4.3 Accent Detection
The TBall data used here were collected from children with multi-lingual back-
grounds, and thus contain foreign accented (mainly Spanish accented) English.
An example of pronunciation variation for Spanish-accented English is the re-
placement of /dh/ (there) with /d/ (dare), since /dh/ does not exist in Spanish.
The pronunciation patterns of Spanish-accented English can be predicted from
theories of second language learning. The phonological rules are divided into three
categories: consonant changes, vowel changes, and syllable structure changes
(insertions or deletions of consonants or vowels).
Possible consonant changes include final obstruent devoicing, interdental frica-
tive change, palatalization, retroflexing, alveolar approximate change, dentaliza-
tion and labialization, etc. In particular, the following list (similar to that in [84]
and the online resource the speech accent archive [85]) enumerates pronunciation
patterns that are within the phonemic coverage of our analysis database.
• /v/ (vat) → /f/ (fat), because /v/ doesn’t exist in Spanish, and /f/ is
acoustically similar to /v/.
• /z/ (zoo) → /s/ (sat), because /z/ doesn’t exist in Spanish, and /s/ is
acoustically similar to /z/.
• /dh/ (that) → /d/ (debt), because /dh/ doesn’t exist in Spanish, and /dh/
is acoustically close to the Spanish /d/.
• /th/ (thing) → /t/ (ten), because /th/ doesn’t exist in Spanish, and /th/
is acoustically close to the Spanish /t/.
• /r/ (rent) → trill: /r/ is pronounced with a rolling (trilled) sound.
• Unaspirated /p/ (pet), /t/ (ten), /k/ (kit): pronounced without aspiration.
Possible vowel changes include vowel shortening, vowel raising, and vowel
lowering. Potential pronunciation patterns are summarized as follows:
• /ih/ (bit) → /iy/ (beat), because /ih/ doesn’t exist in Spanish, and /iy/ is
acoustically close to /ih/.
• /ae/ (bat) → /eh/ (bet), because /ae/ doesn’t exist in Spanish.
• /uh/ (book) → /uw/ (boot), because /uh/ doesn’t exist in Spanish.
• /ah/ (but) → /eh/ (bet), because /ah/ doesn’t exist in Spanish.
Possible syllable structure changes include vowel insertion, consonant deletion
(/r/ deletion), and consonant insertion. Two such patterns are observed in our
analysis database: /er/ (bird) → /aa r/ (are), and /ao r/ (four) → /aa/ (Bob).
Since the focus of this work is on Spanish-accented English from young children
aged 5-7 years, while the language learning theories are mainly based on adult
subjects, the pronunciation patterns for children may differ from what the
theories predict. Therefore, a pronunciation variation study was carried out
on the TBall database [84], with the main results summarized
Table 4.6: Pronunciation variant analysis for consonants and vowels on a
Spanish-accented English database, with the percentage of occurrence in the
analysis database shown in parentheses. Entries with a trailing asterisk are
change patterns not predicted by theory.
Consonant changes Vowel changes
/z/ → /s/ (73.6) /ih/ → /iy/ (33.4)
/th/ → /d/ (34.6)* /uh/ → /uw/ (32.8)
/dh/ → /d/ (29.7) /ae/ → /eh/ (11.7)
/d/ → /t/ (22.4)* /ah/ → /eh/ (10.1)
/ch/ → /sh/ (22.2)*
/v/ → /f/ (21.6)
/ng/ → /n/ (17.0)*
in Table 4.6. These analysis results confirm most of the predicted hypotheses,
but also reveal some new patterns. Only patterns with an occurrence probability
larger than 10% in the database are listed. Some predicted pronunciation
variation hypotheses are not observed in the analysis database, including the
consonant changes trilled /r/ and unaspirated /p/, /t/, /k/. The consonant
change /th/ → /t/ does occur in the database, but with a rather low probability
(9.2%), while another variant of /th/, /th/ → /d/, occurs with a much higher
probability (34.6%). The hypothesized syllable structure changes (phoneme
insertion and deletion) occur with a very low probability (<0.1%) and thus may
not generalize well. See [84] for a complete list of patterns and more detailed
explanations.
Based on the analysis, an algorithm is developed to automatically detect
Spanish accent. Given the pronunciation variation patterns, a simple but effec-
tive measure for accent detection is the occurrence ratio of such patterns in an
utterance, defined as:
    R_{ph1|ph2} = C(ph2 → ph1) / C(ph2)                              (4.1)
where R_{ph1|ph2} is the occurrence ratio of the pronunciation change pattern
from one phoneme (ph2) to another phoneme (ph1), denoted {ph2 → ph1}; C(ph2)
is the occurrence count (OC) of ph2, and C(ph2 → ph1) is the OC of the pattern
{ph2 → ph1}. Since the system runs in a supervised mode with transcriptions
available, the OCs can be easily computed through forced alignment, first with
a canonical pronunciation dictionary and then with an accented pronunciation
dictionary. The two alignment outputs are compared to obtain the OC of each
pattern.
The average value of all occurring pattern ratios is a measure of the overall
accent of a speaker, i.e.,
    R = (1/M) · ∑_{ph2→ph1 ∈ P} R_{ph1|ph2}                          (4.2)

where P represents the set of valid pronunciation change patterns, and M is the
total number of patterns occurring in the utterance. To make the estimates
reliable, patterns whose occurrence count C(ph2) is below a threshold of 3 are
not included in the calculation.
The speaker-level accent measure in Eq. (4.2) treats all pronunciation change
patterns equally. A statistical analysis of the data, however, shows that the
patterns do not occur with the same probability: some occur much more
frequently than others. For example, the pattern {/z/→/s/} occurs with a
probability of 73.6%, while the pattern {/v/→/f/} occurs only 21.6% of the
time. The occurrence probabilities can be viewed as the correlation between
each pattern and the overall accent: the higher the probability, the more
related the pattern is to accent.
To take this into account, Eq. (4.2) is modified as:

    R = ∑_{ph2→ph1 ∈ P} p(ph2 → ph1) · R_{ph1|ph2}                   (4.3)

where p(ph2 → ph1) is the probability of pattern {ph2 → ph1}, normalized so
that the pattern probabilities sum to 1.
The accent score from Eq. (4.3) is used to classify a speaker's accent level:
the higher the score, the more accented the speech. Since our database does not
include accent-level labels, a binary decision is made as to whether a speaker
is Spanish-accented. Given a threshold Ta (0.6 in the following experiments),
a speaker is classified as Spanish-accented if the score R exceeds Ta. The
accent detector achieved 83% accuracy on an evaluation dataset labeled for
accent.
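The ratio-based detector of Eqs. (4.1)-(4.3) can be sketched as follows. The function and variable names are illustrative assumptions (the text does not specify an implementation); only the minimum-count filter of 3, the weighting by pattern probability, and the threshold Ta = 0.6 come from the description above.

```python
# Hypothetical sketch of the accent detector of Eqs. (4.1)-(4.3).
# Names are illustrative; only the count filter, probability weighting,
# and the threshold Ta = 0.6 come from the text.

MIN_COUNT = 3       # patterns with C(ph2) below this are excluded
THRESHOLD_TA = 0.6  # binary decision threshold

def accent_score(counts, pattern_probs):
    """Weighted accent score R of Eq. (4.3).

    counts:        dict mapping (ph2, ph1) -> (C(ph2 -> ph1), C(ph2)),
                   obtained by comparing canonical and accented alignments.
    pattern_probs: dict mapping (ph2, ph1) -> pattern probability,
                   normalized so that the probabilities sum to 1.
    """
    score = 0.0
    for pattern, (c_change, c_total) in counts.items():
        if c_total < MIN_COUNT:
            continue                             # too few observations
        ratio = c_change / c_total               # Eq. (4.1)
        score += pattern_probs[pattern] * ratio  # term of Eq. (4.3)
    return score

def is_spanish_accented(counts, pattern_probs, ta=THRESHOLD_TA):
    """Binary decision: Spanish-accented if R exceeds Ta."""
    return accent_score(counts, pattern_probs) > ta
```

Patterns with too few occurrences contribute nothing, so a short utterance with one or two accidental substitutions is unlikely to cross the threshold.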
4.4.4 Pronunciation Dictionary
The dictionary used in accuracy assessment needs to account for possible
pronunciation variations. Besides the canonical pronunciation for each word,
the dictionary also contains entries for non-canonical but correct
pronunciations (common among children) from the dialects spoken in the Los
Angeles area. For example, many speakers do not distinguish cot and caught,
pronouncing both as /k aa t/; therefore, /k aa t/ and /k ao t/ are both
considered correct pronunciations. The dictionary also includes iy/ih
alternations, since Spanish learners of English often do not separate these
two vowels well. Hispanic letter-to-sound (LTS) rules are not applied in the
dictionary: LTS rules are relevant to reading evaluations, while in our task
the prompts are auditory. Although these rules may still have some effect (the
children hear adults who are literate and influenced by Hispanic LTS rules
when speaking English), such instances appeared to be rare relative to the
increase in dictionary size that would be needed to cover them comprehensively.
The pronunciations in the dictionary are tagged by type (dialectal
pronunciation, canonical pronunciation, phonological development issue, etc.).
In this way, a “dialect” or “idiolect” can be attributed in a simple way: the
likelihood of each pronunciation is calculated, and the pronunciation with the
highest likelihood, if non-canonical, is declared the “idiolect” for the
speaker for that word. A pattern of many words going through the dialectal
path confirms a speaker as having dialectal speech. A constraint for detecting
dialect is that the speaker must produce the dialect consistently; that is,
the dialect, if detectable, must be the same in most of the task words. In
this way, the dialect is modeled as a system of distinctions, which is
linguistically much more appropriate.
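A minimal sketch of this per-word attribution and consistency check follows. The tag names, data layout, and the majority fraction are hypothetical; the text only specifies that the highest-likelihood pronunciation wins per word and that the same dialect must recur across most task words.

```python
# Hypothetical sketch of idiolect attribution with a dialect-consistency
# constraint. Tag names and the majority fraction are illustrative.
from collections import Counter

def best_variant(scored_variants):
    """Pick the highest-likelihood tagged pronunciation for one word.

    scored_variants: list of (tag, log_likelihood) pairs.
    """
    return max(scored_variants, key=lambda tv: tv[1])[0]

def attribute_dialect(per_word_variants, min_fraction=0.5):
    """Attribute a dialect only if one non-canonical tag wins on most words."""
    tags = [best_variant(v) for v in per_word_variants]
    non_canonical = [t for t in tags if t != "canonical"]
    if not non_canonical:
        return None
    tag, count = Counter(non_canonical).most_common(1)[0]
    return tag if count / len(tags) > min_fraction else None
```

The consistency requirement is what makes this a model of the dialect as a system of distinctions rather than a collection of isolated mispronunciations.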
4.4.5 Accuracy and Smoothness Measurements
Normalized HMM log likelihoods, computed through forced alignment, are used to
evaluate pronunciation quality. Accent information from the accent detection
component is used to choose the appropriate entries from the pronunciation
dictionary. Local normalization is applied to compensate for utterance length
(time duration):
    S_l = (1/N) · ∑_{i=1}^{N} (s_i / d_i)                            (4.4)
where s_i is the log likelihood of the ith segment (phoneme, onset, rhyme,
syllable, or the pause between segments), d_i is the corresponding duration in
frames, and the summation is over all N segments. The pronunciation is
acceptable if the log likelihood score S_l > T_l, where the threshold T_l can
be a speaker-independent empirical value or a speaker-specific value that
accounts for an individual speaker's acoustic characteristics.
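The length-normalized score of Eq. (4.4) and its threshold test reduce to a few lines. The function names and the example threshold are illustrative assumptions; the per-segment scores and frame durations would come from forced alignment.

```python
# Sketch of the length-normalized pronunciation score, Eq. (4.4).
# Function names and thresholds are illustrative assumptions.

def pronunciation_score(seg_logliks, seg_durations):
    """S_l = (1/N) * sum_i (s_i / d_i), with durations in frames."""
    assert len(seg_logliks) == len(seg_durations)
    n = len(seg_logliks)
    return sum(s / d for s, d in zip(seg_logliks, seg_durations)) / n

def pronunciation_acceptable(seg_logliks, seg_durations, t_l):
    """Accept if the normalized score exceeds the threshold T_l."""
    return pronunciation_score(seg_logliks, seg_durations) > t_l
```

Dividing each segment score by its frame count keeps long segments from dominating the utterance-level score.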
Segment durations are used to measure the blending smoothness. The durations
are obtained from forced alignments with the most likely pronunciations. To
compensate for the effects of speaking rate, the durations are normalized as:
    d_i = d_i / ∑_{j=1}^{N} d_j                                      (4.5)
Gaussian mixture models (GMMs) are used to approximate the distribution of
syllable duration ratios for each task word. Two GMMs are constructed from the
training data, one for native English and the other for Spanish-accented
English, and the output of the accent detection component is used to select
the appropriate model. The log likelihood of the observed duration ratios
under the selected GMM is used as the smoothness score S_d:
    S_d = ∑_i log ∑_m N(d_i; μ_im, σ_im)                             (4.6)
where N(·; μ_im, σ_im) is the mth Gaussian component, with mean μ_im and
variance σ_im, for the ith segment. If S_d is greater than the smoothness
threshold T_d, the blending smoothness is acceptable.
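Eqs. (4.5) and (4.6) can be sketched together. The mixture parameters below are illustrative placeholders; the actual GMMs were trained separately for native and Spanish-accented English and selected by the accent detector.

```python
# Sketch of the smoothness score of Eqs. (4.5)-(4.6). Mixture
# parameters are illustrative assumptions, not trained models.
import math

def normalize_durations(durations):
    """Eq. (4.5): divide each duration by the total utterance length."""
    total = sum(durations)
    return [d / total for d in durations]

def gaussian_pdf(x, mu, sigma):
    """Univariate Gaussian density N(x; mu, sigma)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def smoothness_score(durations, mixtures):
    """S_d = sum_i log sum_m N(d_i; mu_im, sigma_im), Eq. (4.6).

    mixtures[i] is a list of (mu, sigma) pairs for the ith segment,
    taken from the GMM selected by the accent detector.
    """
    ratios = normalize_durations(durations)
    return sum(
        math.log(sum(gaussian_pdf(r, mu, sigma) for mu, sigma in mix))
        for r, mix in zip(ratios, mixtures)
    )
```

A response whose syllable duration ratios sit near the trained means scores high; long hesitations between syllables shift the ratios into low-density regions and lower S_d.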
4.4.6 Overall Quality Measurement
The overall quality is unacceptable if either the pronunciation or the
smoothness is unacceptable. If both are acceptable, the overall quality is
evaluated from a weighted summation of the pronunciation and smoothness scores:

    S = w · S_l + (1 − w) · S_d                                      (4.7)

A threshold T is used to decide the acceptability of the overall quality. As
with the pronunciation evaluation, T can be speaker-independent or
speaker-specific.
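The complete decision rule then combines the two component tests with Eq. (4.7). The thresholds below are placeholders; the experiments in Section 4.5 report w = 0.35 as the best-performing weight.

```python
# Sketch of the overall decision rule around Eq. (4.7). Thresholds are
# placeholders; w = 0.35 is the optimal weight reported in the text.

def overall_quality(s_l, s_d, t_l, t_d, t_overall, w=0.35):
    """Return True if the response is acceptable overall."""
    if s_l <= t_l or s_d <= t_d:
        return False                 # either component failing rejects
    return w * s_l + (1 - w) * s_d > t_overall
```

Note the asymmetry: a failing component score rejects the response outright, while the weighted sum only arbitrates among responses that pass both component tests.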
4.5 Experimental Results
To test system performance, teachers' evaluations were used as references.
Monophone acoustic models were trained on the TBall database (excluding the
blending tasks), using approximately seven hours of annotated recordings from
both native and nonnative speakers. For each blending task, performance was
tested on 1350 utterances. Speaker-independent decision thresholds were used
in all experiments. Table 4.7 shows the correlation between automatic and
average teacher evaluations for the three blending tasks.
For pronunciation quality evaluation, the normalized likelihoods correlate
well with teacher assessments. For the smoothness measurement, the duration
ratio scores achieve performance comparable to the average inter-teacher
correlation. The overall evaluation, using a weighted summation of
pronunciation and smoothness scores, obtains an average correlation of around
88% over the three tasks, slightly better than the average inter-teacher
correlation. The weight w yielding optimal performance is 0.35, which means
that smoothness is more important than pronunciation in the blending tasks.
Note that on the syllable blending task, overall performance improves from
87.5% (in [40]) to 91.8% owing to the disfluency and accent detection.
Table 4.7: Average correlation (%) between automatic (ASR-based) and teacher
evaluations of pronunciation accuracy, smoothness, and overall quality for the
three blending tasks.
Blending task Accuracy Smoothness Overall
Phonemes 90.5 79.8 85.6
Onset-rhyme 93.2 83.1 87.9
Syllables 95.4 90.7 91.8
4.6 Summary and Discussion
An automatic evaluation system is developed to assess children’s performance on
three blending tasks. The system applies disfluency detection and accent detec-
tion for pre-processing and uses a pronunciation dictionary for forced alignment to
generate sound segmentations and produce HMM likelihood scores. The weighted
summation of normalized likelihoods and duration scores is used to evaluate the
overall quality of children’s responses. Speaker specific accent information is used
to update the dictionary and duration ratio models. Compared to teachers’ as-
sessments, the system achieves a correlation better than the average inter-teacher
correlation.
CHAPTER 5
Summary and Future Work
5.1 Summary and Discussion
This dissertation investigates rapid speaker normalization and adaptation meth-
ods to reduce speaker variations in automatic speech recognition (ASR) systems.
Two methods are developed, based on the supraglottal (vocal tract) resonances
(formants) and on the subglottal acoustic resonances, respectively. As an
application, an automatic system is developed using ASR to evaluate children’s
language learning skills.
Chapter 2 presents a rapid speaker adaptation method using regression-tree
based spectral peak alignment. Based on the analysis of various levels of spectral
mismatch in formant structures, this method is proposed to reduce phoneme-
and state-level spectral mismatch. With the linearization of frequency warping
in the cepstral domain, the method is investigated in a maximum likelihood lin-
ear regression (MLLR)-like framework, where the transformation matrix is gen-
erated deterministically by aligning phoneme- or state-level formant-like peaks
in the spectrum, while bias and covariance are statistically estimated using the
Expectation Maximization (EM) algorithm. This method can be viewed as a
combination of vocal tract length normalization (VTLN) and MLLR, taking ad-
vantage of both the efficiency of VTLN and the reliability of MLLR, with the
potential of good performance for both limited and large amounts of adaptation
data. Experimental results show that this method can achieve significant per-
formance improvements over MLLR, especially for cases where only a very small
amount of data is available.
Chapter 3 develops a speaker normalization method using the subglottal
resonances. Based on the acoustic coupling of the subglottal system to the
vocal tract, a reliable algorithm is proposed to automatically estimate the
second subglottal resonance (Sg2) from speech signals using joint frequency
discontinuity and energy attenuation measurements. The algorithm provides Sg2
estimates that are very close to the ground truth determined from direct
accelerometer measurements. With the proposed algorithm, speech data from
Spanish-English bilingual child speakers are analyzed to investigate the
content and language variability of the subglottal resonances. It is shown
that, for a given speaker, the second subglottal resonance does not appear to
vary across speech sounds, repetitions, or even languages. Based on these
observations, a speaker normalization method is proposed using the second
subglottal resonance. This normalization method defines the warping factor as
the ratio of the reference speaker's second subglottal resonance to that of
the test speaker. A variety of evaluations show that the Sg2-based
normalization method performs better than, or comparably to, VTLN, especially
for limited adaptation data and cross-language adaptation. An obvious
advantage of this method is that the subglottal resonances remain roughly
constant for a specific speaker, so the method is potentially independent of
the amount of available adaptation data and of the language.
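The warping-factor definition above can be sketched as follows. The piecewise-linear warp shape, the cutoff frequency, and the resonance values are illustrative assumptions in the style of conventional VTLN, not the exact form used in Chapter 3.

```python
# Hypothetical sketch of Sg2-based speaker normalization: the warping
# factor is the ratio of the reference speaker's Sg2 to the test
# speaker's, applied as a VTLN-style piecewise-linear frequency warp.
# Cutoff frequency and the piecewise form are illustrative assumptions.

def sg2_warp_factor(sg2_reference_hz, sg2_test_hz):
    """alpha = Sg2(reference) / Sg2(test speaker)."""
    return sg2_reference_hz / sg2_test_hz

def warp_frequency(f_hz, alpha, f_cut=4800.0, f_max=8000.0):
    """Linear warp below f_cut; continuous linear segment up to f_max."""
    if f_hz <= f_cut:
        return alpha * f_hz
    # connect (f_cut, alpha * f_cut) to (f_max, f_max) so the warp maps
    # the full band onto itself
    slope = (f_max - alpha * f_cut) / (f_max - f_cut)
    return alpha * f_cut + slope * (f_hz - f_cut)
```

Because Sg2 is roughly constant per speaker, a single reliable Sg2 estimate, even from one utterance in another language, fixes alpha for all of that speaker's data.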
Chapter 4 introduces ASR techniques to evaluate children's language learning
skills through blending and segmentation tasks. The challenges stem from the
children's young age, their multilingual backgrounds, and the frequent
occurrence of disfluencies. Accordingly, the system applies speaker
normalization, disfluency detection and accent detection as pre-processing to
localize possible valid responses. The target response is then passed to an
ASR system, which uses a pronunciation dictionary for forced alignment to
generate sound segment durations and to produce HMM likelihood scores.
Normalized likelihood scores and duration scores are used to measure
pronunciation accuracy and smoothness, respectively. These two scores are then
combined to assess the overall quality. Speaker-specific accent information is
used to update the dictionary and the duration ratio models. Compared to
teachers' assessments, the system achieves a correlation better than the
average inter-teacher correlation.
5.2 Future Work
Rapid speaker normalization and adaptation is an important issue for
real-world ASR systems, which must provide robust and satisfactory performance
over a large distribution of speakers. Because of data sparsity,
knowledge-based methods usually tend to outperform purely data-driven
statistical methods, as shown in this dissertation for both spectral peak
alignment and subglottal resonance normalization. Such prior knowledge can
greatly reduce the amount of adaptation data required and the dependency on it.
For future work, it is worth exploring the incorporation of acoustic and
perceptual knowledge into currently data-driven ASR systems. Such prior
knowledge can guide statistical ASR systems toward smarter and more efficient
training and decoding. Besides the formant structures and subglottal
resonances utilized in this dissertation, acoustic-phonetic knowledge, such as
the classic phonetic features of place, manner and voicing, and distinctive
features, may also be helpful in acoustic modeling to improve the models'
discriminative ability. Since these features are designed to specify phonemes
and to describe speech sounds in a particular language or dialect,
incorporating them into ASR systems seems promising, as reported in [86].
The spectral peak alignment method is investigated in an MLLR framework. It is
also possible to use Maximum A Posteriori Linear Regression (MAPLR) [57-59] to
incorporate prior knowledge, based on an empirical Bayes (EB) approach [57]
and/or the structural information of the models [58]. Provided that
appropriate priors are chosen, MAPLR may significantly outperform MLLR.
The Sg2 variations are studied on a small data set in this dissertation. More
data may need to be collected and analyzed to refine the characterization of
subglottal resonance variability. Future work is also required to further
improve the accuracy of the Sg2 detector and to evaluate the method on a
large-vocabulary database and in noisy conditions.
For the automatic evaluation system, future work will aim to improve perfor-
mance using additional features and speaker-specific modeling.
References
[1] L.E. Baum and J.A. Eagon, "An inequality with applications to statistical estimation for probabilistic functions of Markov processes and to a model for ecology," Bulletin of the American Mathematical Society, vol. 73, pp. 360-363, 1967.
[2] L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, Prentice Hall, 1993.
[3] L. Baum, "An inequality and associated maximization technique in statistical estimation for probabilistic functions of Markov processes," Inequalities, vol. 3, pp. 1-8, 1972.
[4] A.J. Viterbi, "Error bounds for convolutional codes and an asymptotically optimal decoding algorithm," IEEE Trans. Informat. Theory, vol. IT-13, pp. 260-269, 1967.
[5] A.P. Dempster, N.M. Laird and D.B. Rubin, "Maximum likelihood from incomplete data via the EM algorithm," J. Roy. Stat. Soc., vol. 39(1), pp. 1-38, 1977.
[6] J. Markel and A. Gray, Linear Prediction of Speech, Springer-Verlag, 1976.
[7] H. Hermansky, "Perceptual linear prediction (PLP) analysis of speech," Journal of the Acoustical Society of America, vol. 87(4), pp. 1738-1752, 1990.
[8] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. on Acoustics, Speech, Signal Proc., vol. 28(4), pp. 357-366, 1980.
[9] X. Huang and K.F. Lee, "On speaker-independent, speaker-dependent, and speaker-adaptive speech recognition," IEEE Trans. Speech and Audio Processing, vol. 1(2), pp. 150-157, 1993.
[10] H. Wakita, "Normalization of vowels by vocal tract length and its application to vowel identification," IEEE Trans. Acoust., Speech, Signal Processing, vol. 25, pp. 183-192, 1977.
[11] G. Fant, Acoustic Theory of Speech Production, The Hague, The Netherlands: Mouton, 1960.
[12] V. Digalakis, D. Rtischev and L.G. Neumeyer, "Speaker adaptation using constrained estimation of Gaussian mixtures," IEEE Trans. Speech Audio Processing, vol. 3(5), pp. 357-366, 1995.
[13] V. Digalakis, S. Berkowitz, E. Bocchieri, C. Boulis and W. Byrne, "Rapid speech recognizer adaptation to new speakers," in Proc. ICASSP, pp. 765-768, 1999.
[14] M.J.F. Gales, "Maximum likelihood linear transformations for HMM-based speech recognition," Computer Speech and Language, vol. 12(2), pp. 75-98, 1998.
[15] C.J. Leggetter and P.C. Woodland, "Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models," Computer Speech and Language, vol. 9, pp. 171-185, 1995.
[16] L. Lee and R. Rose, "A frequency warping approach to speaker normalization," IEEE Trans. Speech Audio Processing, vol. 6(1), pp. 49-60, 1998.
[17] S. Wegmann, D. McAllaster, J. Orloff and B. Peskin, "Speaker normalisation on conversational telephone speech," in Proc. ICASSP, vol. I, pp. 339-341, 1996.
[18] E. Eide and H. Gish, "A parametric approach to vocal tract length normalization," in Proc. ICASSP, pp. 346-349, 1996.
[19] J. McDonough, W. Byrne and X. Luo, "Speaker normalization with all-pass transforms," in Proc. ICSLP, vol. VI, pp. 2307-2310, 1998.
[20] J. McDonough, "Speaker compensation with all-pass transforms," Ph.D. dissertation, Johns Hopkins University, 2000.
[21] J.L. Gauvain and C.H. Lee, "Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains," IEEE Trans. Speech Audio Processing, vol. 2(2), pp. 291-298, 1994.
[22] G. Ding, Y. Zhu, C. Li and B. Xu, "Implementing vocal tract length normalization in the MLLR framework," in Proc. ICSLP, pp. 1389-1392, 2002.
[23] T. Claes, I. Dologlou, L. Bosch and D.V. Compernolle, "A novel feature transformation for vocal tract length normalization in automatic speech recognition," IEEE Trans. Speech Audio Processing, vol. 6(6), pp. 549-557, 1998.
[24] J. McDonough and W. Byrne, "Speaker adaptation with all-pass transforms," in Proc. ICASSP, pp. 757-760, 1999.
[25] J. McDonough, T. Schaaf and A. Waibel, "Speaker adaptation with all-pass transforms," Speech Communication, vol. 42, pp. 75-91, 2004.
[26] J. McDonough and A. Waibel, "Performance comparisons of all-pass transform adaptation with maximum likelihood linear regression," in Proc. ICASSP, pp. I313-I316, 2004.
[27] M. Pitz and H. Ney, "Vocal tract normalization as linear transformation of MFCC," in Proc. Eur. Conf. Speech Communication and Technology, pp. 1445-1448, 2003.
[28] S. Umesh, A. Zolnay and H. Ney, "Implementing frequency-warping and VTLN through linear transformation of conventional MFCC," in Proc. Interspeech, pp. 269-272, 2005.
[29] X. Cui and A. Alwan, "MLLR-like speaker adaptation based on linearization of VTLN with MFCC features," in Proc. Interspeech, pp. 273-276, 2005.
[30] X. Cui and A. Alwan, "Adaptation of children's speech with limited data based on formant-like peak alignment," Computer Speech and Language, vol. 20(4), pp. 400-419, 2006.
[31] E. Gouvea and R. Stern, "Speaker normalization through formant-based warping of the frequency scale," in Proc. Eurospeech, pp. 1139-1142, 1997.
[32] P. Zhan and M. Westphal, "Speaker normalization based on frequency warping," in Proc. ICASSP, pp. 1039-1041, 1997.
[33] S. Wang, X. Cui and A. Alwan, "Speaker adaptation with limited data using regression-tree based spectral peak alignment," IEEE Trans. Audio, Speech, and Language Processing, vol. 15, pp. 2454-2464, 2007.
[34] X. Wang, B. Wang and D. Qi, "A bilinear transform approach for vocal tract length normalization," in Proc. ICARCV, pp. 547-551, 2004.
[35] V. Zue, S. Seneff, J. Polifroni, H. Meng and J. Glass, "Multilingual human-computer interactions: from information access to language learning," in Proc. ICSLP, 1996.
[36] J. Wilpon and C. Jacobsen, "A study of speech recognition for children and the elderly," in Proc. ICASSP, vol. I, pp. 349-352, 1996.
[37] J. Mostow, G. Aist, P. Burkhead, A. Corbett, A. Cuneo, S. Eitelman, C. Huang, B. Junker, M.B. Sklar and B. Tobin, "Evaluation of an automated reading tutor that listens: comparison to human tutoring and classroom instruction," Journal of Educational Computing Research, vol. 29(1), pp. 61-117, 2003.
[38] A. Hagen, B. Pellom and R. Cole, "Children's speech recognition with applications to interactive books and tutors," in Proc. ASRU, 2003.
[39] A. Alwan, et al., "A system for technology based assessment of language and literacy in young children: the role of multiple information sources," in Proc. IEEE MMSP, Greece, October 2007.
[40] S. Wang, et al., "Automatic evaluation of children's performance on an English syllable blending task," in Proc. SLaTE Workshop, 2007.
[41] S. Lee, A. Potamianos and S. Narayanan, "Acoustics of children's speech: developmental changes of temporal and spectral parameters," J. Acoust. Soc. Am., vol. 105(3), pp. 1455-1468, 1999.
[42] J.E. Huber, E.T. Stathopoulos, G.M. Curione, T.A. Ash and K. Johnson, "Formants of children, women, and men: the effect of vocal intensity variation," J. Acoust. Soc. Am., vol. 106(3), pp. 1532-1542, 1999.
[43] D. Elenius and M. Blomberg, "Comparing speech recognition for adults and children," in Proc. FONETIK, 2004.
[44] K. Lee, A. Hagen, N. Romanyshyn, S. Martin and B. Pellom, "Analysis and detection of reading miscues for interactive literacy tutors," in Proc. COLING, 2004.
[45] E. Shriberg, R. Bates and A. Stolcke, "A prosody-only decision-tree model for disfluency detection," in Proc. Eurospeech, pp. 2383-2386, 1997.
[46] Y. Liu, E. Shriberg and A. Stolcke, "Automatic disfluency identification in conversational speech using multiple knowledge sources," in Proc. Eurospeech, pp. 957-960, 2003.
[47] M. Black, et al., "Automatic detection and classification of disfluent reading miscues in young children's speech for the purpose of assessment," in Proc. Interspeech, 2007.
[48] A. Hagen, B. Pellom and R. Cole, "Highly accurate children's speech recognition for interactive reading tutors using subword units," Speech Communication, vol. 49, pp. 861-873, 2007.
[49] T. Chen, C. Huang, E. Chang and J. Wang, "Automatic accent identification using Gaussian mixture models," in IEEE Workshop on ASRU, 2001.
[50] C. Teixeira, H. Franco, E. Shriberg, K. Precoda and K. Sonmez, "Evaluation of speaker's degree of nativeness using text-independent prosodic features," in Proc. Multilingual Speech and Language Processing, 2001.
[51] T. Schultz, Q. Jin, K. Laskowski, A. Tribble and A. Waibel, "Speaker, accent and language identification using multilingual phone strings," in Proc. HLT, 2002.
[52] T. Anastasakos, J. McDonough, R. Schwartz and J. Makhoul, "A compact model for speaker-adaptive training," in Proc. ICSLP, pp. 1137-1140, 1996.
[53] P. Price, W.M. Fisher, J. Bernstein and D.S. Pallett, "The DARPA 1000-word resource management database for continuous speech recognition," in Proc. ICASSP, vol. 1, pp. 651-654, 1988.
[54] R. Leonard, "A database for speaker-independent digit recognition," in Proc. ICASSP, vol. 9, pp. 328-331, 1984.
[55] P. Zolfaghari and T. Robinson, "Formant analysis using mixtures of Gaussians," in Proc. ICSLP, pp. 1229-1232, 1996.
[56] S. Panchapagesan, "Frequency warping by linear transformation of standard MFCC," in Proc. Interspeech, pp. 397-400, 2006.
[57] C. Chesta, O. Siohan and C. Lee, "Maximum a posteriori linear regression for hidden Markov model adaptation," in Proc. EuroSpeech, pp. 211-214, 1999.
[58] W. Chou and X. He, "Maximum a posteriori linear regression (MAPLR) variance adaptation for continuous density HMMs," in Proc. EuroSpeech, pp. 1513-1516, 2003.
[59] W. Chou, "Maximum a posteriori linear regression with elliptically symmetric matrix variate priors," in Proc. EuroSpeech, pp. 1-4, 1999.
[60] L.R. Rabiner and R.W. Schafer, Digital Processing of Speech Signals, Prentice Hall, 1978.
[61] L. Gillick and S. Cox, "Some statistical issues in the comparison of speech recognition algorithms," in Proc. ICASSP, pp. 532-535, 1989.
[62] K.N. Stevens, Acoustic Phonetics, MIT Press, Cambridge, MA, 1998.
[63] S.M. Lulich, "Subglottal resonances and distinctive features," J. Phon., doi:10.1016/j.wocn.2008.10.006, 2008.
[64] S.M. Lulich, M. Zanartu, D.D. Mehta and R.E. Hillman, "Source-filter interaction in the opposite direction: subglottal coupling and the influence of vocal fold mechanics on vowel spectra during the closed phase," J. Acoust. Soc. Am., vol. 125, p. 2638, 2009.
[65] X. Chi and M. Sonderegger, "Subglottal coupling and its influence on vowel formants," J. Acoust. Soc. Am., vol. 122(3), pp. 1735-1745, 2007.
[66] Y. Jung, S.M. Lulich and K.N. Stevens, "Development of subglottal quantal effects in young children," J. Acoust. Soc. Am., vol. 124, p. 2519, 2008.
[67] S.M. Lulich, "The role of lower airway resonances in defining vowel feature contrasts," Ph.D. dissertation, MIT, Cambridge, MA, 2006.
[68] S.M. Lulich, A. Bachrach and N. Malyska, "A role for the second subglottal resonance in lexical access," J. Acoust. Soc. Am., vol. 122(4), pp. 2320-2327, 2007.
[69] M. Sonderegger, "Subglottal coupling and vowel space: an investigation in quantal theory," Undergraduate thesis, MIT, Cambridge, MA, 2004.
[70] H.A. Cheyne, "Estimating glottal voicing source characteristics by measuring and modeling the acceleration of the skin on the neck," Ph.D. dissertation, MIT, Cambridge, MA, 2001.
[71] The Snack Sound Toolkit, Royal Inst. Technol., Oct. 2005 [Online]. Available: http://www.speech.kth.se/snack/ (date last viewed 8/10/08).
[72] K. Honda, S. Takano and H. Takemoto, "Effects of side cavities and tongue stabilization: possible extensions of quantal theory," J. Phon., doi:10.1016/j.wocn.2008.11.002, 2008.
[73] H. Hanson and K.N. Stevens, "Subglottal resonances in female speakers and their effect on vowel spectra," in Proc. XIIIth International Congress of Phonetic Sciences, Stockholm, vol. 3, pp. 182-185, 1995.
[74] X. Chi and M. Sonderegger, "Subglottal coupling and vowel space," J. Acoust. Soc. Am., vol. 115(5), p. 2540, 2004.
[75] Y. Jung, "Acoustic articulatory evidence for quantal vowel categories across languages," Poster presented at the Harvard-MIT HST Forum, 2008.
[76] A. Madsack, S.M. Lulich, W. Wokurek and G. Dogil, "Subglottal resonances and vowel formant variability: a case study of high German monophthongs and Swabian diphthongs," Lab. Phon. 11, 2008.
[77] A. Kazemzadeh, H. You, M. Iseli, B. Jones, X. Cui, M. Heritage, P. Price, E. Anderson, S. Narayanan and A. Alwan, "TBall data collection: the making of a young children's speech corpus," in Proc. Eurospeech, pp. 1581-1584, 2005.
[78] S. Wang, S.M. Lulich and A. Alwan, "A reliable technique for detecting the second subglottal resonance and its use in cross-language speaker adaptation," in Proc. Interspeech, pp. 1717-1720, 2008.
[79] S. Umesh, L. Cohen and D. Nelson, "Frequency warping and the Mel scale," IEEE Signal Processing Letters, vol. 9(3), pp. 104-107, 2002.
[80] R. Sinha and S. Umesh, "A shift-based approach to speaker normalization using non-linear frequency-scaling model," Speech Communication, vol. 50, pp. 191-202, 2008.
[81] Y. Ono, H. Wakita and Y. Zhao, "Speaker normalization using constrained spectra shifts in auditory filter domain," in Proc. EUROSPEECH, pp. 21-23, 1993.
[82] D. Burnett and M. Fanty, "Rapid unsupervised adaptation to children's speech on a connected-digit task," in Proc. ICASSP, pp. 1145-1148, 1996.
[83] P. Zhan and A. Waibel, "Vocal tract length normalization for large vocabulary continuous speech recognition," Tech. Rep. CMU-CS-97-148, Carnegie Mellon University, May 1997.
[84] H. You, et al., "Pronunciation variation of Spanish-accented English spoken by young children," in Proc. Eurospeech, pp. 273-276, 2005.
[85] The Speech Accent Archive [Online]. Available: http://accent.gmu.edu/ (date last viewed 02/10/10).
[86] M. Hasegawa-Johnson, J. Baker, S. Borys, K. Chen, E. Coogan, S. Greenberg, A. Juneja, K. Kirchhoff, K. Livescu, S. Mohan, J. Muller, K. Sonmez and T. Wang, "Landmark-based speech recognition: report of the 2004 Johns Hopkins Summer Workshop," in Proc. ICASSP, pp. 214-216, 2005.