Large Models for Large Corpora: preliminary findings
Patrick Nguyen
Ambroise Mutel
Jean-Claude Junqua
Panasonic Speech Technology Laboratory
(PSTL)
… from RT-03S workshop
• Lots of data helps
• Standard training can be done in reasonable time with current resources
• 10kh: it’s coming soon
• Promises:
  – Change the paradigm
  – Use data more efficiently
  – Keep it simple
The Dawn of a New Era?
• Merely increasing the model size is insufficient
• HMMs will live on
• Layering above HMM classifiers will do
• Change the topology of training
• Large models with no impact on decoding
• Data-greedy algorithms
• Only meaningful with large amounts of data
Two approaches: changing topology
• Greedy models
  – Syllable units
  – Same model size, but consume more info
  – Increase data / parameter ratio
  – Add linguistic info
• Factorize training: increase model size
  – Bubble splitting (generalized SAT)
  – Almost no penalty in decoding
  – Split according to acoustics
Syllable units
• Supra-segmental info
• Pronunciation modeling (subword units)
• Literature blames lack of data
• TDT+ coverage is limited by construction (all words are in the decoding lexicon)
• Better alternative to n-phones
What syllables?
• NIST / Bill Fisher tsyl software (ignores ambi-syllabic phenomena)
• “Planned speech” mode
• Always a schwa in the lexicon (e.g. little)
• Phone del/sub/ins + supra-segmental info is good
[Diagram: syllable structure — onset + rhyme, where the rhyme splits into peak (nucleus) and coda; alternative decomposition: body + tail]
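NIST’s tsyl software does the actual syllabification; as a toy illustration of the onset/peak/coda decomposition, here is a maximal-onset syllabifier sketch (the phone and onset inventories are invented for the example, not tsyl’s rules):

```python
# Toy maximal-onset syllabifier -- NOT NIST tsyl; phone sets are invented.
VOWELS = {"aa", "ae", "ah", "ax", "eh", "ih", "iy", "uw", "er"}
ONSETS = {(), ("b",), ("d",), ("k",), ("n",), ("sh",), ("t",), ("s", "t")}

def syllabify(phones):
    """Split a phone sequence into syllables around each vowel (peak),
    giving inter-vowel consonants to the next syllable's onset when legal
    (maximal onset), otherwise to the previous syllable's coda."""
    peaks = [i for i, p in enumerate(phones) if p in VOWELS]
    if not peaks:
        return [list(phones)]
    starts = [0]
    for a, b in zip(peaks, peaks[1:]):
        cluster = phones[a + 1:b]          # consonants between two peaks
        onset_len = 0
        for n in range(len(cluster), -1, -1):
            if tuple(cluster[len(cluster) - n:]) in ONSETS:
                onset_len = n              # longest legal onset wins
                break
        starts.append(b - onset_len)
    starts.append(len(phones))
    return [phones[s:e] for s, e in zip(starts, starts[1:])]

print(syllabify("ae b d ah k sh ih n".split()))
```

On “abduction” this reproduces the ae_b d_ah_k sh_ih_n split used as the example later in the deck.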
Syllable units: facts
• Hybrid (syl + phones)
  – [Ganapathiraju, Goel, Picone, Corrada, Doddington, Kirchhoff, Ordowski, Wheatley; 1997]
• Seeding with CD-phones works
  – [Sethy, Narayanan; 2003]
• State tying works [PSTL]
• Position-dependent syllables work [PSTL]
• CD-syllables kind of work [PSTL]
• ROVER should work
  – [Wu, Kingsbury, Morgan, Greenberg; 1998]
• Full embedded re-estimation does not work [PSTL]
Coverage and large corpus
• Warning: biased by construction
• In the lexicon: 15k syllables
• About 15M syllable tokens total (10M words)
• Total: 1600h / 950h filtered
[Plot residue: syllable frequency distribution, with labeled points at 127 examples, 14 examples, and 1 example]
Coverage
Inventory   CI-Syl   Pos-CI Syl   Pos-CD Syl
10k         99.7%    99.5%        94.7%
6k          99.7%    98.5%        90.7%
3k          98.2%    95.0%        82.2%
2k          95.9%    91.1%        76.0%
1.5k        93.4%    87.6%        71.3%
1k          88.7%    81.6%        64.6%
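Coverage figures like these fall out of a simple computation: rank syllable types by corpus frequency and measure the fraction of tokens the top-N types account for. A sketch with invented toy counts:

```python
from collections import Counter

def coverage(token_counts, inventory_size):
    """Fraction of corpus tokens covered by the `inventory_size`
    most frequent syllable types."""
    counts = sorted(token_counts.values(), reverse=True)
    return sum(counts[:inventory_size]) / sum(counts)

# Hypothetical frequency table; the real one comes from the 15M-token corpus.
toy = Counter({"dh_ax": 500, "t_uw": 300, "ae_b": 120, "sh_ih_n": 60, "d_ah_k": 20})
print(coverage(toy, 2))   # top-2 types cover 800 of 1000 tokens
```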
Hybrid: backing off
• Cannot train all syllables => back off to a phone sequence
• “Context breaks”: the context-dependency chain will break
• Two kinds of back-off (example: “abduction”)
  – True sequence: ae_b d_ah_k sh_ih_n
  – [Sethy+]: ae_b d_ah_k sh+ih sh-ih+n ih-n
  – [Doddington+]: ae_b d_ah_k k-sh+ih sh-ih+n ih-n
• Tricky
• We don’t care: state tying, and we almost never back off
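The two back-off styles differ only in whether the triphone context reaches across the syllable boundary; a sketch (the helper and its argument names are ours, not from the cited systems):

```python
def backoff_triphones(prev_phone, phones, next_phone, cross_word=True):
    """Expand an untrainable syllable into a triphone sequence.
    cross_word=True mimics the [Doddington+] style (left context reaches
    across the syllable boundary); False mimics [Sethy+] (it is dropped)."""
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else (prev_phone if cross_word else None)
        right = phones[i + 1] if i + 1 < len(phones) else next_phone
        name = p
        if left:
            name = f"{left}-{name}"
        if right:
            name = f"{name}+{right}"
        out.append(name)
    return out
```

For the “-tion” syllable of the example (previous phone k, nothing after), this yields the two expansions shown on the slide.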
Seeding
• Copy CD models instead of flat-starting
• Problem at syllable boundary (context break)
• CI < seeded syl < CD
• Imposes constraints on topology (e.g. ??-sh+ih  sh-ih+n  ih-n+??)
Seeding (results)
• Mono-Gaussian models
• Trend continues even with iterative split
• CI: 69% WER
• CD: 26% WER
• Syl-flat: 41% WER
• Syl-seed: 31% WER (CD init)
State-tying
• Data-driven approach
• Backing off is a problem => train all syllables
  – too many states/distributions
  – too little data (skewed distribution)
• Same strategy as CD-phone: entropy merge
• Can add info (pos, CD) w/o worrying about explosion in # of states
State-tying (2)
• Compression w/o performance loss
• Phone-internal, state-internal bottom-up merge to limit computations
• Count about 10 states per syllable (3.3 phones)
• Pos-dep CI syllables
• 6000-syl model, 59k Gaussians: 38.7% WER
• Merged to 6k (6k Gaussians): 38.6% WER
• Trend continues with iterative split
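The entropy-merge criterion is the standard likelihood-loss test for pooling two Gaussians; a 1-D greedy sketch of the bottom-up merge (the real system works on state distributions, phone-internally, to limit the pair search):

```python
import math

def merge_stats(s1, s2):
    """Pool two (count, mean, variance) sufficient statistics."""
    (n1, m1, v1), (n2, m2, v2) = s1, s2
    n = n1 + n2
    m = (n1 * m1 + n2 * m2) / n
    v = (n1 * (v1 + (m1 - m) ** 2) + n2 * (v2 + (m2 - m) ** 2)) / n
    return (n, m, v)

def merge_cost(s1, s2):
    """Entropy (log-likelihood) increase caused by pooling s1 and s2."""
    n, _, v = merge_stats(s1, s2)
    return 0.5 * (n * math.log(v)
                  - s1[0] * math.log(s1[2]) - s2[0] * math.log(s2[2]))

def greedy_merge(states, target):
    """Bottom-up: repeatedly pool the cheapest pair until `target` remain."""
    states = list(states)
    while len(states) > target:
        _, i, j = min((merge_cost(a, b), i, j)
                      for i, a in enumerate(states)
                      for j, b in enumerate(states) if i < j)
        merged = merge_stats(states[i], states[j])
        del states[j]          # j > i, so delete j first
        states[i] = merged
    return states
```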
Position-dependent syllables
• Word-boundary info (cf. triphones)
• Example:
  – Worthiness: _w_er dh_iy n_ih_s
  – The: _dh_iy_
• Missing from [Sethy+] and [Dod.+]
• Results: (3% absolute at every split)
  – Pos-indep (6k syl): 39.2% WER (2dps)
  – Pos-dep: 35.7% WER (2dps)
CD-Syllables
• Inter-syllable context breaks
• Context = next phone
• Next syl? Next vowel? (Nucleus/peak)
• CD(phone)-syl >= CD triphones
• Small gains!
• All GD results
• CI-syl (6k syl): 19.0% WER
• CD-syl (20k syl): 18.5% WER
• CD-phones: 18.9% WER
Segmentation
• Word and Subword units give poor segmentation
• Speaker-adapted overgrown CD-phones are always better
• Problem for: MMI and adaptation
• Results (ML):
  – Word-internal: 21.8% WER
  – Syl-internal: 19.9% WER
MMI/ADP didn’t work well
• MMI: time-constrained to +/- 3ms within word boundary
• Blame it on the segmentation (Word-int)
            ML      MMI             +adp
CD-phones   18.9%   17.5% (-1.4%)   15.5% (-2.0%)
CD-syl      18.5%   17.4% (-1.1%)   16.0% (-1.4%)
ROVER
• Two different systems can be combined
• Two-pass “transatlantic” ROVER architecture
• CD-phones align, phonetic classes
• No gain (broken confidence), deletions
  – MMI+adp: 15.5% (CDp) and 16.0% (SY)
  – Best ROVER: 15.5% WER (4-pass, 2-way)
[Diagram: adapted CD-phones combined with syllable models]
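ROVER combines hypotheses by aligning them into a word transition network and voting; a minimal sketch of just the voting step over pre-aligned slots (real ROVER also performs the alignment and can weight votes by confidence scores, which is where the “broken confidence” hurt):

```python
from collections import Counter

def rover_vote(aligned_hyps, null="@"):
    """Majority vote over pre-aligned hypotheses (one word per slot;
    `null` marks a deletion in that hypothesis). Sketch only: counts
    votes, no confidence weighting, no alignment."""
    out = []
    for slot in zip(*aligned_hyps):
        word, _ = Counter(slot).most_common(1)[0]
        if word != null:
            out.append(word)
    return out
```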
Summary: architecture
[Diagram: POS-CI syllables are merged (6k), then merged (3k) with GD / MMI training; the resulting POS-CD syllable system and the CD-phone system each run decode and adapt+decode passes, and their outputs are combined with ROVER]
Conclusion (Syllable)
• Observed similar effects as literature
• Added some observations (state tying, CD, pos, ADP/MMI)
• Performance does not beat CD-phones yet– CD phones: 15.5% WER ; syl: 16.0% WER
• Some assumptions might cancel the benefit of syllable modeling
Open questions
• Is syllabification (grouping) better than random? Is it the syllable itself?
• Planned vs. spontaneous speech?
• Did we oversimplify?
• Why do subword units resist auto-segmentation?
• Why didn’t CD-syl work better?
• Language-dependent effects
Bubble Splitting
• Outgrowth of SAT
• Increase model size 15-fold w/o computational penalty in train/decode
• Also covers VTLN implementation
• Basic idea:
  – Split training into locally homogeneous regions (bubbles), and then apply SAT
SAT vs Bubble Splitting
• SAT relies on locally linearly compactable variabilities
• Each bubble has local variability
• Simple acoustic factorization
[Diagram: SAT vs. bubble adaptation (MLLR)]
TDT and speaker labels
• TDT is not speaker-labeled
• Hub4 has 2400 nominative speakers
• Use decoding clustering (show-internal clusters)
• Males: 33k speakers
• Females: 18k speakers
• Shows: 2137 (TDT) + 288 (Hub4)
Bubble-Splitting: Overview
[Diagram: input speech from TDT is decoded into words via maximum likelihood, then multiplexed into male and female streams; each stream is SPLIT, ADAPTed, and NORMALIZEd to produce Compact Bubble Models (CBM)]
VTLN implementation
• VTLN is used for clustering
• VTLN is a linear feature transformation (almost)
• Finding the best warp
VTLN: Linear equivalence
• According to [Pitz, ICSLP 2000], VTLN is equivalent to a linear transformation in the cepstral domain:

    c_k = (1/π) ∫₀^π cos(kω) log|X(e^{iω})|² dω

• The relationship between a cepstral coefficient c_k and a warped one (stretched or compressed, with warping function ω̃ = Φ(ω)) is as follows:

    c̃_n = (1/π) ∫₀^π cos(nω̃) log|X̃(e^{iω̃})|² dω̃ = Σ_{k=0}^{K} A_nk c_k,
    with  A_nk = (2/π) ∫₀^π cos(nω̃) cos(k Φ⁻¹(ω̃)) dω̃

• Energy, filter-banks, and cepstral liftering imply non-linear effects
• The authors didn’t take the Mel scale into account. With

    M(f) = 1127 log(1 + f/700),   f̃ = Φ(f),

  the matrix elements A^M_nk = cste · ∫ [cos(n M(f̃)) cos(k M(f))] df have no closed-form solution.
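In the linear case the matrix A can be evaluated numerically. A sketch for a piecewise-linear warp (the warp shape and break point ω₀ are our assumptions, and the k = 0 column carries the usual half weight of the cosine series); substituting ω̃ = Φ(ω) turns the integral into A_nk = (2/π) ∫₀^π cos(n Φ(ω)) cos(kω) Φ′(ω) dω:

```python
import numpy as np

def warp(w, alpha, w0=0.875 * np.pi):
    """Assumed piecewise-linear VTLN warp: slope alpha up to w0,
    then linear so that pi maps to pi."""
    return np.where(w <= w0,
                    alpha * w,
                    alpha * w0 + (np.pi - alpha * w0) * (w - w0) / (np.pi - w0))

def vtln_matrix(alpha, K=8, grid=8192):
    """Trapezoidal evaluation of A_nk so that c_warped ~= A @ c."""
    w = np.linspace(0.0, np.pi, grid)
    phi = warp(w, alpha)
    dphi = np.gradient(phi, w)                     # Phi'(w)
    cos_n = np.cos(np.arange(K)[:, None] * phi)    # (K, grid)
    cos_k = np.cos(np.arange(K)[:, None] * w)      # (K, grid)
    tw = np.full(grid, np.pi / (grid - 1))         # trapezoid weights
    tw[0] = tw[-1] = tw[0] / 2
    A = (2.0 / np.pi) * (cos_n * dphi) @ (cos_k * tw).T
    A[:, 0] *= 0.5                                 # half weight for c_0
    return A

A = vtln_matrix(1.0)   # identity warp => (near) identity matrix
```

A quick sanity check is that the identity warp (alpha = 1) yields the identity matrix, so unwarped features pass through unchanged.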
VTLN is linear: decoding algorithm
• Choose the warp transform that maximizes the likelihood:

    Â = argmax_{A_i} L(O | λ, A_i)

[Diagram: input speech → decode with A_i and λ → decoded words]

• Auxiliary function per warp:

    Q(A_i) = Σ_{t,m} γ_m(t) { log|A_i| − ½ (A_i o_t − μ_m)ᵀ R_m (A_i o_t − μ_m) }
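The per-warp score can be written directly from the auxiliary function; a sketch (array layout and names are ours; R_m denotes the inverse covariance of mixture component m):

```python
import numpy as np

def q_score(A, obs, gammas, means, inv_covs):
    """Q(A) = sum_{t,m} gamma_m(t) * ( log|A|
              - 0.5 * (A o_t - mu_m)^T R_m (A o_t - mu_m) ).
    At decode time the warp A_i with the largest Q is selected."""
    _, logdet = np.linalg.slogdet(A)
    score = 0.0
    for t, o in enumerate(obs):
        Ao = A @ o
        for m, (mu, R) in enumerate(zip(means, inv_covs)):
            d = Ao - mu
            score += gammas[t, m] * (logdet - 0.5 * d @ R @ d)
    return score
```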
Experimental results:
Statistical multiplex
• GD-mode• Faster than Brent search• 3 times faster than
exhaustive search• Based on prior
distribution of alpha• Test 0.98, 1, and 1.02• If 0.98 wins, continue
5)()(14.1
86.0
NPNE
4 Q evaluations: N0.98 = N1.02
= 4
0.98
3 Q evaluations: N1.00=3
1.00
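The search on the slide can be sketched as a greedy walk: score the center warp and its two neighbours, then keep stepping in the winning direction while the score improves (the `score` callable and the range bounds stand in for the real Q evaluation over 0.86–1.14):

```python
def greedy_warp_search(score, step=0.02, start=1.0, lo=0.86, hi=1.14):
    """Greedy warp search: evaluate start-step, start, start+step;
    if an off-center warp wins, walk outward until the score drops.
    Returns (best_warp, number_of_score_evaluations)."""
    evals = {start: score(start)}
    for a in (round(start - step, 4), round(start + step, 4)):
        evals[a] = score(a)
    best = max(evals, key=evals.get)
    direction = step if best > start else -step
    while best != start:
        nxt = round(best + direction, 4)
        if not (lo <= nxt <= hi):
            break
        evals[nxt] = score(nxt)
        if evals[nxt] <= evals[best]:
            break
        best = nxt
    return best, len(evals)
```

With a score peaked at 1.00 this stops after 3 evaluations; peaked at 0.98, after 4 — matching the evaluation counts on the slide.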
Bubble Splitting: Principle
[Diagram: training speakers grouped into bubble B_i, each with a partial center (SAT model λ_i)]
1. Separate conditions
   • VTLN
2. Train bubble model
3. Compact using SAT
   • Feature-space SAT
• SAT works on homogeneous conditions
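Step 1 amounts to keying the training data on locally homogeneous conditions; a toy sketch that buckets utterances by (gender, warp factor) and computes a per-bubble statistic in place of the real SAT compaction (field names are invented):

```python
from collections import defaultdict
import statistics

def make_bubbles(utterances):
    """Partition training utterances into bubbles keyed by
    (gender, warp factor); compute a toy per-bubble cepstral mean
    standing in for the bubble's SAT model."""
    bubbles = defaultdict(list)
    for utt in utterances:
        bubbles[(utt["gender"], utt["warp"])].append(utt["c0"])
    return {key: statistics.fmean(vals) for key, vals in bubbles.items()}
```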
Results
              1st-pass   adapted
Baseline GD   19.0%      16.6%
SAT           18.7%      16.5%
Bubbles       18.5%      16.0%
192k => 384k  18.6%      16.3%

• About 0.5% WER reduction
• Doubling model size alone => 0.3% WER
Conclusion (Bubble)
• Gain: 0.5% WER
• Extension of SAT model compaction
• VTLN implementation more efficient
Open questions
• Baseline SAT does not work?
• Speaker definition?
• Best splitting strategy? (One per warp)
• Best decoding strategy? (Closest warp)
• Best bubble training? (MAP/MLLR)
• MMIE
Conclusion
• What do we do with all of these data?
• Syllable + bubble splitting
• Two narrowly explored paths among many
• Promising results but nothing breathtaking
• Not ambitious enough?
System setup
• RT03 eval
• 6x RT
• Same parameters as RT03S eval system
  – WI triphones, gender-dependent, MMI
  – 2-pass
  – Global MLLR + 7-class MLLR
  – 39 MFCC + non-causal CMS (2s)
  – 192k Gaussians, 3400 mixtures
  – 128 Gaussians / mix => merged