Post on 16-Jan-2016
Katholieke Universiteit Leuven - ESAT, BELGIUM
Combining Abstract and Exemplar Models
in ASR
Dirk Van Compernolle, Kris Demuynck, Mathias De Wachter
S2S Nijmegen Workshop
February 10-14, 2008
Dirk Van Compernolle, Combining Abstract & Exemplar Models in ASR 2
Overview
• PART I: Example based models in ASR
  – Motivation
  – Proof-of-concept
  – Baseline results
  – Required extensions
• PART II: Bottom-up vs. top-down processing in ASR: do we care ?
  – A top-down search engine with bottom-up phonetic scoring
  – A combined template matching and HMM recognizer
PART I
Example Based ASR
Example Based ASR
• Example based ASR was successful in speaker-dependent isolated word recognition. It was abandoned when the technology moved to continuous, speaker-independent recognition.
Why re-activate an approach that quietly died 25 yrs ago ?
• Psycho-linguistics and intuition give evidence of the existence of individual memory traces (spanning many phonemes) in:
  – human speech recognition in general
  – music/song memory & recognition
  – second language learning
• Success of concatenative text-to-speech
• Acknowledgement of the limitations of model based (HMM based) ASR
• The computing demands of continuous large vocabulary recognition were essential bottlenecks; that may no longer be relevant today.
Today’s Prototypical HMM based ASR: Beads-on-a-String Model
Phone Modeling with HMMs
[Figure: TRAINING. Many examples of phone 'j' (short-time spectral representations), denoted ph(j)_1, ph(j)_2, … ph(j)_N, are used to estimate a multi-state model for phone 'j': HMM states S_j1, S_j2, S_j3, each with means + variances + a duration model.]
Iterative HMM Training
[Figure: iterative training loop. The speech database feeds Feature Extraction; the word-level transcription is mapped to a phone-level transcription (words-to-phones, using the dictionary / phone set). Starting from a reference HMM, Viterbi Alignment, State (sub-phone) Segmentation and Re-estimation are iterated to produce the phone HMMs.]
HMM Model Building
• Based on a 2-D LDA projection of mel-cepstra optimized for digit recognition
• ‘S1,S2,S3’ represent the 3 CD HMM-states of the central vowel in “f I ve”
• Ellipses indicate the ‘1-sigma’ boundaries
HMMs – Strengths
• Strong mathematical framework: statistical pattern matching / Bayesian classification
  – optimal strategy under the assumption of a perfect model with sufficient training data !!
  – fully automatic training (inner loop)
  – ability to (optimally) combine the information from thousands of hours of example speech
• Highly scalable: more data leads to better results
  – allows for training a more refined model, with more parameters, that gets closer to reality (model assumptions)
  – a better trained model that is more robust to intrinsic variability
HMMs - Weaknesses
• The model is intrinsically flawed because of:
  – the within-state (i.e. short-term) stationarity assumption
  – the 1st-order Markov assumption on state transitions
  – the presumed frame-by-frame independence of the observations
• This implies:
  – no guaranteed optimality for the Bayesian classification / maximum likelihood paradigm
  – a continuous effort to improve (patch) the model
  – best performance with discriminative training procedures
HMMs: 30 yrs of improvements on the basic model
“If the model was correct, then nothing would be better than HMMs and Maximum Likelihood Training. So, let’s stick to the concept and fix the model.”
• Multi-state context-dependent models
• Multi-gaussian modeling of the observation densities
• Derivative features
“For this we only need bigger computers and more data, which allow us”
• to make these complex models with more degrees of freedom
• to do a proper training of these hundreds of millions of parameters
• to perform recognition with them in real time
HMMs: 30 yrs of improvements on the basic model
…. then we will reach nirvana, unless …
… after 30 yrs the model is still basically flawed
  – because of poor segmental modeling
… more training data no longer seems to result in better models
  – because the gains from additional data seem to grow only logarithmically
  – because for smaller languages more data is just not feasible
… so, today computers have more power than we know what to do with
Example Trajectories and HMM states
• Trajectories contain more information than the HMM state sequence !!
• Trajectories show a very different picture than the ‘cloud’ of points underlying HMM state training
Aligning of individual trajectories to HMMs
HMM vs. Segmental
[Figure: two observation sequences (red and black) aligned to HMM states S1 and S2]
• HMM viewpoint: the red and black sequences of observations yield identical scores
• Segmental viewpoint: the black trajectory is significantly more plausible than the red one
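The HMM-viewpoint claim above can be checked with a toy calculation (a sketch, not from the talk): under the frame-by-frame independence assumption, a single-Gaussian state assigns the same total score to any reordering of the frames it covers, so a smooth trajectory and a jumpy one built from the same frames are indistinguishable. The scalar features and state parameters below are made up for illustration.

```python
import math

def gauss_loglik(x, mean, var):
    # log N(x; mean, var) for a scalar observation
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def state_score(frames, mean=0.0, var=1.0):
    # HMM state score: a sum of per-frame log-likelihoods,
    # i.e. the frame-by-frame independence assumption
    return sum(gauss_loglik(x, mean, var) for x in frames)

smooth = [-0.9, -0.5, 0.0, 0.5, 0.9]   # a plausible trajectory
jumpy  = [0.9, -0.9, 0.5, -0.5, 0.0]   # the same frames, shuffled

# identical scores: the state cannot see the trajectory shape
print(math.isclose(state_score(smooth), state_score(jumpy)))  # True
```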
Segmental Modeling in HMMs
• Segmental properties are obviously important within phonemes and across multiple phonemes
• HMMs lose this longer time-scale view, despite the modifications made to the model over the years
• Attempts to make segmental statistical models have not been very successful so far
• Detailed trajectory properties were well preserved in the old template matching DTW systems
• …
Is example based
large vocabulary continuous recognition
a viable alternative
to model state based (HMM) recognition ?
Example Based ASR: Research Agenda
• Proof-of-concept phase
  – build a baseline mid/large vocabulary system with medium sized databases
  – show recognition performance similar to HMM systems
• Competitive phase
  – build systems that can handle huge databases
  – build systems that go beyond the naïve extrapolation of today’s HMMs
  – improve on performance at acceptable cost
HMM vs. Exemplar
                          HMM                               EXEMPLAR
Units                     Phones / allophones               Phone templates
Local similarity          Multi-gaussian distributions      Mahalanobis distance
Time alignment            Viterbi on HMM states             Dynamic Time Warping
Transition probabilities  Phonetic dictionary, LM           Phonetic dictionary, LM,
                                                            long-span speech attributes
Search                    Time-synchronous beam search      Time-synchronous beam search
Training                  • HMM parameters (multi-gaussian  • Labeling and segmentation of
                            distributions)                    the training database
                          • Type and number of allophonic   • Parameter estimation for the
                            variants                          distance metric
Example Based LVCSR: How ? Baseline System
• Speech database [= “memory”]
  – same databases as used for training statistical systems
  – collection of long stretches of acoustic vectors
  – annotated at multiple levels: phone, syllable, word, …
  – any of the annotations (incl. segmentation) can serve as a “template”
• Recognition paradigm
  – find the sequence of templates that best matches a given input, using Dynamic Time Warping (DTW)
  – use the ‘Template Transition Cost’ concept to control template transitions
• Borrow other components from existing HMM technology
  – token-passing time-synchronous beam search
  – N-gram language modeling
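The DTW step of the recognition paradigm above can be sketched in a few lines. This is a minimal version under our own simplifying assumptions: plain squared-Euclidean local distances and unit path costs, whereas the actual system uses a Mahalanobis-style metric, template transition costs and beam search.

```python
def dtw(seq_a, seq_b):
    """Minimal dynamic time warping between two sequences of feature
    vectors; returns the accumulated distortion of the best alignment."""
    n, m = len(seq_a), len(seq_b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # local distance: squared Euclidean between two frames
            d = sum((a - b) ** 2 for a, b in zip(seq_a[i - 1], seq_b[j - 1]))
            # allowed moves: diagonal match, insertion, deletion
            D[i][j] = d + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    return D[n][m]

template = [(0.0, 0.1), (0.5, 0.4), (1.0, 0.9)]              # stored exemplar frames
inp      = [(0.0, 0.0), (0.1, 0.1), (0.6, 0.5), (1.0, 1.0)]  # observed frames
print(dtw(template, inp))
```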
Aligning of trajectories
OBSERVE: the “closest matching template” is by no means the sequence of nearest neighbors for each frame
Issues in Example Based ASR: Local Distance Metric
• The utterance-level distortion is the sum of local (frame-based) distances
• One of the great advances of HMM systems was the use of more complex metrics than previously used in DTW:
  – class (phone state) dependent
  – multi-gaussian distributions with many parameters
• It is possible to transfer some of the HMM improvements to the DTW framework, but not all, and not in a trivial manner:
  – local Mahalanobis distance
  – further improvements by applying other ideas from non-parametric statistics: outlier correction, data sharpening, adaptive kernel Mahalanobis, … [see papers ICASSP07, INTERSPEECH07]
• Weakness of (our) current system:
  – the score is based on a single sequence of reference templates
  – from a KNN perspective the score should be based on group voting
• …. ongoing research
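The group-voting remark can be made concrete with a small sketch (our own illustration, not the system described here): instead of scoring a frame against the single nearest reference, a k-nearest-neighbour rule lets several reference frames vote on the label. The scalar features and phone labels are made up.

```python
from collections import Counter

def knn_label(frame, references, k=3):
    """Label one frame by majority vote among its k nearest reference
    frames (scalar features, absolute distance), rather than trusting
    the single nearest template."""
    nearest = sorted(references, key=lambda r: abs(r[0] - frame))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

refs = [(0.1, "I"), (0.2, "I"), (0.9, "E"), (0.15, "I"), (1.0, "E")]
print(knn_label(0.3, refs))   # "I": all 3 nearest reference frames are 'I'
```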
[Figure: input utterance and its solution. Of the 15.3 hours of speech from 84 speakers, only the 14 segments of 2 seconds relevant to the search are shown. Recognized phone string: # # I! t s tI! l @ n klI! r # # #]
[Figure: input utterance, the concatenation of the chosen templates, and the concatenation after dynamic time warping (DTW). Phone string: # # I! t s tI! l @ n klI! r # # #]
Controlling Template Concatenation
Using a concatenation cost model, based on:
  – natural successor templates in the reference database
  – phonetic context
  – gender, accent, recording condition, …
has great impact on:
  – selected segment length
  – naturalness of the resynthesized reference
  – lowering the error

                   # phone errors   # reference segments   avg. # templates per segment
Input                               55
No costs           12               48                     1.2
Optimal settings   3                19                     2.9
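A concatenation cost of the kind listed above might be sketched as follows. The metadata fields and penalty weights are hypothetical; in practice such costs would be tuned on development data.

```python
# Hypothetical penalty weights (tuned on development data in practice).
W_CONTEXT = 1.0   # phonetic left-context mismatch
W_GENDER  = 0.5   # speaker-gender mismatch
NAT_BONUS = 2.0   # reward for natural successors in the reference database

def transition_cost(prev, nxt):
    """Cost of concatenating template `nxt` after template `prev`.
    Templates are dicts with (hypothetical) metadata fields."""
    cost = 0.0
    # natural successor: the two templates are contiguous in the database
    if prev["speaker"] == nxt["speaker"] and prev["end"] == nxt["start"]:
        cost -= NAT_BONUS
    if prev["phone"] != nxt["left_context"]:
        cost += W_CONTEXT
    if prev["gender"] != nxt["gender"]:
        cost += W_GENDER
    return cost

a = {"speaker": "s1", "gender": "f", "phone": "t",
     "start": 800, "end": 812, "left_context": "#"}
b = {"speaker": "s1", "gender": "f", "phone": "I",
     "start": 812, "end": 830, "left_context": "t"}
print(transition_cost(a, b))   # -2.0: natural successor, everything matches
```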
Experiments - Task descriptions
• TIMIT
  – test and train sets hand-labeled
  – 1.6 hours of training material, 462 speakers
• Resource Management (RM)
  – 991-word lexicon: CMU v0.4 [train and test highly matched !]
  – 3.8 hours of noise-free training speech
  – 150,000 phone templates
• Wall Street Journal (WSJ0)
  – automatically segmented and labeled by an HMM system, based on the sentence transcription
  – 15.3 hours of training material, 84 speakers
  – 4986 words
  – 450,000 phone templates (?)
Phone string recognition (TIMIT)

Setup                       PER     % nat. suc.   % cont. match   % gend. match
Baseline                    31.3%   14%           48%             82%
Context costs               29.1%   19%           77%             83%
Gender costs                31.0%   16%           48%             99%
Natural successors          30.3%   49%           66%             91%
All costs, development set  27.6%   38%           83%             98%
All costs, evaluation set   29.6%   38%           83%             98%
Reference HMM               27.7%   N/A
Best in the literature      24.8%
Phone string recognition (WSJ0)

Setup                       PER     % nat. suc.   % cont. match   % gend. match
Baseline                    20.8%   7%            54%             85%
Context costs               16.9%   14%           86%             87%
Gender costs                20.3%   8%            54%             99%
Natural successors          15.5%   66%           80%             95%
All costs, development set  13.8%   60%           88%             99%
All costs, evaluation set   14.2%   59%           87%             99%
Reference HMM               14.9%   N/A
Best in the literature      N/A
Sentence recognition (Resource Management)

                      dev    oct89   feb91   sep92   avg
HMM WER (%)           2.26   2.83    2.13    5.08    3.35
DTW WER (%)           2.23   2.79    2.21    4.22    3.07
% natural successors  69     69      67      68      68
% matching contexts   94     94      94      93      94
% matching gender     99     98      98      98      98

Decoder speed: ±10 x real time
Bottom-up speed: ±4 x real time
Sentence recognition (WSJ0)

                      dev92   nov92
HMM WER (%)           6.74    5.10
DTW WER (%)           8.69    8.11
% natural successors  38      36
% matching contexts   84      81
% matching gender     98      98

Decoder speed: ±20 x real time
Bottom-up speed: ±17 x real time
Example based ASR: Discussion
• For some tasks (medium sized problems) we were able to build a system that matches or exceeds the performance of state-of-the-art HMM systems
  PROOF OF CONCEPT: OK
• Success is critically dependent on the ability to use multi-phone segments
  – the frame-based distance metric is not (yet) as powerful as with HMMs ! [single-nearest paradigm instead of KNN ?]
  – potentially better modeling of phone transitions than CD-HMMs [i.e. NO modeling !]
• Challenges in moving to large vocabulary tasks and very large databases:
  – richness of the database: very many contexts by very many speakers
  – move away from the naive HMM-like top-down search engine
  – make better use of the available data: normalize for speaker (VTLN), acoustics
Issues in Example Based ASR: Search Space Explosion
• any allophone can be represented by any of its examples
• the search space keeps growing with larger example databases: factor 100, 1,000, 10,000, ….
• large amount of redundant information
• hence a large inefficiency
• traditional pruning approaches will not be efficient
• early data-driven (bottom-up) pruning is essential
  [this was applied in all experiments, but not discussed]
PART II: ASR Search Techniques
Top-Down and Bottom-Up Combined
Top-Down Search Strategy: Concept
• hypothesize: all possible sentences allowed by the language model
• find: the one that best matches the observed acoustics (spectral-like frame-based parameters)
Top-Down Search Strategy: Time-Synchronous Implementation
• initialize: start with the dummy ‘start sentence’ word
• loop:
  – extend all hypotheses that are at or near word-end positions with all possible next words
    • find phone/template string equivalents for these extensions
  – fetch a new segment (frame) of data
    • incrementally compute the matching score between all hypotheses active in the search and the observed acoustics
  – order the hypotheses according to score and prune away
    • hypotheses that are ‘significantly’ worse than the best one
    • hypotheses that fall below the Top-N
• end: accept the Top-1 as the final result (best guess)
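The pruning step of the loop above can be sketched as follows (a simplification under our own assumptions: scores are accumulated negative log-likelihoods, lower is better, and hypotheses are plain strings).

```python
import heapq

def beam_prune(hyps, beam_width, top_n):
    """Keep hypotheses whose score lies within `beam_width` of the best
    one, and at most `top_n` of them.  `hyps` is a list of
    (score, hypothesis) pairs; lower scores are better."""
    if not hyps:
        return []
    best = min(score for score, _ in hyps)
    # beam pruning: drop 'significantly' worse hypotheses
    survivors = [(s, h) for s, h in hyps if s <= best + beam_width]
    # histogram pruning: keep only the Top-N
    return heapq.nsmallest(top_n, survivors)

hyps = [(10.0, "it's still"), (10.8, "it steel"),
        (25.0, "eat still"), (11.5, "id still")]
print(beam_prune(hyps, beam_width=2.0, top_n=2))
```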
Top-Down Search Strategy: Why has it been so successful ?
• The language model constraints are very restrictive
  – therefore it makes sense to apply them first
  – strong (overweighted) language models have been an essential ingredient of many commercial successes in ASR
• The top-down search is very tolerant of errors in one of the weakest links of the beads-on-a-string model:
  – errors in the phonetic dictionary are abundant
    • pronunciation dictionaries don’t contain all possible pronunciations
    • people don’t talk the way they are supposed to talk
  – but substituted/missing/inserted phone segments are absorbed by forcing ‘a few’ frames to align with the presumed phone
    • this mismatch cost may not be so big, because
      – HMM scores decay smoothly as points move further from the class centroid
      – HMMs will stretch or compress segments to their own benefit
At the opposite end: The Intuitive Bottom-Up Recognition Paradigm
[Figure: layered pipeline. speech signal → spectral analysis, feature extraction, noise suppression, … → phonetic features → phone(tic) recognition → phones → word/sentence recognition → words]
Bottom-Up: Why did it fail ?
• Intuitive bottom-up was the paradigm of choice in the early days of speech recognition (1970s)
• The prototypical implementation has been:
  – recognize the next layer on the basis of the layer immediately below
  – acknowledge that recognition will be imperfect; to cope with this, develop statistical pattern matching techniques that allow for insertions/deletions/substitutions
• The biggest failure of this paradigm is to use a single best recognition as the information carrier between two layers, as
  – errors propagate at great speed, and multiply, throughout the search network
  – the acoustic-phonetic recognition is not good enough to allow any error correction paradigm to function well
Bottom-Up and Top-Down: Essential Weaknesses
• Bottom-up
  – difficult to recover from recognition errors in lower layers
  – the correct hypothesis might never get activated
• Top-down
  – the linguistic universe is limited to a restrictive predefined language model
  – difficult (impossible) to discover new things
  – practically impossible for an LVCSR example based system
What’s in the middle of all this ?
The phoneme concept
Are Phone(me)s Real ?
• speech signal: ‘given’, thus unambiguous, but contains massive amounts of non-phonetic information/noise
• phones (allophones, phonemes): a convenient intermediate level, both for humans and machines, but ill-defined and highly ambiguous
• words (morphemes): the conceptual level, quite unambiguously recognized on the basis of the acoustics
Recognition Models with Early Abstraction
[Figure: bottom-up recognition of low-level abstract units produces a phone graph; a top-down search then finds the best possible word (morpheme) sequence on the basis of this uncertain phonetic information]
Recognition Model with Early Abstraction
[Figure: pipeline. speech signal → spectral analysis, feature extraction, noise suppression, … → probabilistic phone recognition → phonemes → phone graph → top-down search engine (driven by LM + phonetic dict.) with the phone graph as input → words]
It could make sense
• Bottom-up / early abstraction is required for many skills:
  – “fast match”
  – new word recognition
  – nonsense word recognition
• Fully top-down was/is an engineering/economic necessity
• Phone recognition is influenced by top-down linguistic processes w.r.t.:
  – recognition speed
  – linguistic overrules
If we can get it to work: Phone Graph Quality
• the phone graph error rate should be low (a few %)
• the phone graph density should be moderate
  – search on the phone graph should not be slower than on the frame data
  – very bad matches should NOT be included
    • as their acoustic scores make little or no sense
    • a more abstract ‘substitution/insertion/deletion’ score will make more sense
• error model
  – should serve to overcome genuine phone errors
    • dictionary mistakes
    • gross pronunciation mistakes
  – should be gentle on the search effort
If we can get it to work: Error Model
• should serve to overcome genuine phone errors
  – dictionary mistakes
  – gross pronunciation mistakes
• should be gentle on the search effort
  – generic insertion/deletion/substitution would again make the search explode
  – “single error” model: each error must be embedded between 2 phones found in the graph

                              RM      WSJ 5k   WSJ 20k
PhER (1-best)                 10.1%   9.42%    11.62%
WER (all-in-one)              3.08%   3.96%    9.99%
WER (graph, no error model)   5.00%   5.53%    11.98%
WER (graph, 1-error model)    3.14%   3.89%    9.98%
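The single-error idea can be illustrated with a sketch under strong simplifying assumptions of ours: the phone graph is approximated by a confusion network (one set of candidate phones per slot), and a word is acceptable when its dictionary pronunciation aligns with the network with at most one substitution/insertion/deletion. The real system works on genuine phone graphs and additionally requires each error to be embedded between two matched phones; the phone labels here are hypothetical.

```python
def errors_vs_network(pron, slots):
    """Minimum number of substitutions/insertions/deletions needed to
    align dictionary pronunciation `pron` with a confusion-network
    approximation of the phone graph (`slots`: list of candidate sets)."""
    n, m = len(pron), len(slots)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        D[i][0] = i
    for j in range(m + 1):
        D[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if pron[i - 1] in slots[j - 1] else 1
            D[i][j] = min(D[i - 1][j - 1] + sub,  # match / substitution
                          D[i - 1][j] + 1,        # phone missing from the graph
                          D[i][j - 1] + 1)        # spurious phone in the graph
    return D[n][m]

slots = [{"I", "E"}, {"t", "d"}, {"s"}]
print(errors_vs_network(["I", "t", "s"], slots))  # 0: an exact path exists
print(errors_vs_network(["I", "p", "s"], slots))  # 1: one substitution needed
```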
New Opportunities
• Assuming a quite dense phone graph with few errors:
  – the total search effort is significantly smaller than in fully top-down [FLAVOR !!]
  – possibility to model more complex linguistic knowledge sources
  – a way out for controlling the search problem of example based systems !!
Experiments with combined system (constrained lexicon on RM)
                    dev.   oct89   feb91   sep92   avg.   impr.
HMM baseline        2.26   2.83    2.13    5.08    3.35   0%
Min. graph error    0.12   0.22    0.08    0.43    0.24   93%
DTW bottom-up       3.55   3.76    2.98    4.88    3.87   -16%
DTW on graph        3.05   4.43    3.14    5.35    4.31   -29%
phone combine       1.87   2.65    1.69    3.52    2.62   22%
phone/bi/tri comb.  1.68   2.24    1.65    3.32    2.40   28%
phone no gender     2.11   2.65    2.01    3.66    2.77   17%
phone no concat     2.23   2.94    2.25    4.77    3.32   0%
phone no DTW        2.38   2.50    2.13    3.79    2.81   16%
Combined HMM + Exemplar ASR System
Conclusions
• Exemplar based recognition can improve on HMM results if longer-than-phoneme units can be used in a productive way
• Abstractionist bottom-up phonetic recognition can be a very useful component in ASR for:
  – fast match, in conjunction with
    • more complex linguistic models
    • a more efficient exemplar system
  – discovery of out-of-vocabulary words
About local distances
• Basic concept: take the shape of the class (phone identity) into account when measuring the distance
[Figure: a point x lies Euclidean-equidistant between class means y_A and y_B, but class A is much broader than class B]
• Euclidean viewpoint: d(x, y_B) = d(x, y_A)
• Mahalanobis viewpoint: d(x, y_B) = 2 . d(x, y_A)
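The two viewpoints can be reproduced numerically; the class means and variances below are chosen purely for illustration so that the factor of 2 from the slide appears.

```python
def euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) ** 0.5

def mahalanobis(x, y, var):
    # diagonal-covariance Mahalanobis distance to a class with
    # per-dimension variances `var`
    return sum((a - b) ** 2 / v for a, b, v in zip(x, y, var)) ** 0.5

x     = (1.0, 0.0)
y_a   = (0.0, 0.0)    # mean of the broad class A
y_b   = (2.0, 0.0)    # mean of the tight class B
var_a = (4.0, 4.0)    # class A has 4x the variance of class B
var_b = (1.0, 1.0)

print(euclidean(x, y_a) == euclidean(x, y_b))                    # True: equidistant
print(mahalanobis(x, y_b, var_b) / mahalanobis(x, y_a, var_a))   # 2.0
```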
Is this pronunciation variation modeling ?
• it’s only part of it; we also assume:
  – phonetic dictionaries containing ALL standard pronunciation variants
  – pronunciation rules for continuous speech
• but we acknowledge that:
  – knowledge sources contain mistakes
  – people don’t pronounce the phonemes they are supposed to pronounce according to the dictionary
  – any rule set trying to explain everything gives too much weight to low-probability events and makes the search explode (once again)