Introduction to Discriminative Training in Speech Recognition
Ralf Schlüter, Georg Heigold
Lehrstuhl für Informatik 6, Human Language Technology and Pattern Recognition
Computer Science Department, RWTH Aachen University, D-52056 Aachen, Germany
January 14, 2010
Schluter: Introduction to Discriminative Training in Speech Recognition 1 January 14, 2010
Contents
Introduction
Training Criteria
Parameter Optimization
Efficient Calculation of Discriminative Statistics
Generalisation to Log-Linear Modeling
Convex Optimization
Incorporation of Margin Concept
Conclusions
Annex
Outline
Introduction
    Motivation
    Overview
Training Criteria
Parameter Optimization
Efficient Calculation of Discriminative Statistics
Generalisation to Log-Linear Modeling
Convex Optimization
Incorporation of Margin Concept
Conclusions
Annex
Motivation

Aim of discriminative methods: improve class separation

- standard maximum likelihood (ML) training: maximize the reference class conditional p_θ(x|c)
- maximum mutual information (MMI) training: maximize the reference class posterior

      p_θ(c|x) = p(c) · p_θ(x|c) / Σ_{c'} p(c') · p_θ(x|c')

Where's the difference?

- Ideally: (almost) no difference! With infinite training data and correct model assumptions, the true probabilities are obtained in both cases. They lead to equal decisions, provided the class prior p(c) is known. (Proof: model-free optimization.)
- ML training: classes are handled independently, so decision boundaries are not considered explicitly in training.
- In MMI training, and in discriminative training generally, the reference class competes directly against all other classes, so decision boundaries become relevant in training.
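As a numeric sketch of the contrast above (a hypothetical two-class toy example, not from the slides): the ML objective scores only the reference class conditional p_θ(x|c), while the MMI objective scores the posterior p_θ(c|x), normalized over all competing classes.

```python
import math

# Hypothetical toy values: class priors p(c) and class conditionals p_theta(x|c)
# for a single observation x.
prior = {"c1": 0.5, "c2": 0.5}
likelihood = {"c1": 0.6, "c2": 0.2}

def posterior(c, prior, likelihood):
    """Class posterior p_theta(c|x) = p(c) p_theta(x|c) / sum_c' p(c') p_theta(x|c')."""
    denom = sum(prior[cp] * likelihood[cp] for cp in prior)
    return prior[c] * likelihood[c] / denom

# ML maximizes log p_theta(x|c_ref); MMI maximizes log p_theta(c_ref|x),
# which also depends on how well the competitor c2 explains x.
ml_score = math.log(likelihood["c1"])
mmi_score = math.log(posterior("c1", prior, likelihood))
print(round(posterior("c1", prior, likelihood), 3))  # 0.75
```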
Motivation

- In practice, model assumptions are incorrect and training data is limited. Here discriminative training can be beneficial.

Example: a two-class problem (with pooled covariance matrix)

[Figure: two scatter plots of classes -1 and +1 in the (x, y) plane showing the ML and MMI decision boundaries; in the first panel the boundaries coincide (ML/MMI), in the second panel, which contains an outlier, ML and MMI differ.]

- Clearly, in the case of ML training, the outlier deteriorates the decision boundary, whereas MMI training registers the minor importance of the outlier.
- MMI captures the decision boundary, although the model assumption (pooled covariance) does not fit in the second case.
Overview
Questions:

- Which discriminative criterion to take?
- Relation to decision rule and evaluation measure?
- How to optimize the criterion?
- Efficiency?
- Influence of modeling?
- Uniqueness of the solution?
- Generalization?

Bottom line:

- How to utilize the available training material to obtain optimum recognition performance?
Outline

Introduction
Training Criteria
    Notation
    General Approach
    Probabilistic Training Criteria
    Error-Based Training Criteria
    Practical Issues
    Comparative Experimental Results
Parameter Optimization
Efficient Calculation of Discriminative Statistics
Generalisation to Log-Linear Modeling
Convex Optimization
Incorporation of Margin Concept
Conclusions
Annex
Notation
X_r          sequence x_{r,1}, x_{r,2}, ..., x_{r,T_r} of acoustic observation vectors
W_r          spoken word sequence w_{r,1}, w_{r,2}, ..., w_{r,N_r} in training utterance r
W            any word sequence
p(W)         language model probability, assumed to be given
p_θ(X_r|W)   acoustic emission probability / acoustic model
θ            set of all parameters of the acoustic model
M_r          set of competing word sequences to be considered
f            smoothing function
General Approach

Training

- input: training data and stochastic model p_θ(X,W) with free model parameters θ
- output: "optimal" model parameters θ
- optimality defined via a training criterion:

      θ := arg max_θ { F(θ) }

Unified training criterion [Macherey+ 2005]:

      F(θ) = Σ_{r=1}^{R} f( log[ Σ_W p(W) p_θ(X_r|W) · A(W,W_r) / Σ_{W∈M_r} p(W) p_θ(X_r|W) ] )

- covers maximum mutual information (MMI), minimum classification error (MCE), minimum phone/word error (MPE/MWE)
- control set M_r of competing hypotheses, cost function, smoothing function, scaling of models (not shown)
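A toy instance of the unified criterion may help (hypothetical hypothesis set and scores, f chosen as the identity): with A(W, W_r) set to 1 for the reference and 0 otherwise, the criterion reduces to the MMI criterion, i.e. the log posterior of the reference.

```python
import math

# Hypothetical hypotheses for one utterance r: language model prob p(W),
# acoustic score p_theta(X_r|W), and accuracy A(W, W_r) w.r.t. the reference.
hyps = {
    "the cat": {"lm": 0.4, "am": 0.5, "acc": 1.0},  # reference itself
    "a cat":   {"lm": 0.3, "am": 0.3, "acc": 0.5},
    "the bat": {"lm": 0.3, "am": 0.2, "acc": 0.5},
}

def unified_criterion(hyps, f=lambda x: x):
    """F = f( log( sum_W p(W) p(X|W) A(W,Wr) / sum_W p(W) p(X|W) ) )."""
    num = sum(h["lm"] * h["am"] * h["acc"] for h in hyps.values())
    den = sum(h["lm"] * h["am"] for h in hyps.values())
    return f(math.log(num / den))

# Specialize A(W, Wr) to a 0/1 indicator of the reference -> MMI criterion.
mmi_hyps = {w: {**h, "acc": 1.0 if w == "the cat" else 0.0} for w, h in hyps.items()}
print(unified_criterion(mmi_hyps))
```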
Probabilistic Training Criteria
Objective
- find a good estimate of the probability distribution
- optimality regarding error via Bayes decoding (asymptotic w.r.t. the amount of training data)
Maximum Likelihood (ML)

- optimization of the joint probability:

      arg max_θ Σ_r log( p(W_r) p_θ(X_r|W_r) ) = arg max_θ Σ_r log p_θ(X_r|W_r)

- Tutorial on HMMs [Rabiner 1989].
- Maximization of the probability of the reference word sequences (classes).
- Model correctness is important.
- HMM: maximization for each class separately.
- Neglects competing classes.
- Expectation-maximization: local convergence guaranteed.
- Estimation efficient, easily parallelizable.
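The class-wise independence noted above can be seen in a minimal sketch (hypothetical 1-D Gaussian emissions rather than full HMMs): each class's parameters are estimated from that class's samples alone, so competing classes never enter the estimate.

```python
# Hypothetical per-class samples; each class is fit independently under ML.
data = {"c1": [1.0, 2.0, 3.0], "c2": [8.0, 9.0, 10.0]}

def ml_gaussian(samples):
    """ML estimate of a 1-D Gaussian: sample mean and (biased) ML variance."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var

# Removing or changing class c2 leaves the estimate for c1 untouched.
params = {c: ml_gaussian(xs) for c, xs in data.items()}
print(params["c1"])
```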
Maximum Mutual Information (MMI)

- optimization of the conditional probability:

      arg max_θ Σ_r log p_θ(W_r|X_r) = arg max_θ Σ_r log[ p(W_r) p_θ(X_r|W_r) / Σ_V p(V) p_θ(X_r|V) ]

- Considers competing classes and therefore decision boundaries.
- Necessitates a set of competing classes on the training data.
- Optimization for standard modeling (HMMs, mixture distributions): only gradient descent or similar.
- Optimization using log-linear modeling: convex problem.
- First application of MMI for ASR using discrete HMMs [Bahl+ 1986]: 2000 isolated words, 18% rel. improvement in word error rate.
- MMI for discrete and continuous probability densities [Brown 1987]: isolated E-set letters, 18% rel. improvement in recognition rate.
- MMI for discrete and continuous probability densities [Normandin 1991]: digit strings, up to 50% rel. improvement in string error rate.
Error-Based Training Criteria
Objective: optimize some error measure directly, e.g.:

- Empirical recognition error on the training data
  - Advantage: direct relation to the decision rule
  - Problem: non-differentiable training criterion; differentiable approximations used in practice
  - Problem: ASR classes (words/word sequences) difficult to handle
- Model-based expected error on the training data
  - Advantage: word or phoneme error easy to handle
  - Usually an approximated word/phoneme error, but the correct edit distance is also viable [Heigold+ 2005]
  - Relation to the decision rule less straightforward
  - Overtraining and generalization become an issue (→ regularization, margin)
Minimum classification error (MCE)

- For ASR: minimization of the smoothed empirical sentence error [Juang & Katagiri 1992, Chou+ 1992]:

      arg min_θ (1/R) Σ_{r=1}^{R}  1 / ( 1 + [ p_θ^α(X_r|W_r) · p^α(W_r) / Σ_{W≠W_r} p_θ^α(X_r|W) · p^α(W) ]^{2ρ} )

- Smoothing parameters α and ρ.
- Upper bound on the Bayes error rate for any acoustic model [Schluter+ 2001].
- Lesser effect of incorrect model assumptions.
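A minimal sketch of the smoothed sentence-error term for a single utterance (hypothetical scores; computed in the log domain for numerical stability): when the reference dominates the competitors the loss approaches 0, and when it is dominated the loss approaches 1, i.e. one sentence error.

```python
import math

def mce_loss(ref_logp, comp_logps, alpha=1.0, rho=1.0):
    """Smoothed 0/1 loss 1 / (1 + (p_ref / sum_competitors)^(2*rho)),
    with all scores scaled by alpha; inputs are joint log-probabilities."""
    ref = alpha * ref_logp
    comp = math.log(sum(math.exp(alpha * lp) for lp in comp_logps))
    d = ref - comp  # log ratio: reference mass vs. competing mass
    return 1.0 / (1.0 + math.exp(2.0 * rho * d))

# Reference clearly wins -> loss near 0; reference clearly loses -> loss near 1.
print(mce_loss(-1.0, [-5.0, -6.0]), mce_loss(-10.0, [-1.0]))
```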
Minimum word/phone error (MWE/MPE)

- minimization of the model-based expected word/phone error on the training data [Povey & Woodland 2002]:

      arg max_θ Σ_{r=1}^{R} [ Σ_W A(W,W_r) p(W) p_θ(X_r|W) / Σ_W p(W) p_θ(X_r|W) ]

- Criterion: maximum expected accuracy A(W,W_r).
- Accuracy usually approximate, but the exact case based on the edit (Levenshtein) distance is also possible [Heigold+ 2005].
- Regularization (e.g. I-smoothing [Povey & Woodland 2002]) necessary due to overtraining.
- Usually better than MMI and MCE.
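The expected-accuracy objective above can be sketched for one utterance with the exact edit (Levenshtein) distance as the accuracy, A(W, W_r) = #reference words - edit distance (hypothetical toy hypotheses with already-normalized joint scores):

```python
def levenshtein(a, b):
    """Edit distance between word sequences a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (wa != wb)))
        prev = cur
    return prev[-1]

def expected_accuracy(ref, hyps):
    """sum_W A(W,Wr) p(W) p(X|W) / sum_W p(W) p(X|W)."""
    den = sum(p for _, p in hyps)
    num = sum((len(ref) - levenshtein(w, ref)) * p for w, p in hyps)
    return num / den

ref = ["the", "cat", "sat"]
hyps = [(["the", "cat", "sat"], 0.5), (["the", "bat", "sat"], 0.3), (["cat", "sat"], 0.2)]
print(expected_accuracy(ref, hyps))
```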
Practical Issues
- Importance of the language model in training the acoustic model.
- Relative and absolute scaling of the language and acoustic models in training.
- Necessity of recognizing the training data.
- Efficient calculation of discriminative training statistics using word lattices.
Language Models for Discriminative Training

Potential importance of the language model choice:

- language model for recognition of alternative word sequences
- language model dependence of the discriminative training criterion itself
- interaction of the language model with the acoustic model parameters

Correlation hypothesis: only those acoustic models need optimization which, even together with a language model, do not discriminate sufficiently.
→ the language model choice would correlate for training and recognition

Masking hypothesis: the language model usually improves recognition accuracy considerably and might mask deficiencies of the acoustic models.
→ suboptimal language models for training would give better performance
Language Model for Discriminative Training

- Discriminative training includes the language model.
- In training, a unigram language model usually leads to the best word error rates [Schluter+ 1999] (WSJ 5k); word error rates in [%]:

  recog LM   train LM   criterion   dev    eval   dev & eval
  bi         -          ML          6.91   6.78   6.86
  bi         zero       MMI         6.71   6.03   6.41
  bi         uni        MMI         6.59   6.00   6.33  (-8%)
  bi         bi         MMI         6.71   6.20   6.48
  bi         tri        MMI         6.87   6.54   6.72
  tri        -          ML          4.82   4.11   4.51
  tri        zero       MMI         4.63   4.05   4.38
  tri        uni        MMI         4.30   3.64   4.01  (-11%)
  tri        bi         MMI         4.48   3.94   4.24
  tri        tri        MMI         4.58   4.00   4.33
Scaling of likelihoods

- recognition: absolute scaling of likelihoods is irrelevant (language model scale vs. acoustic model scale)
- absolute scaling does have an impact on word posterior calculation [Wessel+ 1998, Woodland & Povey 2000]
- use the language model scale β also in training:

      p(X,W) = p(W)^β p_θ(X|W)

- replace p(X,W) with:

      p(X,W)^γ = p(W)^{βγ} p_θ(X|W)^γ   for γ ∈ [0, 1]

- optimum approximately for γ = 1/β, i.e. use

      p(X,W)^{1/β} = p(W) p_θ(X|W)^{1/β}

- For simplicity, usually omitted in the equations here.
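The effect of the global scale γ on the posteriors can be sketched numerically (hypothetical joint log scores; a typical language model scale β around 10-20 is an assumption here, not a value from the slides): unscaled posteriors are extremely sharp, while scaling by γ = 1/β spreads probability mass over the competitors.

```python
import math

# Hypothetical joint log scores log p(X, W) for competing word sequences.
scores = {"W1": -100.0, "W2": -101.0, "W3": -103.0}

def scaled_posteriors(scores, gamma):
    """p(W|X) with all joint scores scaled by gamma before normalization."""
    m = max(scores.values())  # subtract max for numerical stability
    exps = {w: math.exp(gamma * (s - m)) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

sharp = scaled_posteriors(scores, 1.0)        # gamma = 1: near winner-takes-all
flat = scaled_posteriors(scores, 1.0 / 15.0)  # gamma = 1/beta: smoother posteriors
print(round(sharp["W1"], 3), round(flat["W1"], 3))
```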
Competing Word Sequences
- Problem: exponential number of competing word sequences.
- Competing word sequences need to be estimated:
  - hypothesis generation on training data using the recognizer
  - initial lattice generation using the recognizer is sufficient
  - later acoustic model rescoring constrained to the lattice
- Representation and processing of competing word sequences:
  - efficient algorithms to process word lattices
  - generic implementation: weighted finite state transducers
Competing Word Sequences
History:

- best recognized word sequence for MMI (corrective training) [Normandin 1991]:
  - considers incorrectly recognized training sentences only
- best incorrectly recognized word sequence for MCE [Juang & Katagiri 1992]:
  - interpretation as a smoothed sentence error still valid
- N-best recognized word sequences for MMI [Chow 1990]:
  - continuous speech recognition, 1000 words
  - only minor improvements in word error rate
- word graphs from recognition for MMI training [Valtchev+ 1997]:
  - large vocabulary, 64k words
  - efficient implementation
  - 5-10% relative improvement in word error rate
Comparative Experimental Results
WER [%]:

         SieTill   WSJ 5k         EPPS English            Mandarin BN/BC
  Crit.  Test      Dev    Evl     Dev06   Evl06   Evl07   Dev07   Evl06
  ML     1.81      4.55   3.74    14.4    10.8    12.0    15.1    21.9
  MMI    1.79      4.07   3.53    13.8    11.0    12.0    14.4    20.8
  MCE    1.69      4.02   3.47    13.8    11.0    11.9
  MWE              3.98   3.44
  MPE              4.17   3.62    13.4    10.2    11.5    14.2    20.6

- SieTill [Schluter 2000]
- WSJ 5k [Macherey 2010]
- EPPS/broadcasts [Heigold 2010]
Outline
Introduction
Training Criteria
Parameter Optimization
    Motivation
    Gradient descent
    Rprop
    Formal gradient of MMI
    Formal gradient of MPE
Efficient Calculation of Discriminative Statistics
Generalisation to Log-Linear Modeling
Convex Optimization
Incorporation of Margin Concept
Conclusions
Annex
Motivation
Goal: an optimization method for discriminative training criteria F(θ) w.r.t. the set of parameters θ that provides reasonable convergence.

- Various approaches, e.g.:
  - extended Baum-Welch (EBW) [Normandin 1991]
  - gradient descent, study: e.g. [Valtchev 1995]
  - MMI with log-linear models: generalized iterative scaling (GIS)
  - generalization of GIS to log-linear models with hidden variables and further criteria like MPE and MCE [Heigold+ 2008a]
- Problems:
  - robust setting of step sizes/iteration constants (EBW and gradient descent)
  - convergence speed (especially GIS)
Extended Baum-Welch
- Motivated by a growth transformation [Gopalakrishnan+ 1991].
- Widely used for discriminative training of Gaussian mixture HMMs, e.g. [Normandin 1991, Valtchev+ 1997, Schluter 2000, Woodland & Povey 2002].
- Highly optimized heuristics for finding the right order of magnitude of the iteration constants.
- Training of Gaussian mixture HMMs: positive variances required to obtain an estimate for the iteration constants.
Gradient descent
Follow the gradient to optimize the parameters:

    θ ← θ + γ ∇_θ F(θ)

Step sizes:

- heuristic, e.g. for MCE [Chou+ 1992]
- by comparison to EBW [Schluter 2000]

Convergence:

- local optimum
- better convergence: general-purpose approaches, e.g. Qprop, Rprop, or L-BFGS; for experimental comparisons see [McDermott & Katagiri 2005, McDermott+ 2007, Gunawardana+ 2005, Mahajan+ 2006]
Rprop [Riedmiller & Braun 1993]
General-purpose gradient-based optimization:

- assume iteration n
- parameter update:

      θ_i^(n+1) = θ_i^(n) + γ_i^(n) · sign( ∂F(θ^(n)) / ∂θ_i )

- update of the step sizes γ_i^(n):

      γ_i^(n+1) = min{ γ_i^(n) · η+, γ_max }   if ∂F(θ^(n))/∂θ_i · ∂F(θ^(n-1))/∂θ_i > 0
                  max{ γ_i^(n) · η-, γ_min }   if ∂F(θ^(n))/∂θ_i · ∂F(θ^(n-1))/∂θ_i < 0
                  γ_i^(n)                      otherwise

- η+ ∈ (1, ∞), η- ∈ (0, 1)
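The update rules above can be sketched directly (a simple Rprop variant without weight-backtracking; the toy criterion and all constants are illustrative assumptions, not values from the slides):

```python
def rprop_step(theta, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=1.0):
    """One Rprop iteration: move each parameter by its own step size in the
    direction of the sign of its partial derivative (ascent on F)."""
    new_theta, new_step = [], []
    for i in range(len(theta)):
        if grad[i] * prev_grad[i] > 0:        # same sign twice: accelerate
            s_i = min(step[i] * eta_plus, step_max)
        elif grad[i] * prev_grad[i] < 0:      # sign flip: slow down
            s_i = max(step[i] * eta_minus, step_min)
        else:
            s_i = step[i]
        sign = (grad[i] > 0) - (grad[i] < 0)  # sign of the partial derivative
        new_theta.append(theta[i] + s_i * sign)
        new_step.append(s_i)
    return new_theta, new_step

# Toy criterion F(theta) = -(theta - 3)^2 with gradient -2(theta - 3); maximum at 3.
theta, step, prev = [0.0], [0.1], [0.0]
for _ in range(100):
    grad = [-2.0 * (theta[0] - 3.0)]
    theta, step = rprop_step(theta, grad, prev, step)
    prev = grad
print(theta[0])
```

Note that only the sign of the gradient enters the update, which is what makes Rprop robust against the widely varying gradient magnitudes of discriminative criteria.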
Formal gradient of MMI
I notation:
  I W: word sequence w_1, ..., w_N
  I r: index of training segment/utterance given by (X_r, W_r)
  I X_r: acoustic observation vector sequence x_r1, ..., x_rT
  I W_r: reference/spoken word sequence w_r1, ..., w_rN
  I s_1^T: HMM state sequence s_1, ..., s_T
I MMI training criterion:

  F_MMI(θ) = Σ_r log( p(W_r) p_θ(X_r|W_r) / Σ_W p(W) p_θ(X_r|W) )
           = Σ_r ( log p(W_r) p_θ(X_r|W_r) − log Σ_W p(W) p_θ(X_r|W) )

I acoustic model (HMM):

  p_θ(X_r|W) = Σ_{s_1^{T_r}: W} Π_{t=1}^{T_r} p(s_t|s_{t−1}) p_θ(x_rt|s_t)
Derivative of MMI w.r.t. Parameter
Gradient of the MMI criterion:

∇_θ F_MMI(θ) = Σ_r ( ∇_θ log p_θ(X_r|W_r) − Σ_W p(W) p_θ(X_r|W) ∇_θ log p_θ(X_r|W) / Σ_{W'} p(W') p_θ(X_r|W') )

For efficient evaluation, consider the derivative of the acoustic model, ∇_θ log p_θ(X_r|W).
Derivative of MMI w.r.t. Parameter
Derivative of the acoustic model:

∇_θ log p_θ(X_r|W)
  = ∇_θ log Σ_{s_1^{T_r}: W} Π_{t=1}^{T_r} p_θ(x_rt|s_t) p(s_t|s_{t−1})
  = Σ_{t=1}^{T_r} Σ_{s_1^{T_r}: W} ( ∇_θ log p_θ(x_rt|s_t) ) · Π_{t'=1}^{T_r} p_θ(x_rt'|s_t') p(s_t'|s_{t'−1}) / Σ_{σ_1^{T_r}: W} Π_{τ=1}^{T_r} p_θ(x_rτ|σ_τ) p(σ_τ|σ_{τ−1})
  = Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_rt|s) ) · Σ_{s_1^{T_r}: s_t=s} p_θ(X_r, s_1^{T_r}|W) / p_θ(X_r|W)
  = Σ_{t=1}^{T_r} Σ_s γ_rt(s|W) · ∇_θ log p_θ(x_rt|s)

with the word sequence conditioned state posterior (occupancy):

γ_rt(s|W) = Σ_{s_1^{T_r}: s_t=s} p_θ(X_r, s_1^{T_r}|W) / p_θ(X_r|W) = p_θ,t(s|X_r, W)
Derivative of MMI w.r.t. Parameter
Resubstitute the derivative of the acoustic model into the derivative of the MMI criterion:

∇_θ F_MMI(θ) = Σ_r Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_rt|s) ) · ( γ_rt(s|W_r) − Σ_W p(W) p_θ(X_r|W) γ_rt(s|W) / Σ_{W'} p(W') p_θ(X_r|W') )
             = Σ_r Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_rt|s) ) · ( γ_rt(s|W_r) − γ_rt(s) )

with the general state posterior (occupancy):

γ_rt(s) = Σ_W p(W) p_θ(X_r|W) γ_rt(s|W) / Σ_{W'} p(W') p_θ(X_r|W') = p_θ,t(s|X_r)
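As a sanity check on this occupancy form of the gradient, consider a toy example (an illustration, not from the slides): two competing "word" hypotheses with fixed Viterbi state alignments, 1-D unit-variance Gaussian emission models, and a uniform language model. The analytic gradient built from γ_rt(s|W_r) − γ_rt(s) can then be verified against finite differences:

```python
import math

# toy setup: 1-D observations, 2 HMM states, 2 word hypotheses;
# each hypothesis has a fixed (Viterbi) state alignment s_t(W)
x = [0.3, -0.4, 1.2]                        # observations x_1..x_T
align = {"W1": [0, 0, 1], "W2": [1, 1, 0]}  # state alignments per hypothesis
ref = "W1"                                  # spoken/reference word sequence

def log_gauss(xt, m):                       # log N(x; m, 1)
    return -0.5 * math.log(2 * math.pi) - 0.5 * (xt - m) ** 2

def log_p_acoustic(mu, W):                  # log p(X|W), Viterbi approximation
    return sum(log_gauss(xt, mu[s]) for xt, s in zip(x, align[W]))

def f_mmi(mu):                              # MMI criterion, uniform LM prior
    num = log_p_acoustic(mu, ref)
    den = math.log(sum(math.exp(log_p_acoustic(mu, W)) for W in align))
    return num - den

def grad_mmi(mu):
    # word posteriors p(W|X) give the denominator occupancies gamma_rt(s)
    scores = {W: math.exp(log_p_acoustic(mu, W)) for W in align}
    Z = sum(scores.values())
    post = {W: scores[W] / Z for W in align}
    grad = [0.0, 0.0]
    for t, xt in enumerate(x):
        for s in (0, 1):
            gamma_num = 1.0 if align[ref][t] == s else 0.0
            gamma_den = sum(post[W] for W in align if align[W][t] == s)
            # d/dmu_s log N(x_t; mu_s, 1) = (x_t - mu_s)
            grad[s] += (gamma_num - gamma_den) * (xt - mu[s])
    return grad

mu = [0.0, 1.0]
g = grad_mmi(mu)
eps = 1e-6
for s in (0, 1):                            # finite-difference check
    mu_p = list(mu); mu_p[s] += eps
    mu_m = list(mu); mu_m[s] -= eps
    assert abs(g[s] - (f_mmi(mu_p) - f_mmi(mu_m)) / (2 * eps)) < 1e-6
```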
Efficient Calculation of State Occupancies
In general:
I efficient calculation of the spoken word sequence conditioned state occupancy γ_rt(s|W_r): forward-backward state probabilities on the trellis of the word sequence
I efficient calculation of the general state occupancy γ_rt(s): forward-backward probabilities on the trellis of a word lattice

Viterbi approximation:
I γ_rt(s|W) = δ_{s, s_rt(W)} with forced alignment S_r(W) = s_r1(W), ..., s_rT_r(W) of the spoken word sequence
I assume a (word) lattice M_r for utterance r, with edges ω representing a word w(ω) (in context) with start time t_s(ω) and end time t_e(ω), and a corresponding forced alignment s_{t_s}^{t_e}(ω). An edge sequence W ∈ M_r then corresponds to the word sequence W(W). Consequently, the language model and acoustic model can also be defined for an edge sequence, which then might specify word boundaries, phonetic and language model context.
Word Posterior Probabilities
For the general state occupancy in Viterbi approximation we obtain:

γ_rt(s) = Σ_W p(W) p_θ(X_r|W) δ_{s, s_rt(W)} / p_θ(X_r)
        = Σ_ω δ_{s, s_rt(ω)} · Σ_{W: ω∈W} p(W) p_θ(X_r|W) / p_θ(X_r)
        = Σ_ω δ_{s, s_rt(ω)} · p(ω|X_r)

with the edge (or word in context) posterior

p(ω|X_r) = Σ_{W: ω∈W} p(W) p_θ(X_r|W) / p_θ(X_r)

A forward-backward algorithm is used to efficiently compute edge (word in context) posterior probabilities using word lattices.
Formal gradient of MPE
I A(W, W_r): accuracy (negated error) between string W and W_r
I example (MPE): approximate phone accuracy [Povey & Woodland 2002]
I expectation of accuracy:

  E_θ[A(·, W_r)] := Σ_W A(W, W_r) · p(W) p_θ(X_r|W) / Σ_{W'} p(W') p_θ(X_r|W')

I MPE training criterion:

  F_MPE(θ) = Σ_r E_θ[A(·, W_r)]
Derivative of MPE w.r.t. Parameter
Derivative of the MPE criterion, using ∇_θ p_θ(X_r|W) = p_θ(X_r|W) · ( ∇_θ log p_θ(X_r|W) ):

∇_θ F_MPE(θ) = Σ_r Σ_W ( A(W, W_r) − E_θ[A(·, W_r)] ) · ( ∇_θ log p_θ(X_r|W) ) · p(W) p_θ(X_r|W) / Σ_{W'} p(W') p_θ(X_r|W')

For efficient evaluation, consider the derivative of the acoustic model:

∇_θ log p_θ(X_r|W) = Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_rt|s) ) · Σ_{s_1^{T_r}: s_t=s} p_θ(X_r, s_1^{T_r}|W) / p_θ(X_r|W)
Derivative of MPE w.r.t. Parameter
Resubstitute the derivative of the acoustic model into the derivative of the MPE criterion:

∇_θ F_MPE(θ) = Σ_r Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_rt|s) ) · γ_rt(s)

with the general state accuracy:

γ_rt(s) = Σ_W ( A(W, W_r) − E_θ[A(·, W_r)] ) · Σ_{s_1^{T_r}: s_t=s} p(W) p_θ(X_r, s_1^{T_r}|W) / Σ_{W'} p(W') p_θ(X_r|W')

which can be computed efficiently, similar to the case of the general state occupancies.
Efficient Calculation of State Accuracy
In general:
I assumption: A(W, W_r) = Σ_{t=1}^{T_r} A(s_rt(W), s_rt(W_r))
I example: approximate phone accuracy [Povey & Woodland 2002]
I efficient calculation of the general state accuracy γ_rt(s): forward-backward accuracies on the trellis of a word lattice [Povey & Woodland 2002]
Word Posterior Accuracies
For the general state accuracy in Viterbi approximation we obtain:

γ_rt(s) = Σ_W ( A(W, W_r) − E_θ[A(·, W_r)] ) · p(W) p_θ(X_r|W) δ_{s, s_rt(W)} / p_θ(X_r)
        = Σ_ω δ_{s, s_rt(ω)} · Σ_{W: ω∈W} ( A(W, W_r) − E_θ[A(·, W_r)] ) · p(W) p_θ(X_r|W) / p_θ(X_r)
        = Σ_ω δ_{s, s_rt(ω)} · p(ω|X_r)

with the edge (or word in context) posterior accuracies

p(ω|X_r) = Σ_{W: ω∈W} ( A(W, W_r) − E_θ[A(·, W_r)] ) · p(W) p_θ(X_r|W) / p_θ(X_r)

Later, an efficient way of computing edge (word in context) posterior accuracies using word lattices will be presented.
Outline
Introduction
Training Criteria
Parameter Optimization
Efficient Calculation of Discriminative Statistics
  Forward/Backward Probabilities on Word Lattices
  Generalized FB Probabilities on WFSTs
Generalisation to Log-Linear Modeling
Convex Optimization
Incorporation of Margin Concept
Conclusions
Annex
Forward/Backward Probabilities on Word Lattices
I Let ω_s(W) and ω_e(W) be the first and last edge of a continuous edge sequence W on a word lattice.
I Assume that the lattice fully encodes the language model context:

  p(W(W)) = p(W = ω_1^N) = Π_{n=1}^N p(ω_n|ω_{n−1})

Let ω_ri and ω_rf be the initial and final edges of a word lattice for utterance r. Then define the following forward (Φ) and backward (Ψ) probabilities on initial and final partial edge sequences on the word lattice, respectively:

Φ(ω) = Σ_{W: ω_s(W)=ω_ri, ω_e(W)=ω} p(W) p_θ(x_r1^{t_e(ω)}|W)

Ψ(ω) = Σ_{W: ω_s(W)=ω, ω_e(W)=ω_rf} p(W) p_θ(x_r,t_s(ω)^{T_r}|W)
Forward/Backward Probabilities on Word Lattices
For the forward probability a recursion formula can be derived by separating the last edge from the edge sequence in the summation, with ≺ denoting direct predecessor edges:

Φ(ω) = Σ_{W: ω_s(W)=ω_ri, ω_e(W)=ω} p(W) p_θ(x_r1^{t_e(ω)}|W)
     = Σ_{ω'≺ω} Σ_{W': ω_s(W')=ω_ri, ω_e(W')=ω'} p(W') p(ω|ω') p_θ(x_r1^{t_e(ω')}|W') p_θ(x_r,t_s(ω)^{t_e(ω)}|ω)
     = Σ_{ω'≺ω} Φ(ω') p(ω|ω') p_θ(x_r,t_s(ω)^{t_e(ω)}|ω).

Using this recursion formula, the forward probabilities can be calculated efficiently on word lattices.
Forward/Backward Probabilities on Word Lattices
Similar to the forward probabilities, a recursion formula can be derived for efficient calculation of the backward probabilities, with ≻ denoting direct successor edges:

Ψ(ω) = Σ_{W: ω_s(W)≻ω, ω_e(W)=ω_rf} p(W) p_θ(x_r,t_s(ω)^{T_r}|ωW)
     = Σ_{ω'≻ω} Σ_{W': ω_s(W')≻ω', ω_e(W')=ω_rf} p(ω'|ω) p(W') p_θ(x_r,t_s(ω)^{t_e(ω)}|ω) p_θ(x_r,t_s(ω')^{T_r}|ω'W')
     = Σ_{ω'≻ω} p_θ(x_r,t_s(ω)^{t_e(ω)}|ω) p(ω'|ω) Ψ(ω')
Forward/Backward Probabilities on Word Lattices
Using the forward and backward probabilities, the edge/word posterior on a word lattice can be written as

p(ω|X_r) = Φ(ω) · Σ_{ω'≻ω} p(ω'|ω) Ψ(ω') / Φ(ω_rf)

with p_θ(X_r) = Φ(ω_rf) = Ψ(ω_ri).

Word posterior probabilities follow naturally from MPE and similar discriminative training criteria. They are also the basis for confidence measures, which are used for unsupervised training, adaptation, or dialog systems, and they are part of approximate approaches to Bayes' decision rule with word error cost, like confusion networks [Mangu+ 1999] or minimum frame word error [Wessel+ 2001a, Hoffmeister+ 2006].
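The two recursions and the posterior formula can be sketched on a toy lattice; the edge names, acoustic scores, and LM values below are illustrative, and a brute-force path enumeration verifies the forward-backward result:

```python
from collections import defaultdict

# Toy lattice: each edge carries an acoustic score p(x_{ts..te}|edge);
# "ri"/"rf" are dummy initial/final edges with score 1. All values invented.
edges = {"ri": 1.0, "a": 0.6, "b": 0.3, "c": 0.5, "d": 0.2, "rf": 1.0}
succ = {"ri": ["a", "b"], "a": ["c", "d"], "b": ["c"],
        "c": ["rf"], "d": ["rf"], "rf": []}
lm = defaultdict(lambda: 1.0)              # bigram edge LM p(edge'|edge)
lm[("ri", "a")] = 0.7; lm[("ri", "b")] = 0.3

pred = defaultdict(list)
for e, ss in succ.items():
    for s in ss:
        pred[s].append(e)

topo = ["ri", "a", "b", "c", "d", "rf"]    # topological order of edges

# forward: Phi(w) = sum_{w' < w} Phi(w') * p(w|w') * ac(w)
Phi = {"ri": edges["ri"]}
for w in topo[1:]:
    Phi[w] = edges[w] * sum(Phi[p] * lm[(p, w)] for p in pred[w])

# backward: Psi(w) = ac(w) * sum_{w' > w} p(w'|w) * Psi(w')
Psi = {"rf": edges["rf"]}
for w in reversed(topo[:-1]):
    Psi[w] = edges[w] * sum(lm[(w, s)] * Psi[s] for s in succ[w])

total = Phi["rf"]                          # p(X_r) = Phi(w_rf) = Psi(w_ri)
assert abs(total - Psi["ri"]) < 1e-12

def posterior(w):                          # edge posterior p(w|X_r)
    if w == "rf":
        return Phi["rf"] / total
    return Phi[w] * sum(lm[(w, s)] * Psi[s] for s in succ[w]) / total

def paths(w):                              # brute force: enumerate all paths
    if w == "rf":
        return [(["rf"], edges["rf"])]
    return [([w] + p, edges[w] * lm[(w, s)] * sc)
            for s in succ[w] for p, sc in paths(s)]

all_paths = paths("ri")
for w in topo:                             # posterior = mass of paths through w
    brute = sum(sc for p, sc in all_paths if w in p) / total
    assert abs(posterior(w) - brute) < 1e-12
```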
FB Probabilities: Generalization to WFSTs
Replace the word lattice with a WFST
I edge label: word with pronunciation
I weight of edge ω: p ← p(ω|ω') · p_θ(x_r,t_s(ω)^{t_e(ω)}|ω)
I semiring: substitute the arithmetic operations (multiplication, addition, inversion) with the operations of the probability semiring

  Semiring      IK    p ⊕ p'   p ⊗ p'   0   1   inv(p)
  probability   IR+   p + p'   p · p'   0   1   1/p
Example WFST from SieTill, Wr =”drei sechs neun” (in red)
[WFST figure: states 0-9 with edges labeled word/pronunciation/score, e.g. "drei /3//−384", "sechs /6//−1014", "[SIL] /si//−2618"; the reference path is marked in red]
FB Probabilities: Generalization to WFSTs
Forward probabilities (pre(ω) ∈ W such that pre(ω) ≺ ω):

Φ(ω) := ⊕_{W: ω_s(W)=ω_ri, ω_e(W)=ω} ⊗_{ω∈W} p(ω|pre(ω)) ⊗ p_θ(x_r,t_s(ω)^{t_e(ω)}|ω)
      = ⊕_{ω'≺ω} Φ(ω') ⊗ p(ω|ω') ⊗ p_θ(x_r,t_s(ω)^{t_e(ω)}|ω)

Backward probabilities: similar

Using the forward and backward probabilities, the edge posterior on a WFST can be written as

p(ω|X_r) = Φ(ω) ⊗ ( ⊕_{ω'≻ω} p(ω'|ω) ⊗ Ψ(ω') ) ⊗ inv(Φ(ω_rf))
Expectation Semiring
vector weight (p, v) of edge ω with
I p ← p(ω|ω') · p_θ(x_r,t_s(ω)^{t_e(ω)}|ω)
I v ← A(ω) · p
I edge accuracies A(ω) such that Σ_{ω∈W} A(ω) = A(W, W_r)
  I approximate phone accuracy [Povey & Woodland 2002] can be decomposed in this way
  I such a decomposition is not possible in general

expectation semiring [Eisner 2001]: vector semiring whose first component is a probability semiring

  Semiring      IK         (p,v) ⊕ (p',v')   (p,v) ⊗ (p',v')       0       1       inv(p,v)
  expectation   IR+ × IR   (p+p', v+v')      (p·p', p·v' + p'·v)   (0,0)   (1,0)   (1/p, −v/p²)
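A minimal sketch of the semiring operations (path probabilities and accuracies below are invented for illustration): ⊗-multiplying edge weights (p, A·p) along a path accumulates the path accuracy in the v-component, and ⊕-summing over paths yields the unnormalized expected accuracy:

```python
# expectation semiring operations on weight pairs (p, v)
def e_add(a, b):                           # (p,v) ⊕ (p',v')
    return (a[0] + b[0], a[1] + b[1])

def e_mul(a, b):                           # (p,v) ⊗ (p',v')
    return (a[0] * b[0], a[0] * b[1] + b[0] * a[1])

def edge(p, acc):                          # edge weight (p, A(edge) * p)
    return (p, acc * p)

# two competing 2-edge paths with probabilities 0.6 and 0.4
path1 = e_mul(edge(0.6, 1.0), edge(1.0, 1.0))   # path accuracy 1 + 1 = 2
path2 = e_mul(edge(0.4, 0.0), edge(1.0, 1.0))   # path accuracy 0 + 1 = 1
total = e_add(path1, path2)

# first component: total probability; second: sum_paths p(path) * A(path)
p_tot, v_tot = total
expected_accuracy = v_tot / p_tot
assert abs(p_tot - 1.0) < 1e-12
assert abs(expected_accuracy - (0.6 * 2 + 0.4 * 1)) < 1e-12
```

The ⊗-rule (p·p', p·v' + p'·v) is exactly what makes the v-component behave like a probability-weighted sum of per-edge accuracies.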
Edge Posteriors & Expectation Semiring
probability semiring
I word posterior probabilities (see MMI derivative) identical to edge posteriors using the probability semiring:

  p(ω|X_r) = p_probability(ω|X_r)

I intuitive and classical result [Rabiner 1989]

expectation semiring
I word posterior accuracies (see MPE derivative) identical to the v-component of the edge posteriors using the expectation semiring [Heigold+ 2008b]:

  p(ω|X_r) = p_expectation,v(ω|X_r)

I also use this identity to efficiently calculate
  I the derivative of the unified training criterion
  I the covariance between two random additive variables (related to the MPE derivative)
Outline
Introduction
Training Criteria
Parameter Optimization
Efficient Calculation of Discriminative Statistics
Generalisation to Log-Linear Modeling
  Transformation: Gaussian into Log-Linear Model
  Transformation from Log-Linear Model into Gaussian
Convex Optimization
Incorporation of Margin Concept
Conclusions
Annex
Definition of Models

assume feature vector x ∈ IR^D and class c ∈ {1, ..., C}

Gaussian model N(x|μ_c, Σ_c) with
I means μ_c ∈ IR^D
I positive-definite covariance matrices Σ_c ∈ IR^{D×D}
I priors p(c) ∈ IR+
induces the posterior

  p_θ(c|x) = p(c) N(x|μ_c, Σ_c) / Σ_{c'} p(c') N(x|μ_c', Σ_c')

Log-linear model with unconstrained parameters
I λ_c0 ∈ IR
I λ_c1 ∈ IR^D
I λ_c2 ∈ IR^{D×D}
with posterior

  exp( x^T λ_c2 x + λ_c1^T x + λ_c0 ) / Σ_{c'} exp( x^T λ_c'2 x + λ_c'1^T x + λ_c'0 )
Transformation: Gaussian into Log-Linear Model

Comparison of the terms quadratic, linear, and constant in the observations x leads to the transformation rules [Saul & Lee 2002, Gunawardana+ 2005]:

1. λ_c2 = −(1/2) Σ_c^{-1}
2. λ_c1 = Σ_c^{-1} μ_c
3. λ_c0 = −(1/2) ( μ_c^T Σ_c^{-1} μ_c + log |2πΣ_c| ) + log p(c)
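The rules can be checked numerically. The following sketch uses a toy 2-D, 3-class problem with randomly drawn parameters (all values illustrative) and verifies that the Gaussian and log-linear posteriors coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 2, 3
mu = rng.normal(size=(C, D))
Sigma = []
for _ in range(C):
    A = rng.normal(size=(D, D))
    Sigma.append(A @ A.T + D * np.eye(D))   # positive-definite covariance
prior = np.array([0.5, 0.3, 0.2])

def gauss_log_posterior(x):
    # log p(c|x) from priors and Gaussian class-conditional densities
    logp = np.array([np.log(prior[c])
                     - 0.5 * (x - mu[c]) @ np.linalg.solve(Sigma[c], x - mu[c])
                     - 0.5 * np.log(np.linalg.det(2 * np.pi * Sigma[c]))
                     for c in range(C)])
    return logp - np.logaddexp.reduce(logp)  # normalize over classes

# transformation rules 1.-3. from the slide
lam2 = [-0.5 * np.linalg.inv(Sigma[c]) for c in range(C)]
lam1 = [np.linalg.inv(Sigma[c]) @ mu[c] for c in range(C)]
lam0 = [-0.5 * (mu[c] @ np.linalg.inv(Sigma[c]) @ mu[c]
                + np.log(np.linalg.det(2 * np.pi * Sigma[c])))
        + np.log(prior[c]) for c in range(C)]

def loglinear_log_posterior(x):
    score = np.array([x @ lam2[c] @ x + lam1[c] @ x + lam0[c]
                      for c in range(C)])
    return score - np.logaddexp.reduce(score)

x = rng.normal(size=D)
assert np.allclose(gauss_log_posterior(x), loglinear_log_posterior(x))
```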
Transformation: Log-Linear into Gaussian Model
I invert the transformation from Gaussian to log-linear model:
  1. Σ_c = −(1/2) λ_c2^{-1}
  2. μ_c = Σ_c λ_c1
  3. p(c) = exp( λ_c0 + (1/2) ( μ_c^T Σ_c^{-1} μ_c + log |2πΣ_c| ) )
I problem: the parameter constraints are not satisfied in general
  I covariance matrices Σ_c must be positive-definite
  I priors p(c) must be normalized
I solution: the model parameters for the posterior are ambiguous, e.g. for ∆λ_2 ∈ IR^{D×D}, ∆λ_0 ∈ IR:

  exp( x^T(λ_c2 + ∆λ_2)x + λ_c1^T x + (λ_c0 + ∆λ_0) ) / Σ_{c'} exp( x^T(λ_c'2 + ∆λ_2)x + λ_c'1^T x + (λ_c'0 + ∆λ_0) )
    = exp( x^T λ_c2 x + λ_c1^T x + λ_c0 ) / Σ_{c'} exp( x^T λ_c'2 x + λ_c'1^T x + λ_c'0 )
Transformation: Log-Linear into Gaussian Model

Invert the transformation rules for the transformed log-linear model:

1. Σ_c = −(1/2) (λ_c2 + ∆λ_2)^{-1}
2. μ_c = Σ_c λ_c1
3. p(c) = exp( (λ_c0 + ∆λ_0) + (1/2) ( μ_c^T Σ_c^{-1} μ_c + log |2πΣ_c| ) )

Use the additional degrees of freedom to impose the parameter constraints:

I choose ∆λ_2 ∈ IR^{D×D} such that the λ_c2 + ∆λ_2 are negative-definite
I choose ∆λ_0 such that p(c) is normalized, i.e.,

  ∆λ_0 := − log Σ_c exp( λ_c0 + (1/2) ( μ_c^T Σ_c^{-1} μ_c + log |2πΣ_c| ) )
Outline
Introduction
Training Criteria
Parameter Optimization
Efficient Calculation of Discriminative Statistics
Generalisation to Log-Linear Modeling
Convex Optimization
  Motivation
  Convex Training Criteria in Speech Recognition
  Experimental Results
Incorporation of Margin Concept
Conclusions
Annex
Motivation
Conventional approach:
I depends on initialization and choice of optimization algorithm
I spurious local optima (non-convex training criterion)
I many heuristics required
I i.e., involves much engineering work
“Fool-proof” approach:
I unique optimum (independent of initialization)
I accessibility of global optimum (convex training criterion)
I joint optimization of all model parameters, no parameters tobe tuned
Assumptions
Assumptions to cast the HCRF into a CRF:
I log-linear parameterization, e.g. p(x|s) = exp( x^T λ_s2 x + λ_s1^T x + λ_s0 ) and p(s|s') = exp(α_s's)
I MMI-like training criterion
I alignment represents the spoken sequence
I alignment of the spoken sequence known and kept fixed
I use single densities with augmented features instead of mixtures
I exact normalization constant
Lattice-Based MMI

F_lattice(λ) = Σ_r log( Σ_{s_1^{T_r} ∈ N_r} p_λ(x_1^{T_r}, s_1^{T_r}) / Σ_{s_1^{T_r} ∈ D_r} p_λ(x_1^{T_r}, s_1^{T_r}) )

I numerator word lattice N_r: state sequences s_1^T representing the correct hypothesis
I denominator word lattice D_r: correct and competing state sequences; use word pair approximation and pruning
I non-convex

[figure: word lattice with time-stamped nodes and word arcs such as "fünf", "eins", "sechs", "acht", "zwei", "neun", "drei", "[SIL]"]
Fool-Proof MMI

F_fool(λ) = Σ_r log( p_λ(x_1^{T_r}, s_1^{T_r}) / Σ_{s_1^{T_r} ∈ S_r} p_λ(x_1^{T_r}, s_1^{T_r}) )

I consider only the best state sequence s_1^T in the numerator, kept fixed
I sum over the full state sequence network S_r in the denominator
I convex

[figure: part of a (pruned) HMM state network, with ε-transitions and arcs labeled state:word, e.g. "7:acht", "18:acht"]
Frame-Based MMI

F_frame(λ) = Σ_r Σ_{t=1}^{T_r} log( p_λ(x_t, s_t) / Σ_{s=1}^S p_λ(x_t, s) )

I frame discrimination, cf. hybrid approach
I assume alignment s_1^T for the numerator, kept fixed
I summation over all HMM states s ∈ {1, ..., S} in the denominator
I convex
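With a log-linear state model p_λ(x, s) ∝ exp(λ_s^T f(x)), the frame-based criterion is exactly the multinomial-logistic log-likelihood over states, hence concave in λ. A toy numerical check of the midpoint condition (features, alignments, and dimensions below are illustrative):

```python
import math, random

random.seed(0)
frames = [(0.5, 0), (-1.2, 2), (0.3, 1), (2.0, 0)]  # (x_t, aligned state s_t)

def feat(x):
    return [1.0, x]              # f(x): bias + first-order feature

def f_frame(lam):                # frame-based criterion, to be maximized
    total = 0.0
    for x, s in frames:
        f = feat(x)
        scores = [sum(l * fi for l, fi in zip(lam[c], f)) for c in range(3)]
        m = max(scores)          # stable log-sum-exp over all states
        log_z = m + math.log(sum(math.exp(sc - m) for sc in scores))
        total += scores[s] - log_z
    return total

# concavity check: F((a+b)/2) >= (F(a) + F(b)) / 2 for random parameter pairs
for _ in range(100):
    a = [[random.gauss(0, 2) for _ in range(2)] for _ in range(3)]
    b = [[random.gauss(0, 2) for _ in range(2)] for _ in range(3)]
    mid = [[(ai + bi) / 2 for ai, bi in zip(ar, br)] for ar, br in zip(a, b)]
    assert f_frame(mid) >= (f_frame(a) + f_frame(b)) / 2 - 1e-9
```

Each summand is linear-minus-log-sum-exp in λ, and log-sum-exp is convex, which is all the concavity argument needs.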
Refinements to MMI
These refinements do not break convexity:
I ℓ2-regularization
I margin term
Initialization

Analyze the effect of the initial parameters on training:
I vary the initialization for different training criteria
I experiments: digit strings (SieTill, German, telephone)

[figure: training criterion F and WER [%] over 300 iterations for frame-based, lattice-based, and fool-proof M-MMI, initialized from scratch, from ML, and from the frame-based result]
Read Speech (WSJ)

I 5k-vocabulary, trigram language model
I phone-based HMMs, 1,500 CART-tied triphones
I audio data: 15h (training), 0.4h (test)
I log-linear model with kernel-like features f(x)
  I first-order (f_d(x) = x_d) and second-order (f_dd'(x) = x_d · x_d') features
  I cluster features: assume a GMM of the marginal distribution, p(x) = Σ_l p(x, l):

    f_l(x) = p(l|x) if p(l|x) ≥ threshold, 0 otherwise

I starting from scratch (model) and linear segmentation
I frame-based MMI, with re-alignments
I details: [Wiesler+ 2009]
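The cluster features can be sketched as follows; the mixture parameters and threshold are illustrative assumptions, not the values used in [Wiesler+ 2009]:

```python
import math

weights = [0.5, 0.3, 0.2]      # mixture weights p(l), illustrative
means = [-1.0, 0.5, 2.0]       # 1-D component means, unit variances
THRESH = 0.1                   # posterior threshold, illustrative

def cluster_features(x):
    # posteriors p(l|x) of the marginal GMM p(x) = sum_l p(l) N(x; mu_l, 1)
    joint = [w * math.exp(-0.5 * (x - m) ** 2) for w, m in zip(weights, means)]
    z = sum(joint)
    post = [j / z for j in joint]
    # keep only confident cluster posteriors, zero out the rest
    return [p if p >= THRESH else 0.0 for p in post]

f = cluster_features(0.4)
assert len(f) == len(weights)
assert all(p == 0.0 or p >= THRESH for p in f)
```

The thresholding makes the feature vector sparse, which keeps the log-linear model tractable when many cluster features are added.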
Read Speech (WSJ)

Feature setup                                              WER [%]
First order features, monophones                           22.7
+ second order features                                    10.3
+ 210 cluster features + temporal context of size 9         6.2
+ 1,500 CART-tied HMM states (triphones)                    3.9
+ realignment                                               3.6
GHMM (ML)                                                   3.6
GHMM (MMI)                                                  3.0
Outline

Introduction
Training Criteria
Parameter Optimization
Efficient Calculation of Discriminative Statistics
Generalisation to Log-Linear Modeling
Convex Optimization
Incorporation of Margin Concept
  Motivation
  Support Vector Machines (Hinge Loss)
  Smooth Approximation to SVM: Margin-MMI
  Support Vector Machines (Margin Error)
  Smooth Approximation to SVM: Margin-MPE
  Experimental Evaluation of Margin
Conclusions
Annex
Motivation
Goal: incorporation of a margin term into conventional training criteria
I replace likelihoods p(W)p(X|W) with margin-likelihoods p(W)p(X|W) exp(−ρA(W, W_r))
I A(W, W_r): accuracy between hypothesis W and reference W_r
I interpretation (boosting): emphasize incorrect hypotheses by up-weighting
I interpretation (large margin): next slides

[diagram: margin-based training in the literature vs. LVCSR — low complexity tasks, different loss functions, convergence/local optima, different parameterization/optimization]

Margin in training is promising. Individual contribution of margin in LVCSR training?
Support Vector Machines (Hinge Loss)

Optimization problem for SVMs:

SVM(λ) = −(C/2) ‖λ‖² − Σ_{r=1}^R l(W_r, d_r; ρ)

I feature functions f(X, W), model parameters λ
I distance d_rW := λ^T ( f(X_r, W_r) − f(X_r, W) )
I hinge loss function:

  l^(hinge)(W_r, d_r; ρ) := max_{W≠W_r} { max { −d_rW + ρ(A(W_r, W_r) − A(W, W_r)), 0 } }

I ℓ2-regularization with constant C > 0
I [Altun+ 2003, Taskar+ 2003]
Smooth Approximation to SVM: Margin-MMI

Margin-based/modified MMI (M-MMI):

F_M-MMI,γ(λ) = −(C/2) ‖λ‖² + Σ_{r=1}^R (1/γ) log( exp(γ(λ^T f(X_r, W_r) − ρA(W_r, W_r))) / Σ_W exp(γ(λ^T f(X_r, W) − ρA(W, W_r))) )

Lemma: F_M-MMI,γ → SVM_hinge for γ → ∞ (pointwise convergence).
I [Heigold+ 2008b]
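The lemma can be illustrated numerically on a single utterance with a handful of hypotheses (scores and accuracies below are made up): as γ grows, the smoothed per-utterance loss term −(1/γ) log(·) approaches the hinge loss:

```python
import math

score = {"ref": 2.0, "h1": 1.5, "h2": 0.5}  # lambda^T f(X, W), illustrative
acc = {"ref": 3.0, "h1": 2.0, "h2": 1.0}    # A(W, W_ref), illustrative
rho = 1.0

def mmi_loss(gamma):
    # -(1/gamma) * log( exp(gamma * a_ref) / sum_W exp(gamma * a_W) )
    a = {W: score[W] - rho * acc[W] for W in score}
    log_den = max(a.values()) * gamma       # stable log-sum-exp
    log_den += math.log(sum(math.exp(gamma * a[W] - log_den) for W in a))
    return -(gamma * a["ref"] - log_den) / gamma

def hinge():
    # max over competitors of max{ -d_rW + rho * (A_ref - A_W), 0 }
    losses = [-(score["ref"] - score[W]) + rho * (acc["ref"] - acc[W])
              for W in score if W != "ref"]
    return max(max(losses), 0.0)

# the gap to the hinge loss shrinks as gamma grows
assert abs(mmi_loss(10.0) - hinge()) > abs(mmi_loss(1000.0) - hinge())
assert abs(mmi_loss(1000.0) - hinge()) < 1e-2
```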
Proof

With ∆A(W, W_r) := A(W_r, W_r) − A(W, W_r):

−(1/γ) log( exp(γ(λ^T f(X_r, W_r) − ρA(W_r, W_r))) / Σ_W exp(γ(λ^T f(X_r, W) − ρA(W, W_r))) )

  = (1/γ) log( 1 + Σ_{W≠W_r} exp(γ(−d_rW + ρ∆A(W, W_r))) )

  → (γ→∞)  max_{W≠W_r} { −d_rW + ρ∆A(W, W_r) }   if ∃ W ≠ W_r: d_rW < ρ∆A(W, W_r)
            0                                     otherwise

  = max_{W≠W_r} { max{ −d_rW + ρ∆A(W, W_r), 0 } }

  =: l^(hinge)(W_r, d_r; ρ).
Support Vector Machines (Margin Error)

Optimization problem for SVMs:

SVM(λ) = −(C/2) ‖λ‖² − Σ_{r=1}^R l(W_r, d_r; ρ)

I feature functions f(X, W), model parameters λ
I distance d_rW := λ^T ( f(X_r, W_r) − f(X_r, W) )
I margin error loss function:

  l^(error)(W_r, d_r; ρ) := E( A( argmin_W [ d_rW + ρA(W, W_r) ], W_r ) )

I ℓ2-regularization with constant C > 0
I [Heigold+ 2008b]
Smooth Approximation to SVM: Margin-MPE

Margin-based/modified MPE (M-MPE):

F_M-MPE,γ(λ) = −(C/2) ‖λ‖² + Σ_{r=1}^R Σ_W E(W, W_r) · ( exp(γ(λ^T f(X_r, W) − ρA(W, W_r))) / Σ_V exp(γ(λ^T f(X_r, V) − ρA(V, W_r))) )

Lemma: F_M-MPE,γ → SVM_error for γ → ∞.
I [Heigold+ 2008b]
Experimental Evaluation of Margin

Digit strings (SieTill, German, telephone)

dns/state   feature orders          # param   criterion   WER [%]
1           first                   11k       ML          3.8
                                              MMI         2.9
                                              M-MMI       2.7
64          first                   690k      ML          1.8
                                              MMI         1.8
                                              M-MMI       1.6
1           first, second,          1,409k    Frame       1.8
            and third                         MMI         1.7
                                              M-MMI       1.5
Experimental Evaluation of Margin

European parliament plenary sessions in English (EPPS) and Mandarin broadcasts

                                      WER [%]
            EPPS En   Mandarin BN/BC
Criterion   90h       230h     1500h
ML          12.0      21.9     17.9
MMI                   20.8
M-MMI                 20.6
MPE         11.5      20.6     16.5
M-MPE       11.3      20.3     16.3
Experimental Evaluation of Margin

Handwriting Recognition (IFN/ENIT)
I isolated town names, handwritten
I choose slice features to use 1-D HMMs
I details: see [Dreuw+ 2009]

                       WER [%]
Criterion   abc-d   abd-c   acd-b   bcd-a   abcd-e
ML          7.8     8.7     7.8     8.7     16.8
MMI         7.4     8.2     7.6     8.4     16.4
M-MMI       6.1     6.8     6.1     7.0     15.4
Outline
Introduction
Training Criteria
Parameter Optimization
Efficient Calculation of Discriminative Statistics
Generalisation to Log-Linear Modeling
Convex Optimization
Incorporation of Margin Concept
Conclusions
  Effective Discriminative Training
Annex
Effective Discriminative Training
I Discriminative Criteria
  I fit decision rule: minimize training error
  I limit overfitting: include regularization and margin to exploit remaining degrees of freedom of the parameters
I Optimization Methods
  I general purpose methods give robust estimates
  I in the convex case, gradient descent still faster than growth transform (GIS)
I Log-Linear Modeling
  I convex (w/o hidden variables)
  I covers Gaussians completely, w/o constraints on e.g. variances
  I opens modeling up to arbitrary features
  I initialization: from scratch or from Gaussians
I Estimation of Statistics
  I efficiency: use word lattices to represent competing word sequences
  I implementation: generic approach using WFSTs, covers a class of criteria
Thanks for your attention!
Outline
Introduction
Training Criteria
Parameter Optimization
Efficient Calculation of Discriminative Statistics
Generalisation to Log-Linear Modeling
Convex Optimization
Incorporation of Margin Concept
Conclusions
Annex
  References
  Speech Tasks: Corpus Statistics & Setups
  Handwriting Recognition Tasks
References
Y. Altun, I. Tsochantaridis, and T. Hofmann, “Hidden Markov support vector machines,” in International Conference on Machine Learning (ICML), Washington, DC, USA, 2003.
L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, “Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition,” in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 1, pp. 49–52, Tokyo, Japan, May 1986.
P. F. Brown, The Acoustic-Modeling Problem in Automatic Speech Recognition, Ph.D. thesis, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, May 1987.
W. Chou, B.-H. Juang, and C.-H. Lee, “Segmental GPD training of HMM based speech recognizer,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 473–476, San Francisco, CA, USA, March 1992.
Y.-L. Chow, “Maximum Mutual Information Estimation of HMM Parameters for Continuous Speech Recognition using the N-best Algorithm,” in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pp. 701–704, Albuquerque, NM, April 1990.
P. Dreuw, G. Heigold, and H. Ney, “Confidence-based discriminative training for model adaptation in offline Arabic handwriting recognition,” in International Conference on Document Analysis and Recognition (ICDAR), Barcelona, Spain, July 2009.
J. Eisner, “Expectation semirings: Flexible EM for finite-state transducers,” in International Workshop on Finite-State Methods and Natural Language Processing (FSMNLP), Helsinki, Finland, August 2001.
P. S. Gopalakrishnan, D. Kanevsky, A. Nadas, and D. Nahamoo, “An Inequality for Rational Functions with Applications to Some Statistical Estimation Problems,” IEEE Transactions on Information Theory, Vol. 37, No. 1, pp. 107–113, January 1991.
A. Gunawardana, M. Mahajan, A. Acero, and J. Platt, “Hidden conditional random fields for phone classification,” in Interspeech, pp. 117–120, Lisbon, Portugal, September 2005.
G. Heigold, W. Macherey, R. Schluter, and H. Ney, “Minimum Exact Word Error Training,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 186–190, San Juan, Puerto Rico, November 2005.
G. Heigold, T. Deselaers, R. Schluter, and H. Ney, “GIS-like Estimation of Log-Linear Models with Hidden Variables,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4045–4048, Las Vegas, NV, USA, April 2008.
G. Heigold, T. Deselaers, R. Schluter, and H. Ney, “Modified MMI/MPE: A direct evaluation of the margin in speech recognition,” in International Conference on Machine Learning (ICML), pp. 384–391, Helsinki, Finland, July 2008.
G. Heigold, A Log-Linear Modeling Framework for Speech Recognition, Doctoral Thesis to be submitted, RWTH Aachen University, Aachen, Germany, 2010.
B. Hoffmeister, T. Klein, R. Schluter, and H. Ney, “Frame Based System Combination and a Comparison with Weighted ROVER and CNC,” in Proc. Interspeech, pp. 537–540, Pittsburgh, PA, USA, September 2006.
B.-H. Juang and S. Katagiri, “Discriminative learning for minimum error classification,” IEEE Transactions on Signal Processing, Vol. 40, No. 12, pp. 3043–3054, 1992.
D. Kanevsky, “Extended Baum-Welch transformations for general functions,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 821–824, Montreal, Quebec, Canada, May 2004.
W. Macherey, L. Haferkamp, R. Schluter, and H. Ney, “Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition,” in Proc. European Conference on Speech Communication and Technology (Interspeech), Lisbon, Portugal, September 2005.
W. Macherey, Discriminative Training and Acoustic Modeling for Automatic Speech Recognition, Ph.D. thesis to be submitted, RWTH Aachen University, 2010.
M. Mahajan, A. Gunawardana, and A. Acero, “Training algorithms for hidden conditional random fields,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, May 2006.
L. Mangu, E. Brill, and A. Stolcke, “Finding Consensus Among Words: Lattice-Based Word Error Minimization,” in Proc. European Conference on Speech Communication and Technology (EUROSPEECH), pp. 495–498, Budapest, Hungary, September 1999.
E. McDermott and S. Katagiri, “Minimum Classification Error for Large Scale Speech Recognition Tasks using Weighted Finite State Transducers,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, USA, April 2005.
E. McDermott, T. Hazen, J. Le Roux, A. Nakamura, and S. Katagiri, “Discriminative training for large vocabulary speech recognition using Minimum Classification Error,” IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 1, pp. 203–223, 2007.
Y. Normandin, Hidden Markov Models, Maximum Mutual Information, and the Speech Recognition Problem, Ph.D. thesis, McGill University, Montreal, Canada, 1991.
D. Povey and P. C. Woodland, “Minimum phone error and I-smoothing for improved discriminative training,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 1, pp. 105–108, Orlando, FL, May 2002.
L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, Vol. 77, No. 2, pp. 257–286, 1989.
M. Riedmiller and H. Braun, “A direct adaptive method for faster backpropagation learning: The RPROP algorithm,” in IEEE International Conference on Neural Networks (ICNN), pp. 586–591, San Francisco, CA, USA, 1993.
L. Saul and D. Lee, “Multiplicative updates for classification by mixture models,” in T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems (NIPS), MIT Press, 2002.
R. Schluter, B. Muller, F. Wessel, and H. Ney, “Interdependence of Language Models and Discriminative Training,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Vol. 1, pp. 119–122, Keystone, CO, December 1999.
R. Schluter, Investigations on Discriminative Training Criteria, Doctoral Thesis, RWTH Aachen University, Aachen, Germany, September 2000.
R. Schluter and H. Ney, “Model-based MCE Bound to the True Bayes’ Error,” IEEE Signal Processing Letters, Vol. 8, No. 5, pp. 131–133, May 2001.
B. Taskar, C. Guestrin, and D. Koller, “Max-margin Markov networks,” in Advances in Neural Information Processing Systems (NIPS), 2003.
V. Valtchev, Discriminative Methods in HMM-based Speech Recognition, Ph.D. thesis, St. John’s College, University of Cambridge, Cambridge, March 1995.
V. Valtchev, J. J. Odell, P. C. Woodland, and S. J. Young, “MMIE Training of Large Vocabulary Recognition Systems,” Speech Communication, Vol. 22, No. 4, pp. 303–314, September 1997.
F. Wessel, K. Macherey, and R. Schluter, “Using word probabilities as confidence measures,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 225–228, Seattle, WA, USA, May 1998.
F. Wessel, R. Schluter, and H. Ney, “Explicit Word Error Minimization using Word Hypothesis Posterior Probabilities,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 33–36, Salt Lake City, UT, May 2001.
S. Wiesler et al., “Investigations on features for log-linear acoustic models in continuous speech recognition,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Merano, Italy, December 2009.
P. C. Woodland and D. Povey, “Large scale discriminative training for speech recognition,” in Automatic Speech Recognition (ASR) 2000, pp. 7–16, Paris, France, September 2000.
P. C. Woodland and D. Povey, “Large Scale Discriminative Training of Hidden Markov Models for Speech Recognition,” Computer Speech and Language, Vol. 16, No. 1, pp. 25–48, 2002.
Speech Tasks: Corpus Statistics & Setups
Identifier       Description                     Train/Test [h]
SieTill          German digit strings            11/11 (Test)
EPPS En          English European Parliament     92/2.9 (Evl07)
                 plenary speech
BNBC Cn 230h     Mandarin broadcasts             230/2.2 (Evl06)
BNBC Cn 1500h    Mandarin broadcasts             1,500/2.2 (Evl06)

Identifier       #States/#Dns    Features
SieTill          430/27k         25 LDA(MFCC)
EPPS En          4,500/830k      45 LDA(MFCC+voicing)+VTLN+SAT/CMLLR
BNBC Cn 230h     4,500/1,100k    45 LDA(MFCC)+3 tones+VTLN+SAT/CMLLR
BNBC Cn 1500h    4,500/1,200k    45 SAT/CMLLR(PLP+voicing+3 tones+32 NN)+VTLN
Handwriting Recognition Tasks
IFN/ENIT:
I isolated Tunisian town names
I 4 training folds + 1 additional fold for testing
I simple appearance-based image slice features
I each fold comprises approximately 500,000 frames
                       #Observations [k]
Corpus       Fold      Towns    Frames
IFN/ENIT     a         6.5      452
             b         6.7      459
             c         6.5      452
             d         6.7      451
             e         6.0      404
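The appearance-based image slice features mentioned above can be pictured as a sliding window over the pixel columns of the word image: each horizontal position yields one frame, namely the stacked raw intensities of the columns in the window. The sketch below illustrates only the idea; the window width and the plain-list representation are illustrative assumptions, not the exact IFN/ENIT feature setup.

```python
def slice_features(image, window=1):
    """Turn a grayscale image (list of pixel rows) into a sequence of
    frame vectors, one per horizontal position: each frame stacks the
    pixel columns covered by a sliding window of the given width."""
    height = len(image)
    width = len(image[0])
    frames = []
    for x in range(width - window + 1):
        # concatenate the columns x .. x+window-1, top to bottom
        frame = [image[y][x + dx] for dx in range(window) for y in range(height)]
        frames.append(frame)
    return frames

# toy 3x4 "image": with window=1, one frame per pixel column
img = [[0, 1, 2, 3],
       [4, 5, 6, 7],
       [8, 9, 10, 11]]
print(slice_features(img))  # [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
```

In this way a two-dimensional image becomes a one-dimensional observation sequence, so the same HMM-based discriminative training machinery as in speech recognition applies directly.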