Introduction to Discriminative Training in Speech Recognition
Ralf Schlüter, Georg Heigold
Lehrstuhl für Informatik 6, Human Language Technology and Pattern Recognition
Computer Science Department, RWTH Aachen University, D-52056 Aachen, Germany
January 14, 2010
Schluter: Introduction to Discriminative Training in Speech Recognition 1 January 14, 2010
Contents
Introduction
Training Criteria
Parameter Optimization
Efficient Calculation of Discriminative Statistics
Generalisation to Log-Linear Modeling
Convex Optimization
Incorporation of Margin Concept
Conclusions
Annex
Outline
Introduction
    Motivation
    Overview
Training Criteria
Parameter Optimization
Efficient Calculation of Discriminative Statistics
Generalisation to Log-Linear Modeling
Convex Optimization
Incorporation of Margin Concept
Conclusions
Annex
Motivation

Aim of discriminative methods: improve class separation

- standard maximum likelihood (ML) training: maximize the reference class conditional p_θ(x|c)
- maximum mutual information (MMI) training: maximize the reference class posterior

      p_θ(c|x) = p(c) · p_θ(x|c) / Σ_{c'} p(c') · p_θ(x|c')

Where's the difference?

- Ideally: (almost) no difference! With infinite training data and correct model assumptions, the true probabilities are obtained in both cases. They lead to equal decisions, provided the class prior p(c) is known. (Proof: model-free optimization.)
- ML training: classes are handled independently, so decision boundaries are not considered explicitly in training.
- In MMI training, and in discriminative training generally, the reference class competes directly against all other classes, so decision boundaries become relevant in training.
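As a numeric sketch of the contrast above (a hypothetical two-class toy example, not from the slides): the ML objective scores only the reference class conditional p_θ(x|c), while the MMI objective scores the posterior p_θ(c|x), normalized over all competing classes.

```python
import math

# Hypothetical toy values: class priors p(c) and class conditionals p_theta(x|c)
# for a single observation x.
prior = {"c1": 0.5, "c2": 0.5}
likelihood = {"c1": 0.6, "c2": 0.2}

def posterior(c, prior, likelihood):
    """Class posterior p_theta(c|x) = p(c) p_theta(x|c) / sum_c' p(c') p_theta(x|c')."""
    denom = sum(prior[cp] * likelihood[cp] for cp in prior)
    return prior[c] * likelihood[c] / denom

# ML maximizes log p_theta(x|c_ref); MMI maximizes log p_theta(c_ref|x),
# which also depends on how well the competitor c2 explains x.
ml_score = math.log(likelihood["c1"])
mmi_score = math.log(posterior("c1", prior, likelihood))
print(round(posterior("c1", prior, likelihood), 3))  # 0.75
```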
Motivation

- In practice, model assumptions are incorrect and training data is limited. Here discriminative training can be beneficial.

Example: a two-class problem (with pooled covariance matrix)

[Figure: two scatter plots of classes -1 and +1 in the (x, y) plane showing the ML and MMI decision boundaries; in the first panel the boundaries coincide (ML/MMI), in the second panel, which contains an outlier, ML and MMI differ.]

- Clearly, in the case of ML training, the outlier deteriorates the decision boundary, whereas MMI training registers the minor importance of the outlier.
- MMI captures the decision boundary, although the model assumption (pooled covariance) does not fit in the second case.
Overview
Questions:

- Which discriminative criterion to take?
- Relation to decision rule and evaluation measure?
- How to optimize the criterion?
- Efficiency?
- Influence of modeling?
- Uniqueness of the solution?
- Generalization?

Bottom line:

- How to utilize the available training material to obtain optimum recognition performance?
Outline

Introduction
Training Criteria
    Notation
    General Approach
    Probabilistic Training Criteria
    Error-Based Training Criteria
    Practical Issues
    Comparative Experimental Results
Parameter Optimization
Efficient Calculation of Discriminative Statistics
Generalisation to Log-Linear Modeling
Convex Optimization
Incorporation of Margin Concept
Conclusions
Annex
Notation
X_r          sequence x_{r,1}, x_{r,2}, ..., x_{r,T_r} of acoustic observation vectors
W_r          spoken word sequence w_{r,1}, w_{r,2}, ..., w_{r,N_r} in training utterance r
W            any word sequence
p(W)         language model probability, assumed to be given
p_θ(X_r|W)   acoustic emission probability / acoustic model
θ            set of all parameters of the acoustic model
M_r          set of competing word sequences to be considered
f            smoothing function
General Approach

Training

- input: training data and stochastic model p_θ(X,W) with free model parameters θ
- output: "optimal" model parameters θ
- optimality defined via a training criterion:

      θ := arg max_θ { F(θ) }

Unified training criterion [Macherey+ 2005]:

      F(θ) = Σ_{r=1}^{R} f( log[ Σ_W p(W) p_θ(X_r|W) · A(W,W_r) / Σ_{W∈M_r} p(W) p_θ(X_r|W) ] )

- covers maximum mutual information (MMI), minimum classification error (MCE), minimum phone/word error (MPE/MWE)
- control set M_r of competing hypotheses, cost function, smoothing function, scaling of models (not shown)
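A toy instance of the unified criterion may help (hypothetical hypothesis set and scores, f chosen as the identity): with A(W, W_r) set to 1 for the reference and 0 otherwise, the criterion reduces to the MMI criterion, i.e. the log posterior of the reference.

```python
import math

# Hypothetical hypotheses for one utterance r: language model prob p(W),
# acoustic score p_theta(X_r|W), and accuracy A(W, W_r) w.r.t. the reference.
hyps = {
    "the cat": {"lm": 0.4, "am": 0.5, "acc": 1.0},  # reference itself
    "a cat":   {"lm": 0.3, "am": 0.3, "acc": 0.5},
    "the bat": {"lm": 0.3, "am": 0.2, "acc": 0.5},
}

def unified_criterion(hyps, f=lambda x: x):
    """F = f( log( sum_W p(W) p(X|W) A(W,Wr) / sum_W p(W) p(X|W) ) )."""
    num = sum(h["lm"] * h["am"] * h["acc"] for h in hyps.values())
    den = sum(h["lm"] * h["am"] for h in hyps.values())
    return f(math.log(num / den))

# Specialize A(W, Wr) to a 0/1 indicator of the reference -> MMI criterion.
mmi_hyps = {w: {**h, "acc": 1.0 if w == "the cat" else 0.0} for w, h in hyps.items()}
print(unified_criterion(mmi_hyps))
```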
Probabilistic Training Criteria
Objective
- find a good estimate of the probability distribution
- optimality regarding error via Bayes decoding (asymptotic w.r.t. the amount of training data)
Maximum Likelihood (ML)

- optimization of the joint probability:

      arg max_θ Σ_r log( p(W_r) p_θ(X_r|W_r) ) = arg max_θ Σ_r log p_θ(X_r|W_r)

- Tutorial on HMMs [Rabiner 1989].
- Maximization of the probability of the reference word sequences (classes).
- Model correctness is important.
- HMM: maximization for each class separately.
- Neglects competing classes.
- Expectation-maximization: local convergence guaranteed.
- Estimation efficient, easily parallelizable.
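The class-wise independence noted above can be seen in a minimal sketch (hypothetical 1-D Gaussian emissions rather than full HMMs): each class's parameters are estimated from that class's samples alone, so competing classes never enter the estimate.

```python
# Hypothetical per-class samples; each class is fit independently under ML.
data = {"c1": [1.0, 2.0, 3.0], "c2": [8.0, 9.0, 10.0]}

def ml_gaussian(samples):
    """ML estimate of a 1-D Gaussian: sample mean and (biased) ML variance."""
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    return mean, var

# Removing or changing class c2 leaves the estimate for c1 untouched.
params = {c: ml_gaussian(xs) for c, xs in data.items()}
print(params["c1"])
```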
Maximum Mutual Information (MMI)

- optimization of the conditional probability:

      arg max_θ Σ_r log p_θ(W_r|X_r) = arg max_θ Σ_r log[ p(W_r) p_θ(X_r|W_r) / Σ_V p(V) p_θ(X_r|V) ]

- Considers competing classes and therefore decision boundaries.
- Necessitates a set of competing classes on the training data.
- Optimization for standard modeling (HMMs, mixture distributions): only gradient descent or similar.
- Optimization using log-linear modeling: convex problem.
- First application of MMI for ASR using discrete HMMs [Bahl+ 1986]: 2000 isolated words, 18% rel. improvement in word error rate.
- MMI for discrete and continuous probability densities [Brown 1987]: isolated E-set letters, 18% rel. improvement in recognition rate.
- MMI for discrete and continuous probability densities [Normandin 1991]: digit strings, up to 50% rel. improvement in string error rate.
Error-Based Training Criteria
Objective: optimize some error measure directly, e.g.:

- Empirical recognition error on the training data
  - Advantage: direct relation to the decision rule
  - Problem: non-differentiable training criterion; differentiable approximations used in practice
  - Problem: ASR classes (words/word sequences) difficult to handle
- Model-based expected error on the training data
  - Advantage: word or phoneme error easy to handle
  - Usually an approximated word/phoneme error, but the correct edit distance is also viable [Heigold+ 2005]
  - Relation to the decision rule less straightforward
  - Overtraining and generalization become an issue (→ regularization, margin)
Minimum classification error (MCE)

- For ASR: minimization of the smoothed empirical sentence error [Juang & Katagiri 1992, Chou+ 1992]:

      arg min_θ (1/R) Σ_{r=1}^{R}  1 / ( 1 + [ p_θ^α(X_r|W_r) · p^α(W_r) / Σ_{W≠W_r} p_θ^α(X_r|W) · p^α(W) ]^{2ρ} )

- Smoothing parameters α and ρ.
- Upper bound on the Bayes error rate for any acoustic model [Schluter+ 2001].
- Lesser effect of incorrect model assumptions.
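A minimal sketch of the smoothed sentence-error term for a single utterance (hypothetical scores; computed in the log domain for numerical stability): when the reference dominates the competitors the loss approaches 0, and when it is dominated the loss approaches 1, i.e. one sentence error.

```python
import math

def mce_loss(ref_logp, comp_logps, alpha=1.0, rho=1.0):
    """Smoothed 0/1 loss 1 / (1 + (p_ref / sum_competitors)^(2*rho)),
    with all scores scaled by alpha; inputs are joint log-probabilities."""
    ref = alpha * ref_logp
    comp = math.log(sum(math.exp(alpha * lp) for lp in comp_logps))
    d = ref - comp  # log ratio: reference mass vs. competing mass
    return 1.0 / (1.0 + math.exp(2.0 * rho * d))

# Reference clearly wins -> loss near 0; reference clearly loses -> loss near 1.
print(mce_loss(-1.0, [-5.0, -6.0]), mce_loss(-10.0, [-1.0]))
```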
Minimum word/phone error (MWE/MPE)

- minimization of the model-based expected word/phone error on the training data [Povey & Woodland 2002]:

      arg max_θ Σ_{r=1}^{R} [ Σ_W A(W,W_r) p(W) p_θ(X_r|W) / Σ_W p(W) p_θ(X_r|W) ]

- Criterion: maximum expected accuracy A(W,W_r).
- Accuracy usually approximate, but the exact case based on the edit (Levenshtein) distance is also possible [Heigold+ 2005].
- Regularization (e.g. I-smoothing [Povey & Woodland 2002]) necessary due to overtraining.
- Usually better than MMI and MCE.
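The expected-accuracy objective above can be sketched for one utterance with the exact edit (Levenshtein) distance as the accuracy, A(W, W_r) = #reference words - edit distance (hypothetical toy hypotheses with already-normalized joint scores):

```python
def levenshtein(a, b):
    """Edit distance between word sequences a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, wa in enumerate(a, 1):
        cur = [i]
        for j, wb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (wa != wb)))
        prev = cur
    return prev[-1]

def expected_accuracy(ref, hyps):
    """sum_W A(W,Wr) p(W) p(X|W) / sum_W p(W) p(X|W)."""
    den = sum(p for _, p in hyps)
    num = sum((len(ref) - levenshtein(w, ref)) * p for w, p in hyps)
    return num / den

ref = ["the", "cat", "sat"]
hyps = [(["the", "cat", "sat"], 0.5), (["the", "bat", "sat"], 0.3), (["cat", "sat"], 0.2)]
print(expected_accuracy(ref, hyps))
```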
Practical Issues
- Importance of the language model in training the acoustic model.
- Relative and absolute scaling of the language and acoustic models in training.
- Necessity of recognizing the training data.
- Efficient calculation of discriminative training statistics using word lattices.
Language Models for Discriminative Training

Potential importance of the language model choice:

- language model for recognition of alternative word sequences
- language model dependence of the discriminative training criterion itself
- interaction of the language model with the acoustic model parameters

Correlation hypothesis: only those acoustic models need optimization which, even together with a language model, do not discriminate sufficiently.
→ the language model choice would correlate for training and recognition

Masking hypothesis: the language model usually improves recognition accuracy considerably and might mask deficiencies of the acoustic models.
→ suboptimal language models for training would give better performance
Language Model for Discriminative Training

- Discriminative training includes the language model.
- In training, a unigram language model usually leads to the best word error rates [Schluter+ 1999] (WSJ 5k); word error rates in [%]:

  recog LM   train LM   criterion   dev    eval   dev & eval
  bi         -          ML          6.91   6.78   6.86
  bi         zero       MMI         6.71   6.03   6.41
  bi         uni        MMI         6.59   6.00   6.33  (-8%)
  bi         bi         MMI         6.71   6.20   6.48
  bi         tri        MMI         6.87   6.54   6.72
  tri        -          ML          4.82   4.11   4.51
  tri        zero       MMI         4.63   4.05   4.38
  tri        uni        MMI         4.30   3.64   4.01  (-11%)
  tri        bi         MMI         4.48   3.94   4.24
  tri        tri        MMI         4.58   4.00   4.33
Scaling of likelihoods

- recognition: absolute scaling of likelihoods is irrelevant (language model scale vs. acoustic model scale)
- absolute scaling does have an impact on word posterior calculation [Wessel+ 1998, Woodland & Povey 2000]
- use the language model scale β also in training:

      p(X,W) = p(W)^β p_θ(X|W)

- replace p(X,W) with:

      p(X,W)^γ = p(W)^{βγ} p_θ(X|W)^γ   for γ ∈ [0, 1]

- optimum approximately for γ = 1/β, i.e. use

      p(X,W)^{1/β} = p(W) p_θ(X|W)^{1/β}

- For simplicity, usually omitted in the equations here.
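The effect of the global scale γ on the posteriors can be sketched numerically (hypothetical joint log scores; a typical language model scale β around 10-20 is an assumption here, not a value from the slides): unscaled posteriors are extremely sharp, while scaling by γ = 1/β spreads probability mass over the competitors.

```python
import math

# Hypothetical joint log scores log p(X, W) for competing word sequences.
scores = {"W1": -100.0, "W2": -101.0, "W3": -103.0}

def scaled_posteriors(scores, gamma):
    """p(W|X) with all joint scores scaled by gamma before normalization."""
    m = max(scores.values())  # subtract max for numerical stability
    exps = {w: math.exp(gamma * (s - m)) for w, s in scores.items()}
    z = sum(exps.values())
    return {w: e / z for w, e in exps.items()}

sharp = scaled_posteriors(scores, 1.0)        # gamma = 1: near winner-takes-all
flat = scaled_posteriors(scores, 1.0 / 15.0)  # gamma = 1/beta: smoother posteriors
print(round(sharp["W1"], 3), round(flat["W1"], 3))
```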
Competing Word Sequences
- Problem: exponential number of competing word sequences.
- Competing word sequences need to be estimated:
  - hypothesis generation on training data using the recognizer
  - initial lattice generation using the recognizer is sufficient
  - later acoustic model rescoring constrained to the lattice
- Representation and processing of competing word sequences:
  - efficient algorithms to process word lattices
  - generic implementation: weighted finite state transducers
Competing Word Sequences
History:

- best recognized word sequence for MMI (corrective training) [Normandin 1991]:
  - considers incorrectly recognized training sentences only
- best incorrectly recognized word sequence for MCE [Juang & Katagiri 1992]:
  - interpretation as a smoothed sentence error still valid
- N-best recognized word sequences for MMI [Chow 1990]:
  - continuous speech recognition, 1000 words
  - only minor improvements in word error rate
- word graphs from recognition for MMI training [Valtchev+ 1997]:
  - large vocabulary, 64k words
  - efficient implementation
  - 5-10% relative improvement in word error rate
Comparative Experimental Results
WER [%]:

         SieTill   WSJ 5k         EPPS English            Mandarin BN/BC
  Crit.  Test      Dev    Evl     Dev06   Evl06   Evl07   Dev07   Evl06
  ML     1.81      4.55   3.74    14.4    10.8    12.0    15.1    21.9
  MMI    1.79      4.07   3.53    13.8    11.0    12.0    14.4    20.8
  MCE    1.69      4.02   3.47    13.8    11.0    11.9
  MWE              3.98   3.44
  MPE              4.17   3.62    13.4    10.2    11.5    14.2    20.6

- SieTill [Schluter 2000]
- WSJ 5k [Macherey 2010]
- EPPS/broadcasts [Heigold 2010]
Outline
Introduction
Training Criteria
Parameter Optimization
    Motivation
    Gradient descent
    Rprop
    Formal gradient of MMI
    Formal gradient of MPE
Efficient Calculation of Discriminative Statistics
Generalisation to Log-Linear Modeling
Convex Optimization
Incorporation of Margin Concept
Conclusions
Annex
Motivation
Goal: an optimization method for discriminative training criteria F(θ) w.r.t. the set of parameters θ that provides reasonable convergence.

- Various approaches, e.g.:
  - extended Baum-Welch (EBW) [Normandin 1991]
  - gradient descent, study: e.g. [Valtchev 1995]
  - MMI with log-linear models: generalized iterative scaling (GIS)
  - generalization of GIS to log-linear models with hidden variables and further criteria like MPE and MCE [Heigold+ 2008a]
- Problems:
  - robust setting of step sizes/iteration constants (EBW and gradient descent)
  - convergence speed (especially GIS)
Extended Baum-Welch
- Motivated by a growth transformation [Gopalakrishnan+ 1991].
- Widely used for discriminative training of Gaussian mixture HMMs, e.g. [Normandin 1991, Valtchev+ 1997, Schluter 2000, Woodland & Povey 2002].
- Highly optimized heuristics for finding the right order of magnitude of the iteration constants.
- Training of Gaussian mixture HMMs: positive variances required to obtain an estimate for the iteration constants.
Gradient descent
Follow the gradient to optimize the parameters:

    θ ← θ + γ ∇_θ F(θ)

Step sizes:

- heuristic, e.g. for MCE [Chou+ 1992]
- by comparison to EBW [Schluter 2000]

Convergence:

- local optimum
- better convergence: general-purpose approaches, e.g. Qprop, Rprop, or L-BFGS; for experimental comparisons see [McDermott & Katagiri 2005, McDermott+ 2007, Gunawardana+ 2005, Mahajan+ 2006]
Rprop [Riedmiller & Braun 1993]
General-purpose gradient-based optimization:

- assume iteration n
- parameter update:

      θ_i^(n+1) = θ_i^(n) + γ_i^(n) · sign( ∂F(θ^(n)) / ∂θ_i )

- update of the step sizes γ_i^(n):

      γ_i^(n+1) = min{ γ_i^(n) · η+, γ_max }   if ∂F(θ^(n))/∂θ_i · ∂F(θ^(n-1))/∂θ_i > 0
                  max{ γ_i^(n) · η-, γ_min }   if ∂F(θ^(n))/∂θ_i · ∂F(θ^(n-1))/∂θ_i < 0
                  γ_i^(n)                      otherwise

- η+ ∈ (1, ∞), η- ∈ (0, 1)
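The update rules above can be sketched directly (a simple Rprop variant without weight-backtracking; the toy criterion and all constants are illustrative assumptions, not values from the slides):

```python
def rprop_step(theta, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=1.0):
    """One Rprop iteration: move each parameter by its own step size in the
    direction of the sign of its partial derivative (ascent on F)."""
    new_theta, new_step = [], []
    for i in range(len(theta)):
        if grad[i] * prev_grad[i] > 0:        # same sign twice: accelerate
            s_i = min(step[i] * eta_plus, step_max)
        elif grad[i] * prev_grad[i] < 0:      # sign flip: slow down
            s_i = max(step[i] * eta_minus, step_min)
        else:
            s_i = step[i]
        sign = (grad[i] > 0) - (grad[i] < 0)  # sign of the partial derivative
        new_theta.append(theta[i] + s_i * sign)
        new_step.append(s_i)
    return new_theta, new_step

# Toy criterion F(theta) = -(theta - 3)^2 with gradient -2(theta - 3); maximum at 3.
theta, step, prev = [0.0], [0.1], [0.0]
for _ in range(100):
    grad = [-2.0 * (theta[0] - 3.0)]
    theta, step = rprop_step(theta, grad, prev, step)
    prev = grad
print(theta[0])
```

Note that only the sign of the gradient enters the update, which is what makes Rprop robust against the widely varying gradient magnitudes of discriminative criteria.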
Formal gradient of MMI
I notation:
  I W: word sequence w_1, ..., w_N
  I r: index of training segment/utterance given by (X_r, W_r)
  I X_r: acoustic observation vector sequence x_r1, ..., x_rT
  I W_r: reference/spoken word sequence w_r1, ..., w_rN
  I s_1^T: HMM state sequence s_1, ..., s_T
I MMI training criterion:

  F_MMI(θ) = Σ_r log( p(W_r) p_θ(X_r|W_r) / Σ_W p(W) p_θ(X_r|W) )
           = Σ_r ( log p(W_r) p_θ(X_r|W_r) − log Σ_W p(W) p_θ(X_r|W) )

I acoustic model (HMM):

  p_θ(X_r|W) = Σ_{s_1^{T_r}: W} Π_{t=1}^{T_r} p(s_t|s_{t−1}) p_θ(x_rt|s_t)
Derivative of MMI w.r.t. Parameter
Gradient of the MMI criterion:

∇_θ F_MMI(θ) = Σ_r ( ∇_θ log p_θ(X_r|W_r) − Σ_W p(W) p_θ(X_r|W) ∇_θ log p_θ(X_r|W) / Σ_{W'} p(W') p_θ(X_r|W') )

For efficient evaluation, consider the derivative of the acoustic model, ∇_θ log p_θ(X_r|W).
Derivative of MMI w.r.t. Parameter
Derivative of the acoustic model:

∇_θ log p_θ(X_r|W)
  = ∇_θ log Σ_{s_1^{T_r}: W} Π_{t=1}^{T_r} p_θ(x_rt|s_t) p(s_t|s_{t−1})
  = Σ_{t=1}^{T_r} Σ_{s_1^{T_r}: W} ( ∇_θ log p_θ(x_rt|s_t) ) · Π_{t'=1}^{T_r} p_θ(x_rt'|s_t') p(s_t'|s_{t'−1}) / Σ_{σ_1^{T_r}: W} Π_{τ=1}^{T_r} p_θ(x_rτ|σ_τ) p(σ_τ|σ_{τ−1})
  = Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_rt|s) ) · Σ_{s_1^{T_r}: s_t=s} p_θ(X_r, s_1^{T_r}|W) / p_θ(X_r|W)
  = Σ_{t=1}^{T_r} Σ_s γ_rt(s|W) · ∇_θ log p_θ(x_rt|s)

with the word sequence conditioned state posterior (occupancy):

γ_rt(s|W) = Σ_{s_1^{T_r}: s_t=s} p_θ(X_r, s_1^{T_r}|W) / p_θ(X_r|W) = p_θ,t(s|X_r, W)
Derivative of MMI w.r.t. Parameter
Resubstitute the derivative of the acoustic model into the derivative of the MMI criterion:

∇_θ F_MMI(θ) = Σ_r Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_rt|s) ) · ( γ_rt(s|W_r) − Σ_W p(W) p_θ(X_r|W) γ_rt(s|W) / Σ_{W'} p(W') p_θ(X_r|W') )
             = Σ_r Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_rt|s) ) · ( γ_rt(s|W_r) − γ_rt(s) )

with the general state posterior (occupancy):

γ_rt(s) = Σ_W p(W) p_θ(X_r|W) γ_rt(s|W) / Σ_{W'} p(W') p_θ(X_r|W') = p_θ,t(s|X_r)
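As a sanity check on this occupancy form of the gradient, consider a toy example (an illustration, not from the slides): two competing "word" hypotheses with fixed Viterbi state alignments, 1-D unit-variance Gaussian emission models, and a uniform language model. The analytic gradient built from γ_rt(s|W_r) − γ_rt(s) can then be verified against finite differences:

```python
import math

# toy setup: 1-D observations, 2 HMM states, 2 word hypotheses;
# each hypothesis has a fixed (Viterbi) state alignment s_t(W)
x = [0.3, -0.4, 1.2]                        # observations x_1..x_T
align = {"W1": [0, 0, 1], "W2": [1, 1, 0]}  # state alignments per hypothesis
ref = "W1"                                  # spoken/reference word sequence

def log_gauss(xt, m):                       # log N(x; m, 1)
    return -0.5 * math.log(2 * math.pi) - 0.5 * (xt - m) ** 2

def log_p_acoustic(mu, W):                  # log p(X|W), Viterbi approximation
    return sum(log_gauss(xt, mu[s]) for xt, s in zip(x, align[W]))

def f_mmi(mu):                              # MMI criterion, uniform LM prior
    num = log_p_acoustic(mu, ref)
    den = math.log(sum(math.exp(log_p_acoustic(mu, W)) for W in align))
    return num - den

def grad_mmi(mu):
    # word posteriors p(W|X) give the denominator occupancies gamma_rt(s)
    scores = {W: math.exp(log_p_acoustic(mu, W)) for W in align}
    Z = sum(scores.values())
    post = {W: scores[W] / Z for W in align}
    grad = [0.0, 0.0]
    for t, xt in enumerate(x):
        for s in (0, 1):
            gamma_num = 1.0 if align[ref][t] == s else 0.0
            gamma_den = sum(post[W] for W in align if align[W][t] == s)
            # d/dmu_s log N(x_t; mu_s, 1) = (x_t - mu_s)
            grad[s] += (gamma_num - gamma_den) * (xt - mu[s])
    return grad

mu = [0.0, 1.0]
g = grad_mmi(mu)
eps = 1e-6
for s in (0, 1):                            # finite-difference check
    mu_p = list(mu); mu_p[s] += eps
    mu_m = list(mu); mu_m[s] -= eps
    assert abs(g[s] - (f_mmi(mu_p) - f_mmi(mu_m)) / (2 * eps)) < 1e-6
```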
Efficient Calculation of State Occupancies
In general:
I efficient calculation of the spoken word sequence conditioned state occupancy γ_rt(s|W_r): forward-backward state probabilities on the trellis of the word sequence
I efficient calculation of the general state occupancy γ_rt(s): forward-backward probabilities on the trellis of a word lattice

Viterbi approximation:
I γ_rt(s|W) = δ_{s, s_rt(W)} with forced alignment S_r(W) = s_r1(W), ..., s_rT_r(W) of the spoken word sequence
I assume a (word) lattice M_r for utterance r, with edges ω representing a word w(ω) (in context) with start time t_s(ω) and end time t_e(ω), and a corresponding forced alignment s_{t_s}^{t_e}(ω). An edge sequence W ∈ M_r then corresponds to the word sequence W(W). Consequently, the language model and acoustic model can also be defined for an edge sequence, which then might specify word boundaries, phonetic and language model context.
Word Posterior Probabilities
For the general state occupancy in Viterbi approximation we obtain:

γ_rt(s) = Σ_W p(W) p_θ(X_r|W) δ_{s, s_rt(W)} / p_θ(X_r)
        = Σ_ω δ_{s, s_rt(ω)} · Σ_{W: ω∈W} p(W) p_θ(X_r|W) / p_θ(X_r)
        = Σ_ω δ_{s, s_rt(ω)} · p(ω|X_r)

with the edge (or word in context) posterior

p(ω|X_r) = Σ_{W: ω∈W} p(W) p_θ(X_r|W) / p_θ(X_r)

A forward-backward algorithm is used to efficiently compute edge (word in context) posterior probabilities using word lattices.
Formal gradient of MPE
I A(W, W_r): accuracy (negated error) between string W and W_r
I example (MPE): approximate phone accuracy [Povey & Woodland 2002]
I expectation of accuracy:

  E_θ[A(·, W_r)] := Σ_W A(W, W_r) · p(W) p_θ(X_r|W) / Σ_{W'} p(W') p_θ(X_r|W')

I MPE training criterion:

  F_MPE(θ) = Σ_r E_θ[A(·, W_r)]
Derivative of MPE w.r.t. Parameter
Derivative of the MPE criterion, using ∇_θ p_θ(X_r|W) = p_θ(X_r|W) · ( ∇_θ log p_θ(X_r|W) ):

∇_θ F_MPE(θ) = Σ_r Σ_W ( A(W, W_r) − E_θ[A(·, W_r)] ) · ( ∇_θ log p_θ(X_r|W) ) · p(W) p_θ(X_r|W) / Σ_{W'} p(W') p_θ(X_r|W')

For efficient evaluation, consider the derivative of the acoustic model:

∇_θ log p_θ(X_r|W) = Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_rt|s) ) · Σ_{s_1^{T_r}: s_t=s} p_θ(X_r, s_1^{T_r}|W) / p_θ(X_r|W)
Derivative of MPE w.r.t. Parameter
Resubstitute the derivative of the acoustic model into the derivative of the MPE criterion:

∇_θ F_MPE(θ) = Σ_r Σ_{t=1}^{T_r} Σ_s ( ∇_θ log p_θ(x_rt|s) ) · γ_rt(s)

with the general state accuracy:

γ_rt(s) = Σ_W ( A(W, W_r) − E_θ[A(·, W_r)] ) · Σ_{s_1^{T_r}: s_t=s} p(W) p_θ(X_r, s_1^{T_r}|W) / Σ_{W'} p(W') p_θ(X_r|W')

which can be computed efficiently, similar to the case of the general state occupancies.
Efficient Calculation of State Accuracy
In general:
I assumption: A(W, W_r) = Σ_{t=1}^{T_r} A(s_rt(W), s_rt(W_r))
I example: approximate phone accuracy [Povey & Woodland 2002]
I efficient calculation of the general state accuracy γ_rt(s): forward-backward accuracies on the trellis of a word lattice [Povey & Woodland 2002]
Word Posterior Accuracies
For the general state accuracy in Viterbi approximation we obtain:

γ_rt(s) = Σ_W ( A(W, W_r) − E_θ[A(·, W_r)] ) · p(W) p_θ(X_r|W) δ_{s, s_rt(W)} / p_θ(X_r)
        = Σ_ω δ_{s, s_rt(ω)} · Σ_{W: ω∈W} ( A(W, W_r) − E_θ[A(·, W_r)] ) · p(W) p_θ(X_r|W) / p_θ(X_r)
        = Σ_ω δ_{s, s_rt(ω)} · p(ω|X_r)

with the edge (or word in context) posterior accuracies

p(ω|X_r) = Σ_{W: ω∈W} ( A(W, W_r) − E_θ[A(·, W_r)] ) · p(W) p_θ(X_r|W) / p_θ(X_r)

Later, an efficient way of computing edge (word in context) posterior accuracies using word lattices will be presented.
Outline
Introduction
Training Criteria
Parameter Optimization
Efficient Calculation of Discriminative Statistics
  Forward/Backward Probabilities on Word Lattices
  Generalized FB Probabilities on WFSTs
Generalisation to Log-Linear Modeling
Convex Optimization
Incorporation of Margin Concept
Conclusions
Annex
Forward/Backward Probabilities on Word Lattices
I Let ω_s(W) and ω_e(W) be the first and last edge of a continuous edge sequence W on a word lattice.
I Assume that the lattice fully encodes the language model context:

  p(W(W)) = p(W = ω_1^N) = Π_{n=1}^N p(ω_n|ω_{n−1})

Let ω_ri and ω_rf be the initial and final edges of a word lattice for utterance r. Then define the following forward (Φ) and backward (Ψ) probabilities on initial and final partial edge sequences on the word lattice, respectively:

Φ(ω) = Σ_{W: ω_s(W)=ω_ri, ω_e(W)=ω} p(W) p_θ(x_r1^{t_e(ω)}|W)

Ψ(ω) = Σ_{W: ω_s(W)=ω, ω_e(W)=ω_rf} p(W) p_θ(x_r,t_s(ω)^{T_r}|W)
Forward/Backward Probabilities on Word Lattices
For the forward probability a recursion formula can be derived by separating the last edge from the edge sequence in the summation, with ≺ denoting direct predecessor edges:

Φ(ω) = Σ_{W: ω_s(W)=ω_ri, ω_e(W)=ω} p(W) p_θ(x_r1^{t_e(ω)}|W)
     = Σ_{ω'≺ω} Σ_{W': ω_s(W')=ω_ri, ω_e(W')=ω'} p(W') p(ω|ω') p_θ(x_r1^{t_e(ω')}|W') p_θ(x_r,t_s(ω)^{t_e(ω)}|ω)
     = Σ_{ω'≺ω} Φ(ω') p(ω|ω') p_θ(x_r,t_s(ω)^{t_e(ω)}|ω).

Using this recursion formula, the forward probabilities can be calculated efficiently on word lattices.
Forward/Backward Probabilities on Word Lattices
Similar to the forward probabilities, a recursion formula can be derived for efficient calculation of the backward probabilities, with ≻ denoting direct successor edges:

Ψ(ω) = Σ_{W: ω_s(W)≻ω, ω_e(W)=ω_rf} p(W) p_θ(x_r,t_s(ω)^{T_r}|ωW)
     = Σ_{ω'≻ω} Σ_{W': ω_s(W')≻ω', ω_e(W')=ω_rf} p(ω'|ω) p(W') p_θ(x_r,t_s(ω)^{t_e(ω)}|ω) p_θ(x_r,t_s(ω')^{T_r}|ω'W')
     = Σ_{ω'≻ω} p_θ(x_r,t_s(ω)^{t_e(ω)}|ω) p(ω'|ω) Ψ(ω')
Forward/Backward Probabilities on Word Lattices
Using the forward and backward probabilities, the edge/word posterior on a word lattice can be written as

p(ω|X_r) = Φ(ω) · Σ_{ω'≻ω} p(ω'|ω) Ψ(ω') / Φ(ω_rf)

with p_θ(X_r) = Φ(ω_rf) = Ψ(ω_ri).

Word posterior probabilities follow naturally from MPE and similar discriminative training criteria. They are also the basis for confidence measures, which are used for unsupervised training, adaptation, or dialog systems, and they are part of approximate approaches to Bayes' decision rule with word error cost, like confusion networks [Mangu+ 1999] or minimum frame word error [Wessel+ 2001a, Hoffmeister+ 2006].
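The two recursions and the posterior formula can be sketched on a toy lattice; the edge names, acoustic scores, and LM values below are illustrative, and a brute-force path enumeration verifies the forward-backward result:

```python
from collections import defaultdict

# Toy lattice: each edge carries an acoustic score p(x_{ts..te}|edge);
# "ri"/"rf" are dummy initial/final edges with score 1. All values invented.
edges = {"ri": 1.0, "a": 0.6, "b": 0.3, "c": 0.5, "d": 0.2, "rf": 1.0}
succ = {"ri": ["a", "b"], "a": ["c", "d"], "b": ["c"],
        "c": ["rf"], "d": ["rf"], "rf": []}
lm = defaultdict(lambda: 1.0)              # bigram edge LM p(edge'|edge)
lm[("ri", "a")] = 0.7; lm[("ri", "b")] = 0.3

pred = defaultdict(list)
for e, ss in succ.items():
    for s in ss:
        pred[s].append(e)

topo = ["ri", "a", "b", "c", "d", "rf"]    # topological order of edges

# forward: Phi(w) = sum_{w' < w} Phi(w') * p(w|w') * ac(w)
Phi = {"ri": edges["ri"]}
for w in topo[1:]:
    Phi[w] = edges[w] * sum(Phi[p] * lm[(p, w)] for p in pred[w])

# backward: Psi(w) = ac(w) * sum_{w' > w} p(w'|w) * Psi(w')
Psi = {"rf": edges["rf"]}
for w in reversed(topo[:-1]):
    Psi[w] = edges[w] * sum(lm[(w, s)] * Psi[s] for s in succ[w])

total = Phi["rf"]                          # p(X_r) = Phi(w_rf) = Psi(w_ri)
assert abs(total - Psi["ri"]) < 1e-12

def posterior(w):                          # edge posterior p(w|X_r)
    if w == "rf":
        return Phi["rf"] / total
    return Phi[w] * sum(lm[(w, s)] * Psi[s] for s in succ[w]) / total

def paths(w):                              # brute force: enumerate all paths
    if w == "rf":
        return [(["rf"], edges["rf"])]
    return [([w] + p, edges[w] * lm[(w, s)] * sc)
            for s in succ[w] for p, sc in paths(s)]

all_paths = paths("ri")
for w in topo:                             # posterior = mass of paths through w
    brute = sum(sc for p, sc in all_paths if w in p) / total
    assert abs(posterior(w) - brute) < 1e-12
```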
FB Probabilities: Generalization to WFSTs
Replace the word lattice with a WFST
I edge label: word with pronunciation
I weight of edge ω: p ← p(ω|ω') · p_θ(x_r,t_s(ω)^{t_e(ω)}|ω)
I semiring: substitute the arithmetic operations (multiplication, addition, inversion) with the operations of the probability semiring

  Semiring      IK    p ⊕ p'   p ⊗ p'   0   1   inv(p)
  probability   IR+   p + p'   p · p'   0   1   1/p
Example WFST from SieTill, Wr =”drei sechs neun” (in red)
[WFST figure: states 0-9 with edges labeled word/pronunciation/score, e.g. "drei /3//−384", "sechs /6//−1014", "[SIL] /si//−2618"; the reference path is marked in red]
FB Probabilities: Generalization to WFSTs
Forward probabilities (pre(ω) ∈ W such that pre(ω) ≺ ω):

Φ(ω) := ⊕_{W: ω_s(W)=ω_ri, ω_e(W)=ω} ⊗_{ω∈W} p(ω|pre(ω)) ⊗ p_θ(x_r,t_s(ω)^{t_e(ω)}|ω)
      = ⊕_{ω'≺ω} Φ(ω') ⊗ p(ω|ω') ⊗ p_θ(x_r,t_s(ω)^{t_e(ω)}|ω)

Backward probabilities: similar

Using the forward and backward probabilities, the edge posterior on a WFST can be written as

p(ω|X_r) = Φ(ω) ⊗ ( ⊕_{ω'≻ω} p(ω'|ω) ⊗ Ψ(ω') ) ⊗ inv(Φ(ω_rf))
Expectation Semiring
vector weight (p, v) of edge ω with
I p ← p(ω|ω') · p_θ(x_r,t_s(ω)^{t_e(ω)}|ω)
I v ← A(ω) · p
I edge accuracies A(ω) such that Σ_{ω∈W} A(ω) = A(W, W_r)
  I approximate phone accuracy [Povey & Woodland 2002] can be decomposed in this way
  I such a decomposition is not possible in general

expectation semiring [Eisner 2001]: vector semiring whose first component is a probability semiring

  Semiring      IK         (p,v) ⊕ (p',v')   (p,v) ⊗ (p',v')       0       1       inv(p,v)
  expectation   IR+ × IR   (p+p', v+v')      (p·p', p·v' + p'·v)   (0,0)   (1,0)   (1/p, −v/p²)
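A minimal sketch of the semiring operations (path probabilities and accuracies below are invented for illustration): ⊗-multiplying edge weights (p, A·p) along a path accumulates the path accuracy in the v-component, and ⊕-summing over paths yields the unnormalized expected accuracy:

```python
# expectation semiring operations on weight pairs (p, v)
def e_add(a, b):                           # (p,v) ⊕ (p',v')
    return (a[0] + b[0], a[1] + b[1])

def e_mul(a, b):                           # (p,v) ⊗ (p',v')
    return (a[0] * b[0], a[0] * b[1] + b[0] * a[1])

def edge(p, acc):                          # edge weight (p, A(edge) * p)
    return (p, acc * p)

# two competing 2-edge paths with probabilities 0.6 and 0.4
path1 = e_mul(edge(0.6, 1.0), edge(1.0, 1.0))   # path accuracy 1 + 1 = 2
path2 = e_mul(edge(0.4, 0.0), edge(1.0, 1.0))   # path accuracy 0 + 1 = 1
total = e_add(path1, path2)

# first component: total probability; second: sum_paths p(path) * A(path)
p_tot, v_tot = total
expected_accuracy = v_tot / p_tot
assert abs(p_tot - 1.0) < 1e-12
assert abs(expected_accuracy - (0.6 * 2 + 0.4 * 1)) < 1e-12
```

The ⊗-rule (p·p', p·v' + p'·v) is exactly what makes the v-component behave like a probability-weighted sum of per-edge accuracies.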
Edge Posteriors & Expectation Semiring
probability semiring
I word posterior probabilities (see MMI derivative) identical to edge posteriors using the probability semiring:

  p(ω|X_r) = p_probability(ω|X_r)

I intuitive and classical result [Rabiner 1989]

expectation semiring
I word posterior accuracies (see MPE derivative) identical to the v-component of the edge posteriors using the expectation semiring [Heigold+ 2008b]:

  p(ω|X_r) = p_expectation,v(ω|X_r)

I also use this identity to efficiently calculate
  I the derivative of the unified training criterion
  I the covariance between two random additive variables (related to the MPE derivative)
Outline
Introduction
Training Criteria
Parameter Optimization
Efficient Calculation of Discriminative Statistics
Generalisation to Log-Linear Modeling
  Transformation: Gaussian into Log-Linear Model
  Transformation from Log-Linear Model into Gaussian
Convex Optimization
Incorporation of Margin Concept
Conclusions
Annex
Definition of Models

assume feature vector x ∈ IR^D and class c ∈ {1, ..., C}

Gaussian model N(x|μ_c, Σ_c) with
I means μ_c ∈ IR^D
I positive-definite covariance matrices Σ_c ∈ IR^{D×D}
I priors p(c) ∈ IR+
induces the posterior

  p_θ(c|x) = p(c) N(x|μ_c, Σ_c) / Σ_{c'} p(c') N(x|μ_c', Σ_c')

Log-linear model with unconstrained parameters
I λ_c0 ∈ IR
I λ_c1 ∈ IR^D
I λ_c2 ∈ IR^{D×D}
with posterior

  exp( x^T λ_c2 x + λ_c1^T x + λ_c0 ) / Σ_{c'} exp( x^T λ_c'2 x + λ_c'1^T x + λ_c'0 )
Transformation: Gaussian into Log-Linear Model

Comparison of the terms quadratic, linear, and constant in the observations x leads to the transformation rules [Saul & Lee 2002, Gunawardana+ 2005]:

1. λ_c2 = −(1/2) Σ_c^{-1}
2. λ_c1 = Σ_c^{-1} μ_c
3. λ_c0 = −(1/2) ( μ_c^T Σ_c^{-1} μ_c + log |2πΣ_c| ) + log p(c)
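The rules can be checked numerically. The following sketch uses a toy 2-D, 3-class problem with randomly drawn parameters (all values illustrative) and verifies that the Gaussian and log-linear posteriors coincide:

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 2, 3
mu = rng.normal(size=(C, D))
Sigma = []
for _ in range(C):
    A = rng.normal(size=(D, D))
    Sigma.append(A @ A.T + D * np.eye(D))   # positive-definite covariance
prior = np.array([0.5, 0.3, 0.2])

def gauss_log_posterior(x):
    # log p(c|x) from priors and Gaussian class-conditional densities
    logp = np.array([np.log(prior[c])
                     - 0.5 * (x - mu[c]) @ np.linalg.solve(Sigma[c], x - mu[c])
                     - 0.5 * np.log(np.linalg.det(2 * np.pi * Sigma[c]))
                     for c in range(C)])
    return logp - np.logaddexp.reduce(logp)  # normalize over classes

# transformation rules 1.-3. from the slide
lam2 = [-0.5 * np.linalg.inv(Sigma[c]) for c in range(C)]
lam1 = [np.linalg.inv(Sigma[c]) @ mu[c] for c in range(C)]
lam0 = [-0.5 * (mu[c] @ np.linalg.inv(Sigma[c]) @ mu[c]
                + np.log(np.linalg.det(2 * np.pi * Sigma[c])))
        + np.log(prior[c]) for c in range(C)]

def loglinear_log_posterior(x):
    score = np.array([x @ lam2[c] @ x + lam1[c] @ x + lam0[c]
                      for c in range(C)])
    return score - np.logaddexp.reduce(score)

x = rng.normal(size=D)
assert np.allclose(gauss_log_posterior(x), loglinear_log_posterior(x))
```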
Transformation: Log-Linear into Gaussian Model
I invert the transformation from Gaussian to log-linear model:
  1. Σ_c = −(1/2) λ_c2^{-1}
  2. μ_c = Σ_c λ_c1
  3. p(c) = exp( λ_c0 + (1/2) ( μ_c^T Σ_c^{-1} μ_c + log |2πΣ_c| ) )
I problem: the parameter constraints are not satisfied in general
  I covariance matrices Σ_c must be positive-definite
  I priors p(c) must be normalized
I solution: the model parameters for the posterior are ambiguous, e.g. for ∆λ_2 ∈ IR^{D×D}, ∆λ_0 ∈ IR:

  exp( x^T(λ_c2 + ∆λ_2)x + λ_c1^T x + (λ_c0 + ∆λ_0) ) / Σ_{c'} exp( x^T(λ_c'2 + ∆λ_2)x + λ_c'1^T x + (λ_c'0 + ∆λ_0) )
    = exp( x^T λ_c2 x + λ_c1^T x + λ_c0 ) / Σ_{c'} exp( x^T λ_c'2 x + λ_c'1^T x + λ_c'0 )
Transformation: Log-Linear into Gaussian Model

Invert the transformation rules for the transformed log-linear model:

1. Σ_c = −(1/2) (λ_c2 + ∆λ_2)^{-1}
2. μ_c = Σ_c λ_c1
3. p(c) = exp( (λ_c0 + ∆λ_0) + (1/2) ( μ_c^T Σ_c^{-1} μ_c + log |2πΣ_c| ) )

Use the additional degrees of freedom to impose the parameter constraints:

I choose ∆λ_2 ∈ IR^{D×D} such that the λ_c2 + ∆λ_2 are negative-definite
I choose ∆λ_0 such that p(c) is normalized, i.e.,

  ∆λ_0 := − log Σ_c exp( λ_c0 + (1/2) ( μ_c^T Σ_c^{-1} μ_c + log |2πΣ_c| ) )
Outline
Introduction
Training Criteria
Parameter Optimization
Efficient Calculation of Discriminative Statistics
Generalisation to Log-Linear Modeling
Convex Optimization
  Motivation
  Convex Training Criteria in Speech Recognition
  Experimental Results
Incorporation of Margin Concept
Conclusions
Annex
Motivation
Conventional approach:
I depends on initialization and choice of optimization algorithm
I spurious local optima (non-convex training criterion)
I many heuristics required
I i.e., involves much engineering work
“Fool-proof” approach:
I unique optimum (independent of initialization)
I accessibility of global optimum (convex training criterion)
I joint optimization of all model parameters, no parameters tobe tuned
Assumptions
Assumptions to cast the HCRF into a CRF:
I log-linear parameterization, e.g. p(x|s) = exp( x^T λ_s2 x + λ_s1^T x + λ_s0 ) and p(s|s') = exp(α_s's)
I MMI-like training criterion
I alignment represents the spoken sequence
I alignment of the spoken sequence known and kept fixed
I use single densities with augmented features instead of mixtures
I exact normalization constant
Lattice-Based MMI

F_lattice(λ) = Σ_r log( Σ_{s_1^{T_r} ∈ N_r} p_λ(x_1^{T_r}, s_1^{T_r}) / Σ_{s_1^{T_r} ∈ D_r} p_λ(x_1^{T_r}, s_1^{T_r}) )

I numerator word lattice N_r: state sequences s_1^T representing the correct hypothesis
I denominator word lattice D_r: correct and competing state sequences; use word pair approximation and pruning
I non-convex

[figure: word lattice with time-stamped nodes and word arcs such as "fünf", "eins", "sechs", "acht", "zwei", "neun", "drei", "[SIL]"]
Fool-Proof MMI

F_fool(λ) = Σ_r log( p_λ(x_1^{T_r}, s_1^{T_r}) / Σ_{s_1^{T_r} ∈ S_r} p_λ(x_1^{T_r}, s_1^{T_r}) )

I consider only the best state sequence s_1^T in the numerator, kept fixed
I sum over the full state sequence network S_r in the denominator
I convex

[figure: part of a (pruned) HMM state network, with ε-transitions and arcs labeled state:word, e.g. "7:acht", "18:acht"]
Frame-Based MMI

F_frame(λ) = Σ_r Σ_{t=1}^{T_r} log( p_λ(x_t, s_t) / Σ_{s=1}^S p_λ(x_t, s) )

I frame discrimination, cf. hybrid approach
I assume alignment s_1^T for the numerator, kept fixed
I summation over all HMM states s ∈ {1, ..., S} in the denominator
I convex
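With a log-linear state model p_λ(x, s) ∝ exp(λ_s^T f(x)), the frame-based criterion is exactly the multinomial-logistic log-likelihood over states, hence concave in λ. A toy numerical check of the midpoint condition (features, alignments, and dimensions below are illustrative):

```python
import math, random

random.seed(0)
frames = [(0.5, 0), (-1.2, 2), (0.3, 1), (2.0, 0)]  # (x_t, aligned state s_t)

def feat(x):
    return [1.0, x]              # f(x): bias + first-order feature

def f_frame(lam):                # frame-based criterion, to be maximized
    total = 0.0
    for x, s in frames:
        f = feat(x)
        scores = [sum(l * fi for l, fi in zip(lam[c], f)) for c in range(3)]
        m = max(scores)          # stable log-sum-exp over all states
        log_z = m + math.log(sum(math.exp(sc - m) for sc in scores))
        total += scores[s] - log_z
    return total

# concavity check: F((a+b)/2) >= (F(a) + F(b)) / 2 for random parameter pairs
for _ in range(100):
    a = [[random.gauss(0, 2) for _ in range(2)] for _ in range(3)]
    b = [[random.gauss(0, 2) for _ in range(2)] for _ in range(3)]
    mid = [[(ai + bi) / 2 for ai, bi in zip(ar, br)] for ar, br in zip(a, b)]
    assert f_frame(mid) >= (f_frame(a) + f_frame(b)) / 2 - 1e-9
```

Each summand is linear-minus-log-sum-exp in λ, and log-sum-exp is convex, which is all the concavity argument needs.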
Refinements to MMI
These refinements do not break convexity:
I ℓ2-regularization
I margin term
Initialization

Analyze the effect of the initial parameters on training:
I vary the initialization for different training criteria
I experiments: digit strings (SieTill, German, telephone)

[figure: training criterion F and WER [%] over 300 iterations for frame-based, lattice-based, and fool-proof M-MMI, initialized from scratch, from ML, and from the frame-based result]
Read Speech (WSJ)

I 5k-vocabulary, trigram language model
I phone-based HMMs, 1,500 CART-tied triphones
I audio data: 15h (training), 0.4h (test)
I log-linear model with kernel-like features f(x)
  I first-order (f_d(x) = x_d) and second-order (f_dd'(x) = x_d · x_d') features
  I cluster features: assume a GMM of the marginal distribution, p(x) = Σ_l p(x, l):

    f_l(x) = p(l|x) if p(l|x) ≥ threshold, 0 otherwise

I starting from scratch (model) and linear segmentation
I frame-based MMI, with re-alignments
I details: [Wiesler+ 2009]
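The cluster features can be sketched as follows; the mixture parameters and threshold are illustrative assumptions, not the values used in [Wiesler+ 2009]:

```python
import math

weights = [0.5, 0.3, 0.2]      # mixture weights p(l), illustrative
means = [-1.0, 0.5, 2.0]       # 1-D component means, unit variances
THRESH = 0.1                   # posterior threshold, illustrative

def cluster_features(x):
    # posteriors p(l|x) of the marginal GMM p(x) = sum_l p(l) N(x; mu_l, 1)
    joint = [w * math.exp(-0.5 * (x - m) ** 2) for w, m in zip(weights, means)]
    z = sum(joint)
    post = [j / z for j in joint]
    # keep only confident cluster posteriors, zero out the rest
    return [p if p >= THRESH else 0.0 for p in post]

f = cluster_features(0.4)
assert len(f) == len(weights)
assert all(p == 0.0 or p >= THRESH for p in f)
```

The thresholding makes the feature vector sparse, which keeps the log-linear model tractable when many cluster features are added.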
Read Speech (WSJ)

Feature setup                                              WER [%]
First order features, monophones                           22.7
+ second order features                                    10.3
+ 210 cluster features + temporal context of size 9         6.2
+ 1,500 CART-tied HMM states (triphones)                    3.9
+ realignment                                               3.6
GHMM (ML)                                                   3.6
GHMM (MMI)                                                  3.0
Outline

Introduction
Training Criteria
Parameter Optimization
Efficient Calculation of Discriminative Statistics
Generalisation to Log-Linear Modeling
Convex Optimization
Incorporation of Margin Concept
  Motivation
  Support Vector Machines (Hinge Loss)
  Smooth Approximation to SVM: Margin-MMI
  Support Vector Machines (Margin Error)
  Smooth Approximation to SVM: Margin-MPE
  Experimental Evaluation of Margin
Conclusions
Annex
Motivation
Goal: incorporation of a margin term into conventional training criteria
I replace likelihoods p(W)p(X|W) with margin-likelihoods p(W)p(X|W) exp(−ρA(W, W_r))
I A(W, W_r): accuracy between hypothesis W and reference W_r
I interpretation (boosting): emphasize incorrect hypotheses by up-weighting
I interpretation (large margin): next slides

[diagram: margin-based training in the literature vs. LVCSR — low complexity tasks, different loss functions, convergence/local optima, different parameterization/optimization]

Margin in training is promising. Individual contribution of margin in LVCSR training?
Support Vector Machines (Hinge Loss)

Optimization problem for SVMs:

SVM(λ) = −(C/2) ‖λ‖² − Σ_{r=1}^R l(W_r, d_r; ρ)

I feature functions f(X, W), model parameters λ
I distance d_rW := λ^T ( f(X_r, W_r) − f(X_r, W) )
I hinge loss function:

  l^(hinge)(W_r, d_r; ρ) := max_{W≠W_r} { max { −d_rW + ρ(A(W_r, W_r) − A(W, W_r)), 0 } }

I ℓ2-regularization with constant C > 0
I [Altun+ 2003, Taskar+ 2003]
Smooth Approximation to SVM: Margin-MMI

Margin-based/modified MMI (M-MMI):

F_M-MMI,γ(λ) = −(C/2) ‖λ‖² + Σ_{r=1}^R (1/γ) log( exp(γ(λ^T f(X_r, W_r) − ρA(W_r, W_r))) / Σ_W exp(γ(λ^T f(X_r, W) − ρA(W, W_r))) )

Lemma: F_M-MMI,γ → SVM_hinge for γ → ∞ (pointwise convergence).
I [Heigold+ 2008b]
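The lemma can be illustrated numerically on a single utterance with a handful of hypotheses (scores and accuracies below are made up): as γ grows, the smoothed per-utterance loss term −(1/γ) log(·) approaches the hinge loss:

```python
import math

score = {"ref": 2.0, "h1": 1.5, "h2": 0.5}  # lambda^T f(X, W), illustrative
acc = {"ref": 3.0, "h1": 2.0, "h2": 1.0}    # A(W, W_ref), illustrative
rho = 1.0

def mmi_loss(gamma):
    # -(1/gamma) * log( exp(gamma * a_ref) / sum_W exp(gamma * a_W) )
    a = {W: score[W] - rho * acc[W] for W in score}
    log_den = max(a.values()) * gamma       # stable log-sum-exp
    log_den += math.log(sum(math.exp(gamma * a[W] - log_den) for W in a))
    return -(gamma * a["ref"] - log_den) / gamma

def hinge():
    # max over competitors of max{ -d_rW + rho * (A_ref - A_W), 0 }
    losses = [-(score["ref"] - score[W]) + rho * (acc["ref"] - acc[W])
              for W in score if W != "ref"]
    return max(max(losses), 0.0)

# the gap to the hinge loss shrinks as gamma grows
assert abs(mmi_loss(10.0) - hinge()) > abs(mmi_loss(1000.0) - hinge())
assert abs(mmi_loss(1000.0) - hinge()) < 1e-2
```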
Proof

With ∆A(W, W_r) := A(W_r, W_r) − A(W, W_r):

−(1/γ) log( exp(γ(λ^T f(X_r, W_r) − ρA(W_r, W_r))) / Σ_W exp(γ(λ^T f(X_r, W) − ρA(W, W_r))) )

  = (1/γ) log( 1 + Σ_{W≠W_r} exp(γ(−d_rW + ρ∆A(W, W_r))) )

  → (γ→∞)  max_{W≠W_r} { −d_rW + ρ∆A(W, W_r) }   if ∃ W ≠ W_r: d_rW < ρ∆A(W, W_r)
            0                                     otherwise

  = max_{W≠W_r} { max{ −d_rW + ρ∆A(W, W_r), 0 } }

  =: l^(hinge)(W_r, d_r; ρ).
Support Vector Machines (Margin Error)

Optimization problem for SVMs:

SVM(λ) = −(C/2) ‖λ‖² − Σ_{r=1}^R l(W_r, d_r; ρ)

I feature functions f(X, W), model parameters λ
I distance d_rW := λ^T ( f(X_r, W_r) − f(X_r, W) )
I margin error loss function:

  l^(error)(W_r, d_r; ρ) := E( A( argmin_W [ d_rW + ρA(W, W_r) ], W_r ) )

I ℓ2-regularization with constant C > 0
I [Heigold+ 2008b]
Smooth Approximation to SVM: Margin-MPE

Margin-based/modified MPE (M-MPE):

F_M-MPE,γ(λ) = −(C/2) ‖λ‖² + Σ_{r=1}^R Σ_W E(W, W_r) · ( exp(γ(λ^T f(X_r, W) − ρA(W, W_r))) / Σ_V exp(γ(λ^T f(X_r, V) − ρA(V, W_r))) )

Lemma: F_M-MPE,γ → SVM_error for γ → ∞.
I [Heigold+ 2008b]
Experimental Evaluation of Margin

Digit strings (SieTill, German, telephone)

dns/state   feature orders          # param   criterion   WER [%]
1           first                   11k       ML          3.8
                                              MMI         2.9
                                              M-MMI       2.7
64          first                   690k      ML          1.8
                                              MMI         1.8
                                              M-MMI       1.6
1           first, second,          1,409k    Frame       1.8
            and third                         MMI         1.7
                                              M-MMI       1.5
Experimental Evaluation of Margin

European parliament plenary sessions in English (EPPS) and Mandarin broadcasts

                                      WER [%]
            EPPS En   Mandarin BN/BC
Criterion   90h       230h     1500h
ML          12.0      21.9     17.9
MMI                   20.8
M-MMI                 20.6
MPE         11.5      20.6     16.5
M-MPE       11.3      20.3     16.3
Experimental Evaluation of Margin

Handwriting Recognition (IFN/ENIT)
I isolated town names, handwritten
I choose slice features to use 1-D HMMs
I details: see [Dreuw+ 2009]

                       WER [%]
Criterion   abc-d   abd-c   acd-b   bcd-a   abcd-e
ML          7.8     8.7     7.8     8.7     16.8
MMI         7.4     8.2     7.6     8.4     16.4
M-MMI       6.1     6.8     6.1     7.0     15.4
Outline
Introduction
Training Criteria
Parameter Optimization
Efficient Calculation of Discriminative Statistics
Generalisation to Log-Linear Modeling
Convex Optimization
Incorporation of Margin Concept
Conclusions
  Effective Discriminative Training
Annex
Effective Discriminative Training
I Discriminative Criteria
  I fit decision rule: minimize training error
  I limit overfitting: include regularization and margin to exploit remaining degrees of freedom of the parameters
I Optimization Methods
  I general purpose methods give robust estimates
  I in the convex case, gradient descent still faster than growth transform (GIS)
I Log-Linear Modeling
  I convex (w/o hidden variables)
  I covers Gaussians completely, w/o constraints on e.g. variances
  I opens modeling up to arbitrary features
  I initialization: from scratch or from Gaussians
I Estimation of Statistics
  I efficiency: use word lattices to represent competing word sequences
  I implementation: generic approach using WFSTs, covers a class of criteria
Thanks for your attention!
Outline
Introduction
Training Criteria
Parameter Optimization
Efficient Calculation of Discriminative Statistics
Generalisation to Log-Linear Modeling
Convex Optimization
Incorporation of Margin Concept
Conclusions
Annex
  References
  Speech Tasks: Corpus Statistics & Setups
  Handwriting Recognition Tasks
References
Y. Altun, I. Tsochantaridis, and T. Hofmann, “Hidden Markov support vector machines,” in International Conference on Machine Learning (ICML), Washington, DC, USA, 2003.
L. R. Bahl, P. F. Brown, P. V. de Souza, and R. L. Mercer, “Maximum Mutual Information Estimation of Hidden Markov Model Parameters for Speech Recognition,” in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 1, pp. 49–52, Tokyo, Japan, May 1986.
P. F. Brown, The Acoustic-Modeling Problem in Automatic Speech Recognition, Ph.D. thesis, Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, May 1987.
W. Chou, B.-H. Juang, and C.-H. Lee, “Segmental GPD training of HMM based speech recognizer,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 473–476, San Francisco, CA, USA, March 1992.
Y.-L. Chow, “Maximum Mutual Information Estimation of HMM Parameters for Continuous Speech Recognition using the N-best Algorithm,” in Proc. Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), pp. 701–704, Albuquerque, NM, April 1990.
P. Dreuw, G. Heigold, and H. Ney, “Confidence-based discriminative training for model adaptation in offline Arabic handwriting recognition,” in International Conference on Document Analysis and Recognition (ICDAR), Barcelona, Spain, July 2009.
J. Eisner, “Expectation semirings: Flexible EM for finite-state transducers,” in International Workshop on Finite-State Methods and Natural Language Processing (FSMNLP), Helsinki, Finland, August 2001.
P. S. Gopalakrishnan, D. Kanevsky, A. Nadas, and D. Nahamoo, “An Inequality for Rational Functions with Applications to Some Statistical Estimation Problems,” IEEE Transactions on Information Theory, Vol. 37, No. 1, pp. 107–113, January 1991.
A. Gunawardana, M. Mahajan, A. Acero, and J. Platt, “Hidden conditional random fields for phone classification,” in Interspeech, pp. 117–120, Lisbon, Portugal, September 2005.
G. Heigold, W. Macherey, R. Schluter, and H. Ney, “Minimum Exact Word Error Training,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 186–190, San Juan, Puerto Rico, November 2005.
G. Heigold, T. Deselaers, R. Schluter, and H. Ney, “GIS-like Estimation of Log-Linear Models with Hidden Variables,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 4045–4048, Las Vegas, NV, USA, April 2008.
G. Heigold, T. Deselaers, R. Schluter, and H. Ney, “Modified MMI/MPE: A direct evaluation of the margin in speech recognition,” in International Conference on Machine Learning (ICML), pp. 384–391, Helsinki, Finland, July 2008.
G. Heigold, A Log-Linear Modeling Framework for Speech Recognition, Doctoral Thesis to be submitted, RWTH Aachen University, Aachen, Germany, 2010.
B. Hoffmeister, T. Klein, R. Schluter, and H. Ney, “Frame Based System Combination and a Comparison with Weighted ROVER and CNC,” in Proc. Interspeech, pp. 537–540, Pittsburgh, PA, USA, September 2006.
B.-H. Juang and S. Katagiri, “Discriminative learning for minimum error classification,” IEEE Transactions on Signal Processing, Vol. 40, No. 12, pp. 3043–3054, 1992.
D. Kanevsky, “Extended Baum-Welch transformations for general functions,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 821–824, Montreal, Quebec, Canada, May 2004.
W. Macherey, L. Haferkamp, R. Schluter, and H. Ney, “Investigations on Error Minimizing Training Criteria for Discriminative Training in Automatic Speech Recognition,” in Proc. European Conference on Speech Communication and Technology (Interspeech), Lisbon, Portugal, September 2005.
W. Macherey, Discriminative Training and Acoustic Modeling for Automatic Speech Recognition, Ph.D. thesis to be submitted, RWTH Aachen University, 2010.
M. Mahajan, A. Gunawardana, and A. Acero, “Training algorithms for hidden conditional random fields,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Toulouse, France, May 2006.
L. Mangu, E. Brill, and A. Stolcke, “Finding Consensus Among Words: Lattice-Based Word Error Minimization,” in Proc. European Conference on Speech Communication and Technology (EUROSPEECH), pp. 495–498, Budapest, Hungary, September 1999.
E. McDermott and S. Katagiri, “Minimum Classification Error for Large Scale Speech Recognition Tasks using Weighted Finite State Transducers,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Philadelphia, PA, USA, April 2005.
E. McDermott, T. Hazen, J. Le Roux, A. Nakamura, and S. Katagiri, “Discriminative training for large vocabulary speech recognition using Minimum Classification Error,” IEEE Transactions on Audio, Speech, and Language Processing, Vol. 15, No. 1, pp. 203–223, 2007.
Y. Normandin, Hidden Markov Models, Maximum Mutual Information, and the Speech Recognition Problem, Ph.D. thesis, McGill University, Montreal, Canada, 1991.
D. Povey and P. C. Woodland, “Minimum phone error and I-smoothing for improved discriminative training,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 1, pp. 105–108, Orlando, FL, May 2002.
L. R. Rabiner, “A tutorial on hidden Markov models and selected applications in speech recognition,” Proceedings of the IEEE, Vol. 77, No. 2, pp. 257–286, 1989.
M. Riedmiller and H. Braun, “A direct adaptive method for faster backpropagation learning: The RPROP algorithm,” in IEEE International Conference on Neural Networks (ICNN), pp. 586–591, San Francisco, CA, USA, 1993.
L. Saul and D. Lee, “Multiplicative updates for classification by mixture models,” in T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems (NIPS), MIT Press, 2002.
R. Schluter, B. Muller, F. Wessel, and H. Ney, “Interdependence of Language Models and Discriminative Training,” in Proc. IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Vol. 1, pp. 119–122, Keystone, CO, December 1999.
R. Schluter, Investigations on Discriminative Training Criteria, Doctoral Thesis, RWTH Aachen University, Aachen, Germany, September 2000.
R. Schluter and H. Ney, “Model-based MCE Bound to the True Bayes’ Error,” IEEE Signal Processing Letters, Vol. 8, No. 5, pp. 131–133, May 2001.
B. Taskar, C. Guestrin, and D. Koller, “Max-margin Markov networks,” in Advances in Neural Information Processing Systems (NIPS), 2003.
V. Valtchev, Discriminative Methods in HMM-based Speech Recognition, Ph.D. thesis, St. John’s College, University of Cambridge, Cambridge, March 1995.
V. Valtchev, J. J. Odell, P. C. Woodland, and S. J. Young, “MMIE Training of Large Vocabulary Recognition Systems,” Speech Communication, Vol. 22, No. 4, pp. 303–314, September 1997.
F. Wessel, K. Macherey, and R. Schluter, “Using word probabilities as confidence measures,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 225–228, Seattle, WA, USA, May 1998.
F. Wessel, R. Schluter, and H. Ney, “Explicit Word Error Minimization using Word Hypothesis Posterior Probabilities,” in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), pp. 33–36, Salt Lake City, UT, May 2001.
S. Wiesler et al., “Investigations on features for log-linear acoustic models in continuous speech recognition,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), Merano, Italy, December 2009.
P. C. Woodland and D. Povey, “Large scale discriminative training for speech recognition,” in Automatic Speech Recognition (ASR) 2000, pp. 7–16, Paris, France, September 2000.
P. C. Woodland and D. Povey, “Large Scale Discriminative Training of Hidden Markov Models for Speech Recognition,” Computer Speech and Language, Vol. 16, No. 1, pp. 25–48, 2002.
Speech Tasks: Corpus Statistics & Setups
Identifier       Description                     Train/Test [h]
SieTill          German digit strings            11/11 (Test)
EPPS En          English European Parliament     92/2.9 (Evl07)
                 plenary speech
BNBC Cn 230h     Mandarin broadcasts             230/2.2 (Evl06)
BNBC Cn 1500h    Mandarin broadcasts             1,500/2.2 (Evl06)

Identifier       #States/#Dns    Features
SieTill          430/27k         25 LDA(MFCC)
EPPS En          4,500/830k      45 LDA(MFCC+voicing)+VTLN+SAT/CMLLR
BNBC Cn 230h     4,500/1,100k    45 LDA(MFCC)+3 tones+VTLN+SAT/CMLLR
BNBC Cn 1500h    4,500/1,200k    45 SAT/CMLLR(PLP+voicing+3 tones+32 NN)+VTLN
Handwriting Recognition Tasks
IFN/ENIT:
I isolated Tunisian town names
I 4 training folds + 1 additional fold for testing
I simple appearance-based image slice features
I each fold comprises approximately 500,000 frames
                       #Observations [k]
Corpus       Fold      Towns    Frames
IFN/ENIT     a         6.5      452
             b         6.7      459
             c         6.5      452
             d         6.7      451
             e         6.0      404
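The appearance-based image slice features mentioned above can be pictured as a sliding window over the pixel columns of the word image: each horizontal position yields one frame, namely the stacked raw intensities of the columns in the window. The sketch below illustrates only the idea; the window width and the plain-list representation are illustrative assumptions, not the exact IFN/ENIT feature setup.

```python
def slice_features(image, window=1):
    """Turn a grayscale image (list of pixel rows) into a sequence of
    frame vectors, one per horizontal position: each frame stacks the
    pixel columns covered by a sliding window of the given width."""
    height = len(image)
    width = len(image[0])
    frames = []
    for x in range(width - window + 1):
        # concatenate the columns x .. x+window-1, top to bottom
        frame = [image[y][x + dx] for dx in range(window) for y in range(height)]
        frames.append(frame)
    return frames

# toy 3x4 "image": with window=1, one frame per pixel column
img = [[0, 1, 2, 3],
       [4, 5, 6, 7],
       [8, 9, 10, 11]]
print(slice_features(img))  # [[0, 4, 8], [1, 5, 9], [2, 6, 10], [3, 7, 11]]
```

In this way a two-dimensional image becomes a one-dimensional observation sequence, so the same HMM-based discriminative training machinery as in speech recognition applies directly.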