
Conditional Random Fields and Direct Decoding for Speech and Language Processing

Transcript


Conditional Random Fields and Direct Decoding for Speech and Language Processing
Eric Fosler-Lussier, The Ohio State University
Geoff Zweig, Microsoft

What we will cover
This tutorial introduces the basics of direct probabilistic models:
What is a direct model, and how does it relate to speech and language processing?
How do I train a direct model?
How have direct models been used in speech and language processing?

Overview
Part 1: Background and Taxonomy
Generative vs. direct models
Descriptions of models for classification and sequence recognition (observed and hidden)
Break
Part 2: Algorithms & Case Studies
Training/decoding algorithms
CRF study using phonological features for ASR
Segmental CRF study for ASR
NLP case studies (if time)

Part 1: Background and Taxonomy

A first thought experiment
You're observing a limousine: is a diplomat inside?
You can observe:
Whether the car has flashing lights
Whether the car has flags

The Diplomat problem
We have two observed Boolean variables: Lights and Flag.
We want to predict whether the car contains a diplomat.

A generative approach: Naïve Bayes
Generative approaches model observations as being generated by the underlying class:
Limos carrying diplomats have flags 50% of the time
Limos carrying diplomats have flashing lights 70% of the time
Limos not carrying diplomats: flags 5%, lights 30%
Naïve Bayes: compute the posterior by Bayes' rule, and then assume the observations are conditionally independent given the class.
P(Lights, Flag) is a normalizing term; we can replace it with a normalization constant Z.
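The slide's equation is an image that is not preserved in this transcript; the standard Naïve Bayes form it refers to is
\[
P(\mathrm{Dmat} \mid \mathrm{Lights}, \mathrm{Flag})
= \frac{P(\mathrm{Lights}, \mathrm{Flag} \mid \mathrm{Dmat})\, P(\mathrm{Dmat})}{P(\mathrm{Lights}, \mathrm{Flag})}
\approx \frac{1}{Z}\, P(\mathrm{Lights} \mid \mathrm{Dmat})\, P(\mathrm{Flag} \mid \mathrm{Dmat})\, P(\mathrm{Dmat}),
\]
where the approximation is the conditional independence assumption and Z = P(Lights, Flag).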

Graphical model for Naïve Bayes
(Figure: the Diplomat node with arrows out to Lights and Flag; parameters P(Dmat), P(Lights|Dmat), P(Flag|Dmat). Lights and Flag are conditionally independent given Diplomat.)

Correlated evidence in Naïve Bayes
Conditional independence says that given a value of Diplomat, Lights and Flag are independent.
Consider the case where the lights are always flashing when the car has flags:
The evidence gets double counted, and Naïve Bayes becomes overconfident.
This may or may not matter in practice; it is problem dependent.
(HMMs have similar assumptions: observations are independent given the HMM state sequence.)

Reversing the arrows: direct modeling
P(Diplomat | Lights, Flag) can be modeled directly.
We compute the posterior distribution directly, without Bayes' rule.
This can handle interactions between the Lights and Flag evidence.
P(Lights) and P(Flag) do not need to be modeled.
(Figure: the arrows now point from Lights and Flag into Diplomat; the model is P(Dmat | Lights, Flag).)

Direct vs. discriminative
Isn't this just discriminative training? (No.)
Direct model: directly predict the posterior of the hidden variable.
Discriminative training: adjust model parameters to separate classes, improve the posterior, minimize classification error, etc.
Generative models can be trained discriminatively; direct models inherently try to discriminate between classes.
(Figure: under discriminative training the generative model's parameters change to discriminate Diplomat better, while the direct model performs direct discriminative optimization.)

Pros and cons of direct modeling
Pro:
Often allows modeling of interacting data features.
Can require fewer parameters because there is no observation model; observations are usually treated as fixed and don't require a probabilistic model.
Con:
Typically slower to train; most training criteria have no closed-form solutions.

A simple direct model: Maximum Entropy
Our direct example didn't assume a particular form for the probability P(Dmat | Lights, Flag).
A maximum entropy model uses a log-linear combination of weighted features in the probability model.
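The maximum entropy equation on the slide is an image that is not preserved here; the standard form it annotates ("learned weight", "feature of the data for class j") is
\[
P(y = j \mid x) = \frac{\exp\big(\sum_i \lambda_{i,j}\, f_{i,j}(x)\big)}{\sum_{j'} \exp\big(\sum_i \lambda_{i,j'}\, f_{i,j'}(x)\big)}.
\]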

In the equation, λ_{i,j} is a learned weight and f_{i,j} is a feature of the data for class j.
The denominator of the equation is again a normalization term (we replace it with Z).
Question: what are the f_{i,j}, and how do they correspond to our problem?

Diplomat Maximum Entropy
Here are two features (f_{i,j}) that we can use:
f_{0,True} = 1 if the car has a diplomat and has a flag
f_{1,False} = 1 if the car has no diplomat but has flashing lights
(We could have complementary features as well, but they are left out for simplicity.)
Example dataset with the following statistics:
Diplomats occur in 50% of the cars in the dataset
P(Flag=true | Diplomat=true) = 0.9
P(Flag=true | Diplomat=false) = 0.2
P(Lights=true | Diplomat=false) = 0.7
P(Lights=true | Diplomat=true) = 0.5

Diplomat Maximum Entropy
The MaxEnt formulation using these two features is:
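(The formula itself is an image lost in the transcript; with the two features above plus per-label bias terms, a standard reconstruction is)
\[
P(D = d \mid \mathrm{Flag}, \mathrm{Lights}) =
\frac{\exp\big(\lambda_d + \lambda_0 f_{0,T}(d, \mathrm{Flag}) + \lambda_1 f_{1,F}(d, \mathrm{Lights})\big)}
{\sum_{d' \in \{T, F\}} \exp\big(\lambda_{d'} + \lambda_0 f_{0,T}(d', \mathrm{Flag}) + \lambda_1 f_{1,F}(d', \mathrm{Lights})\big)}.
\]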

where λ_true and λ_false are bias terms that adjust for the frequency of the labels.

Fix the bias terms to both be 1. What happens to the probability of Diplomat on the dataset as the other lambdas vary?

f_{0,T} = 1 if the car has a diplomat and has a flag
f_{1,F} = 1 if the car has no diplomat but has flashing lights
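As a concrete illustration (a minimal sketch, not code from the tutorial; the function and variable names are invented for this example), the two-feature model can be evaluated directly:

import math

# Two-feature Diplomat MaxEnt model: f_{0,T} fires when (diplomat, flag),
# f_{1,F} fires when (no diplomat, flashing lights); one bias term per label.
def unnorm_score(d, flag, lights, lam_bias, lam0, lam1):
    score = lam_bias[d]
    if d and flag:            # f_{0,T}
        score += lam0
    if (not d) and lights:    # f_{1,F}
        score += lam1
    return math.exp(score)

def p_diplomat(flag, lights, lam_bias, lam0, lam1):
    num = unnorm_score(True, flag, lights, lam_bias, lam0, lam1)
    z = num + unnorm_score(False, flag, lights, lam_bias, lam0, lam1)
    return num / z            # P(Diplomat = True | Flag, Lights)

# Fix both bias terms to 1 (as on the slide) and vary the other lambdas.
lam_bias = {True: 1.0, False: 1.0}
for lam0, lam1 in [(0.0, 0.0), (1.0, 1.0), (2.0, 0.5)]:
    p = p_diplomat(True, False, lam_bias, lam0, lam1)
    print(f"lam0={lam0}, lam1={lam1}: P(Dmat=T | Flag=T, Lights=F) = {p:.3f}")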

(Figure: log probability of Diplomat over the dataset as the two MaxEnt lambdas vary.)

Finding optimal lambdas
Good news: the conditional probability of the dataset is convex for MaxEnt.
Bad news: as the number of features grows, finding the maximum in so many dimensions can be slow.
Various gradient search or optimization techniques can be used (coming later).
(Figure: the same picture in 3-D, the conditional probability of the dataset.)

MaxEnt-style models in practice
Several examples of MaxEnt models in speech and language processing:
Whole-sentence language models (Rosenfeld, Chen & Zhu, 2001): predict the probability of a whole sentence given correlated features (word n-grams, class n-grams, ...). Good for rescoring hypotheses in speech, MT, etc.
Multi-layer perceptrons: an MLP can really be thought of as a MaxEnt model with automatically learned feature functions. The MLP gives a local posterior classification of each frame; sequence recognition is done through hybrid or Tandem MLP-HMM systems. A softmax-trained single-layer perceptron is equivalent to a MaxEnt model.
Flat Direct Models for ASR (Heigold et al., 2009): choose a complete hypothesis from a list (rather than building a sequence of words). The hypothesis doesn't have to match exact words ("auto rental" = "rent-a-car"). Good for large-scale list-choice tasks, e.g. voice search. What do the features look like?

Flat Direct Model features: decomposable features
Decompose features as F(W,X) = φ(W)ψ(X).
φ(W) is a feature of the words, e.g. "The last word ends in s", "The word Restaurant is present".
ψ(X) is a feature of the acoustics, e.g. "The distance to the Restaurant template is greater than 100", "The HMM for Washington is among the 10 likeliest".
φ(W)ψ(X) is the conjunction; it measures consistency, e.g. "The hypothesis ends in s and my s-at-the-end acoustic detector has fired".

Generalization
People normally think of Maximum Entropy as classification among a predefined set.
But F(W,X) = φ(W)ψ(X) essentially measures consistency between W and X.
These features are defined for arbitrary W. For example, "Restaurants is present and my s-at-the-end detector has fired" can be true for either "Mexican Restaurants" or "Italian Restaurants".

Direct sequence modeling
In speech and language processing, we usually want to operate over sequences, not single classifications.
Consider a common generative sequence model, the Hidden Markov Model, relating states (S) to observations (O).
(Figure: an HMM over O1..O3 and S1..S3 with parameters P(Oi|Si) and P(Si|Si-1).)
What happens if we change the direction of the arrows of an HMM? We get a direct model of P(S|O).
(Figure: the same chain with the arrows reversed, giving local distributions P(Si|Si-1,Oi).)

MEMMs
If a log-linear term is used for P(Si|Si-1,Oi), then this is a Maximum Entropy Markov Model (MEMM) (Ratnaparkhi 1996; McCallum, Freitag & Pereira 2000).
Like MaxEnt, we take features of the observations and learn a weighted model.
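The MEMM equations on these slides are images that do not survive; the factorization they describe is typically written
\[
P(S \mid O) = \prod_i P(s_i \mid s_{i-1}, o_i),
\qquad
P(s_i \mid s_{i-1}, o_i) = \frac{\exp\big(\sum_k \lambda_k f_k(s_i, s_{i-1}, o_i)\big)}{Z(s_{i-1}, o_i)}.
\]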

MEMMs (continued)
Unlike HMMs, transitions between states can now depend on the acoustics in MEMMs.
However, unlike an HMM, an MEMM can ignore its observations:
If P(Si=x | Si-1=y) = 1, then P(Si=x | Si-1=y, Oi) = 1 for all Oi (the label bias problem).
Is this a problem in practice?

MEMMs in language processing
One prominent example in part-of-speech tagging is the Ratnaparkhi MaxEnt tagger (1996): it produces POS tags based on word-history features. It is really an MEMM because it includes the previously assigned tags as part of its history.
Kuo and Gao (2003-6) developed Maximum Entropy Direct Models for ASR: again an MEMM, this time over speech frames. Features: what are the IDs of the closest Gaussians to this point?

Joint sequence models
Label bias problem: previous decisions may restrict the influence of future observations, making it harder for the system to know that it was following a bad path.
Idea: what if we had one big maximum entropy model in which we compute the joint probability of the hidden variables given the observations?
Many-diplomat problem: P(Dmat_1...Dmat_N | Flag_1...Flag_N, Lights_1...Lights_N).
Problem: the state space is exponential in the sequence length; for the Diplomat problem it is O(2^N).

Factorization of joint sequences
What we want is a factorization that will decrease the size of the state space.
Define a Markov graph to describe the factorization: a Markov Random Field (MRF).
Neighbors in the graph contribute to the probability distribution.
More formally: the probability distribution is factored by the cliques in the graph.

Markov Random Fields (MRFs)
MRFs are undirected (joint) graphical models; the cliques define the probability distribution, and the configuration size of each clique is the effective state space.
Consider a series of 5 diplomats (D1..D5):
One 5-clique (fully connected): the effective state space is 2^5 (MaxEnt).
Three 3-cliques (1-2-3, 2-3-4, 3-4-5): the effective state space is 2^3.
Four 2-cliques (1-2, 2-3, 3-4, 4-5): the effective state space is 2^2.

Hammersley-Clifford Theorem
The Hammersley-Clifford theorem relates MRFs to Gibbs probability distributions:
If you can express the probability of a graph configuration as a product of potentials on the cliques (a Gibbs distribution), then the graph is an MRF.
The potentials, however, must be positive; this holds if φ(c) = exp(Σ_i λ_i f_i(c)) (log-linear form).

Conditional Random Fields (CRFs)
When the MRF is conditioned on observations, it is known as a Conditional Random Field (CRF) (Lafferty, McCallum & Pereira, 2001).
Assuming a log-linear form (true of almost all CRFs), the probability is determined by weighted functions (fi) of each clique (c) and the observations (O).
For general graphs, computing this quantity is #P-hard, requiring approximate inference.
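The CRF equation on the slide is an image lost in the transcript; the log-linear form it refers to is
\[
P(S \mid O) = \frac{1}{Z(O)} \exp\Big(\sum_{c} \sum_{i} \lambda_i\, f_i(c, O)\Big),
\]
where c ranges over the cliques of the graph and Z(O) sums the exponential over all label configurations.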

However, for special graphs the complexity is lower; for example, linear-chain CRFs have polynomial-time algorithms.

Log-linear linear-chain CRFs
Linear-chain CRFs have a first-order Markov backbone.
Feature templates for an HMM-like CRF structure for the Diplomat problem:

fBias(Di=x, i) is 1 iff Di=x
fTrans(Di=x, Di+1=y, i) is 1 iff Di=x and Di+1=y
fFlag(Di=x, Flagi=y, i) is 1 iff Di=x and Flagi=y
fLights(Di=x, Lightsi=y, i) is 1 iff Di=x and Lightsi=y

With a bit of subscript liberty, the equation is
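(The equation image is not preserved; a reconstruction with the four feature templates above, each template in practice carrying one weight per value configuration, is)
\[
P(D \mid \mathrm{Flag}, \mathrm{Lights}) = \frac{1}{Z(\mathrm{Flag}, \mathrm{Lights})}
\exp\Big(\sum_i \big[\lambda_{\mathrm{Bias}} f_{\mathrm{Bias}} + \lambda_{\mathrm{Trans}} f_{\mathrm{Trans}} + \lambda_{\mathrm{Flag}} f_{\mathrm{Flag}} + \lambda_{\mathrm{Lights}} f_{\mathrm{Lights}}\big](D_i, D_{i+1}, \mathrm{Flag}_i, \mathrm{Lights}_i, i)\Big).
\]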

(Figure: the linear chain over D1..D5.)

Log-linear linear-chain CRFs (continued)
In the previous example, the transitions did not depend on the observations (HMM-like).
In general, transitions may depend on the observations (MEMM-like).

The general form of a linear-chain CRF groups features as state features (bias, flag, lights) or transition features.
Let s range over state features and t over transition features; i indexes into the sequence to pick out the relevant observations.
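The general form the slide refers to (reconstructed here, since the equation image is missing) is
\[
P(D \mid O) = \frac{1}{Z(O)} \exp\Big(\sum_i \Big[\sum_{s} \lambda_s\, s(D_i, O, i) + \sum_{t} \mu_t\, t(D_{i-1}, D_i, O, i)\Big]\Big).
\]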

A quick note on features for ASR
Both MEMMs and CRFs require the definition of feature functions.
This is somewhat obvious in NLP (word identity, POS tag, parse structure).
In ASR, we need some sort of symbolic representation of the acoustics:
Which are the closest Gaussians (Kuo & Gao; Hifny & Renals)
Sufficient statistics (Layton & Gales; Gunawardana et al.); with sufficient statistics, a CRF can exactly replicate a single-Gaussian HMM, or an HCRF a mixture of Gaussians (next!)
Other classifiers, e.g. MLPs (Morris & Fosler-Lussier)
Phoneme/multi-phone detections (Zweig & Nguyen)

Sequencing: hidden structure (1)
(Example: "the/DET dog/N ran/V".)
So far there has been a 1-to-1 correspondence between labels and observations, and it has been fully observed in training.

Sequencing: hidden structure (2)
But this is often not the case for speech recognition. Suppose we have training data like this: the transcript "The Dog" paired with the audio (a spectral representation).

Sequencing: hidden structure (3)
Is "The dog" segmented like this: DH IY IY D AH AH G?
Or like this: DH DH IY D AH AH G?
Or maybe like this: DH DH IY D AH G G?
=> An added layer of complexity.

This can apply in NLP as well
(Figure: the utterance "Hey John Deb Abrams calling how are you" shown with two different caller/callee segmentations.)
How should this be segmented?
Note that a segment-level feature indicating that "Deb Abrams" is a good name would be useful.

Approaches to hidden structure
Hidden CRFs (HCRFs): Gunawardana et al., 2005
Semi-Markov CRFs: Sarawagi & Cohen, 2005
Conditional Augmented Models: Layton, 2006 thesis (Lattice C-Aug chapter); Zhang, Ragni & Gales, 2010
Segmental CRFs: Zweig & Nguyen, 2009

These approaches differ in:
Where the Markov assumption is applied
What labels are available at training
Convexity of the objective function
Definition of features

Approaches to hidden structure (summary)
Method / Markov assumption / Segmentation known in training / Features prescribed:
HCRF: frame level / No / No
Semi-Markov CRF: segment / Yes / No
Conditional Augmented Models: segment / No / Yes
Segmental CRF: segment / No / No

One view of structure
(Figure: a frame-level lattice over example phones.)
Consider all segmentations consistent with the transcription / hypothesis.
Apply the Markov assumption at the frame level to simplify the recursions.
Appropriate for frame-level features.

Another view of structure
(Figure: a segment-level lattice over example phones spanning observations o1..on.)
Consider all segmentations consistent with the transcription / hypothesis.
Apply the Markov assumption at the segment level only: semi-Markov.
This means long-span segmental features can be used.

Examples of segment-level features in ASR
Formant trajectories
Duration models
Syllable / phoneme counts
Min/max energy excursions
Existence, expectation & Levenshtein features (described later)

Examples of segment-level features in NLP
Segment includes a name
POS pattern within the segment is DET ADJ N
Number of capitalized words in the segment
Segment is labeled Name and has 2 words
Segment is labeled Name and has 4 words
Segment is labeled Phone Number and has 7 words
Segment is labeled Phone Number and has 8 words

Is segmental analysis any different?
We are conditioning on all the observations: do we really need to hypothesize segment boundaries?
YES, many features are undefined otherwise:
Duration (of what?)
Syllable/phoneme count (count where?)
Difference in C0 between the start and end of a word
Key example: Conditional Augmented Statistical Models.

Conditional Augmented Statistical Models
Layton & Gales, "Augmented Statistical Models for Speech Recognition," ICASSP 2006.
As features, use:
The likelihood of the segment with respect to an HMM model
The derivative of that likelihood with respect to each HMM model parameter
The frame-wise conditional independence assumptions of the HMM are no longer present.
Defined only at the segment level.

Now for some details
We will examine the general segmental case, then relate the specific approaches.

Segmental notation & fine print
We will consider feature functions that cover both transitions and observations, so a more accurate representation actually has diagonal edges, but we'll generally omit them for simpler pictures.
Look at a segmentation q in terms of its edges e:
sl(e) is the label associated with the left state on an edge
sr(e) is the label associated with the right state on an edge
o(e) is the span of observations associated with an edge

The segmental equations
(Figure: an edge e with left label sl(e), right label sr(e), and observation span o(e) = o3..o4.)
We must sum over all possible segmentations of the observations consistent with a hypothesized state sequence.
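The segmental equation itself is an image lost in the transcript; the form used in the segmental CRF literature (e.g. Zweig & Nguyen, 2009), summing over segmentations q whose number of edges matches the hypothesis length, is
\[
P(S \mid O) =
\frac{\sum_{q:\,|q| = |S|} \exp\big(\sum_{e \in q} \sum_k \lambda_k f_k(s_l(e), s_r(e), o(e))\big)}
{\sum_{S'} \sum_{q:\,|q| = |S'|} \exp\big(\sum_{e \in q} \sum_k \lambda_k f_k(s'_l(e), s'_r(e), o(e))\big)}.
\]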

Conditional Augmented Model (lattice version) in this view
The features are precisely defined:
The HMM model likelihood
Derivatives of the HMM model likelihood with respect to the HMM parameters

HCRF in this view
Feature functions are decomposable at the frame level, which leads to simpler computations.

Semi-Markov CRF in this view
A fixed segmentation is known at training, so optimization of the parameters becomes convex.

Structure summary
Sometimes only high-level information is available, e.g. the words someone said (training) or the words we think someone said (decoding).
Then we must consider all the segmentations of the observations consistent with this.
HCRFs do this using a frame-level Markov assumption.
Semi-CRFs / segmental CRFs do not assume independence between frames.
Downside: the computations are more complex. Upside: segment-level features can be used.
Conditional Augmented Models prescribe a set of HMM-based features.

Break

Part 2: Algorithms

Key tasks
Compute the optimal label sequence (decoding)

Compute the likelihood of a label sequence
Compute the optimal parameters (training)

Key cases
Viterbi assumption / Hidden structure / Model:
NA / NA / Log-linear classification
Frame-level / No / CRF
Frame-level / Yes / HCRF
Segment-level / Yes (decode only) / Semi-Markov CRF
Segment-level / Yes (train & decode) / C-Aug, Segmental CRF

Decoding
The simplest of the algorithms: straightforward dynamic-programming recursions.
(The same table indicates the cases we will go over.)

Flat log-linear model

Simply enumerate the possibilities and pick the best.

A chain-structured CRF
Since s is a sequence, there may be too many label sequences to enumerate.

Chain-structured recursions
δ(m,q) is the best label-sequence score that ends at position m with label q.
Recursively compute the δs, and keep track of the best q decisions to recover the sequence.
The best way of getting here is the best way of getting here somehow, then making the transition and accounting for the observation.
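A minimal Python sketch of this recursion (illustrative only, not the tutorial's code; the score arrays stand in for the weighted feature sums):

import numpy as np

# state_scores[m, q] ~ sum_k lambda_k * f_k(q, o, m)   (state features at position m)
# trans_scores[p, q] ~ sum_k lambda_k * f_k(p, q)      (transition features)
def viterbi(state_scores, trans_scores):
    T, Q = state_scores.shape
    delta = np.full((T, Q), -np.inf)    # best partial-path score ending in (m, q)
    back = np.zeros((T, Q), dtype=int)  # backpointers to recover the best sequence
    delta[0] = state_scores[0]
    for m in range(1, T):
        for q in range(Q):
            cand = delta[m - 1] + trans_scores[:, q] + state_scores[m, q]
            back[m, q] = int(np.argmax(cand))
            delta[m, q] = cand[back[m, q]]
    path = [int(np.argmax(delta[-1]))]  # trace back the best label sequence
    for m in range(T - 1, 0, -1):
        path.append(int(back[m, path[-1]]))
    return list(reversed(path))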

Segmental/Semi-Markov CRF recursions
(Figure: a segmental lattice over o1..on; an edge spans observations om-d..om, with labels sl(e) and sr(e).)
δ(m,y) is the best label-sequence score that ends at observation m with state label y.
Recursively compute the δs, and keep track of the best q and d decisions (label and duration) to recover the sequence.

Computing the likelihood of a state sequence
Viterbi assumption / Hidden structure / Model:
NA / NA / Flat log-linear
Frame-level / No / CRF
Frame-level / Yes / HCRF
Segment-level / Yes (decode only) / Semi-Markov CRF
Segment-level / Yes (train & decode) / C-Aug, Segmental CRF
(The same table indicates the cases we will go over.)

Flat log-linear model

Enumerate the possibilities and sum; plug in the hypothesis.

A chain-structured CRF
For a single hypothesis s, we can plug in and compute the numerator directly, but we need a clever way of summing over all hypotheses to get the normalizer Z.

CRF recursions
α(m,q) is the sum of the label-sequence scores that end at position m with label q.
Recursively compute the αs; compute Z and plug in to find P(s|o).
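A minimal sketch of the alpha recursion in log space (illustrative; same score conventions as the Viterbi sketch above):

import numpy as np
from scipy.special import logsumexp

def log_z(state_scores, trans_scores):
    T, Q = state_scores.shape
    log_alpha = state_scores[0].copy()                 # alpha(0, q)
    for m in range(1, T):
        log_alpha = logsumexp(log_alpha[:, None] + trans_scores, axis=0) + state_scores[m]
    return logsumexp(log_alpha)                        # log Z(o)

def log_prob(path, state_scores, trans_scores):
    # log P(s | o) for a particular label sequence s (a list of state indices).
    score = state_scores[0, path[0]]
    for m in range(1, len(path)):
        score += trans_scores[path[m - 1], path[m]] + state_scores[m, path[m]]
    return score - log_z(state_scores, trans_scores)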

Segmental/Semi-Markov CRF
For a segmental CRF the numerator requires a summation too (over segmentations); both the semi-CRF and the segmental CRF require the same denominator sum.

SCRF recursions: denominator
α(m,y) is the sum of the scores of all labelings and segmentations that end at position m with label y.
Recursively compute the αs; compute Z and plug in to find P(s|o). (Here y is a label and m a position.)

SCRF recursions: numerator
The recursion is similar, but with the state sequence fixed.
α*(m,y) will now be the sum of the scores of all segmentations ending in an assignment of observation m to the yth state.
Note that the value of the yth state is given: y is now a positional index into the given state sequence s rather than a state value.

Summary: SCRF probability
Compute the alphas and the numerator-constrained alphas with forward recursions, then do the division.
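In symbols (a reconstruction of the lost slide equation, with n the final observation index and |s| the length of the given state sequence):
\[
P(s \mid o) = \frac{\alpha^*(n, |s|)}{Z(o)}, \qquad Z(o) = \sum_{y} \alpha(n, y).
\]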

Training
Viterbi assumption / Hidden structure / Model:
NA / NA / Log-linear classification
Frame-level / No / CRF
Frame-level / Yes / HCRF
Segment-level / Yes (decode only) / Semi-Markov CRF
Segment-level / Yes (train & decode) / C-Aug, Segmental CRF
We will go over the simplest cases. See also:
Gunawardana et al., Interspeech 2005 (HCRFs)
Mahajan et al., ICASSP 2006 (HCRFs)
Sarawagi & Cohen, NIPS 2005 (semi-Markov)
Zweig & Nguyen, ASRU 2009 (segmental CRFs)

Training
Specialized approaches exploit the form of the MaxEnt model:
Iterative Scaling (Darroch & Ratcliff, 1972): requires f_i(x,y) >= 0 and Σ_i f_i(x,y) = 1
Improved Iterative Scaling (Berger, Della Pietra & Della Pietra, 1996): relies only on non-negativity
General approach: gradient descent
Write down the log-likelihood for one data sample
Differentiate it with respect to the model parameters
Do your favorite form of gradient descent (conjugate gradient, Newton's method, Rprop)
Applicable regardless of convexity

Training with multiple examples
When multiple examples are present, the contributions to the log-probability (and therefore the gradient) are additive.
To minimize notation, we omit the indexing and summation over data samples.

Flat log-linear model
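The derivation on these slides is an image; the standard result it builds up to is
\[
\log P(y \mid x) = \sum_i \lambda_i f_i(x, y) - \log \sum_{y'} \exp\Big(\sum_i \lambda_i f_i(x, y')\Big),
\qquad
\frac{\partial \log P(y \mid x)}{\partial \lambda_i} = f_i(x, y) - \sum_{y'} P(y' \mid x)\, f_i(x, y'),
\]
i.e. the observed feature value minus its expectation under the model.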

This sum over y can be computed by direct enumeration.

A chain-structured CRF
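The corresponding chain-CRF gradient (reconstructed; the slide equation is not preserved) replaces the single expectation with one per position:
\[
\frac{\partial \log P(s \mid o)}{\partial \lambda_k}
= \sum_j f_k(s_{j-1}, s_j, o, j)
- \sum_j \sum_{s'_{j-1}, s'_j} P(s'_{j-1}, s'_j \mid o)\, f_k(s'_{j-1}, s'_j, o, j).
\]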

Chain-structured CRF (cont.)
The second term is similar to the simple log-linear model, but we cannot enumerate s because it is now a sequence, and we must sum over positions j.
The first term is easy to compute.

Forward/backward recursions
α(m,q) is the sum of partial path scores ending at position m with label q (inclusive of observation m).
β(m,q) is the sum of partial path scores starting at position m with label q (exclusive of observation m).

Gradient computation
Compute the alphas, compute the betas, then compute the gradient.
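A minimal sketch of these recursions and the posteriors they yield (illustrative only; the expected feature counts needed for the gradient are accumulated from these posteriors):

import numpy as np
from scipy.special import logsumexp

def marginals(state_scores, trans_scores):
    T, Q = state_scores.shape
    log_a = np.zeros((T, Q))
    log_b = np.zeros((T, Q))
    log_a[0] = state_scores[0]
    for m in range(1, T):           # forward pass, inclusive of observation m
        log_a[m] = logsumexp(log_a[m - 1][:, None] + trans_scores, axis=0) + state_scores[m]
    for m in range(T - 2, -1, -1):  # backward pass, exclusive of observation m
        log_b[m] = logsumexp(trans_scores + state_scores[m + 1] + log_b[m + 1], axis=1)
    log_z = logsumexp(log_a[-1])
    state_post = np.exp(log_a + log_b - log_z)      # P(s_m = q | o)
    pair_post = np.exp(log_a[:-1, :, None] + trans_scores[None]
                       + state_scores[1:, None, :] + log_b[1:, None, :] - log_z)
    return state_post, pair_post                    # P(s_m = p, s_{m+1} = q | o)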

Segmental versions
More complex; see Sarawagi & Cohen (2005) and Zweig & Nguyen (2009).
The same basic process holds: compute the alphas on a forward recursion, compute the betas on a backward recursion, and combine them to compute the gradient.

Once we have the gradient
Any gradient descent technique is possible:
Find a direction in which to move the parameters (some combination of first- and second-derivative information)
Decide how far to move in that direction (fixed or adaptive step size, or a line search)
Update the parameter values and repeat

Conventional wisdom
Limited-memory BFGS often works well:
Liu & Nocedal, Mathematical Programming (45), 1989
Sha & Pereira, HLT-NAACL 2003
Malouf, CoNLL 2002
For HCRFs, stochastic gradient descent and Rprop are as good or better:
Gunawardana et al., Interspeech 2005
Mahajan, Gunawardana & Acero, ICASSP 2006
Rprop is exceptionally simple.

Rprop algorithm
Martin Riedmiller, "Rprop — Description and Implementation Details," Technical Report, January 1994, University of Karlsruhe.
Basic idea:
Maintain a step size for each parameter; this identifies the scale of the parameter.
Look only at whether the gradient says to increase or decrease the parameter; forget about the exact value of the gradient.
If you move in the same direction twice, take a bigger step!
If you flip-flop, take a smaller step!
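A minimal sketch of one Rprop update following the idea above (illustrative; it omits the published variant's finer details):

import numpy as np

def rprop_step(params, grad, prev_grad, step,
               step_up=1.2, step_down=0.5, step_min=1e-6, step_max=1.0):
    same_sign = grad * prev_grad > 0
    flipped = grad * prev_grad < 0
    step = np.where(same_sign, np.minimum(step * step_up, step_max), step)   # bigger step
    step = np.where(flipped, np.maximum(step * step_down, step_min), step)   # smaller step
    # Move each parameter by its own step size, using only the sign of the gradient
    # (ascending the log-likelihood; flip the sign to minimize a loss instead).
    return params + np.sign(grad) * step, step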

Regularization
In machine learning, we often want to simplify models. The objective function can be changed to add a penalty term for complexity; typically this is an L1 or L2 norm of the weight (lambda) vector. L1 leads to sparser models than L2.
For speech processing, some studies have found regularization:
Necessary: L1-regularized ACRFs, Hifny & Renals, Speech Communication 2009
Unnecessary if using weight averaging across time: Morris & Fosler-Lussier, ICASSP 2007

Case studies (1): CRF speech recognition with phonetic features
(Acknowledgements to Jeremy Morris.)

Top-down vs. bottom-up processing
State-of-the-art ASR takes a top-down approach to this problem:
Extract acoustic features from the signal
Model a process that generates these features
Use these models to find the word sequence that best fits the features

(Figure: the word "speech" and its phone sequence / s p iy ch /.)

Bottom-up: detector combination
A bottom-up approach using CRFs:
Look for evidence of speech in the signal (phones, phonological features)
Combine this evidence together in a log-linear model to find the most probable sequence of words in the signal
(Figure: evidence detection — voicing? burst? frication? — followed by evidence combination via CRFs; Morris & Fosler-Lussier, 2006-2010.)

Phone recognition
What evidence do we have to combine?
An MLP ANN trained to estimate frame-level posteriors for phonological features, e.g. P(voicing|X), P(burst|X), P(frication|X)
An MLP ANN trained to estimate frame-level posteriors for phone classes, e.g. P(/ah/|X), P(/t/|X), P(/n/|X)
Use these MLP outputs to build state feature functions.

Phone recognition: pilot task
Pilot task: phone recognition on TIMIT.
ICSI Quicknet MLPs trained on TIMIT were used as inputs to the CRF models, compared to Tandem and a standard PLP HMM baseline model.
Outputs of the ICSI Quicknet MLPs used as inputs:
Phone class attributes (61 outputs)
Phonological feature attributes (44 outputs)
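A minimal sketch (hypothetical helper code, not the authors') of how frame-level MLP posteriors can be exposed as CRF state features, one feature per (label, attribute) pair whose value is the posterior at that frame:

def state_features(mlp_posteriors, label, frame):
    # mlp_posteriors: dict mapping attribute name (e.g. "voicing", "/ah/")
    # to a list of per-frame posterior values produced by the MLP.
    return {(label, attr): post[frame] for attr, post in mlp_posteriors.items()}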

Phone recognition: results
(The results table is not preserved in the transcript.) *Significantly (p

