Discriminative Estimation (Maxent models and perceptron)
Generative vs. Discriminative models
Many slides are adapted from slides by Christopher Manning and perceptron slides by Alan Ritter
Introduction
• So far we've looked at "generative models"
  • Naive Bayes
• But there is now much use of conditional or discriminative probabilistic models in NLP, Speech, IR (and ML generally)
• Because:
  • They give high accuracy performance
  • They make it easy to incorporate lots of linguistically important features
  • They allow automatic building of language independent, retargetable NLP modules
Joint vs. Conditional Models
• We have some data {(d, c)} of paired observations d and hidden classes c.
• Joint (generative) models place probabilities over both observed data and the hidden stuff (generate the observed data from the hidden stuff):
  P(c, d)
  • All the classic StatNLP models:
    • n-gram models, Naive Bayes classifiers, hidden Markov models, probabilistic context-free grammars, IBM machine translation alignment models
Joint vs. Conditional Models
• Discriminative (conditional) models take the data as given, and put a probability over hidden structure given the data:
  P(c | d)
  • Logistic regression, conditional loglinear or maximum entropy models, conditional random fields
  • Also, SVMs, (averaged) perceptron, etc. are discriminative classifiers (but not directly probabilistic)
Bayes Net / Graphical Models
• Bayes net diagrams draw circles for random variables, and lines for direct dependencies
• Some variables are observed; some are hidden
• Each node is a little classifier (conditional probability table) based on incoming arcs
[Figure: two networks over a class c and observed words d1, d2, d3 — Naive Bayes (generative: arcs from c to each di) and Logistic Regression (discriminative: arcs from each di to c)]
Conditional vs. Joint Likelihood
• A joint model gives probabilities P(d, c) and tries to maximize this joint likelihood.
  • It turns out to be trivial to choose weights: just relative frequencies.
• A conditional model gives probabilities P(c | d). It takes the data as given and models only the conditional probability of the class.
  • We seek to maximize conditional likelihood.
  • Harder to do (as we'll see…)
  • More closely related to classification error.
Maxent Models and Discriminative Estimation
Generative vs. Discriminative models
Discriminative Model Features
Making features from text for discriminative NLP models
Features
• In these slides and most maxent work: features f are elementary pieces of evidence that link aspects of what we observe d with a category c that we want to predict
• A feature is a function with a bounded real value: f: C × D → ℝ
• A belief: a feature serves to create a partition of the data
Features
• In NLP uses, usually a feature specifies
  1. an indicator function – a yes/no boolean matching function – of properties of the input, and
  2. a particular class

  fi(c, d) ≡ [Φ(d) ∧ c = cj]   [Value is 0 or 1]

• Each feature picks out a data subset and suggests a label for it
Example features
• f1(c, d) ≡ [c = LOCATION ∧ w-1 = "in" ∧ isCapitalized(w)]
• f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
• f3(c, d) ≡ [c = DRUG ∧ ends(w, "c")]
• Models will assign to each feature a weight (see the sketch below):
  • A positive weight votes that this configuration is likely correct
  • A negative weight votes that this configuration is likely incorrect
[Examples: LOCATION in Québec · PERSON saw Sue · DRUG taking Zantac · LOCATION in Arcadia]
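To make these concrete, here is a minimal Python sketch of the three features as 0/1 indicator functions. The helper names and the (c, w, w-1) calling convention are assumptions for illustration, not part of the original slides.

```python
# A sketch of the three example features as 0/1 indicator functions.
# is_capitalized and has_accented_latin_char are hypothetical helpers.
import unicodedata

def is_capitalized(w):
    return w[:1].isupper()

def has_accented_latin_char(w):
    # True if any character carries a combining accent mark after NFD
    return any(unicodedata.combining(ch)
               for ch in unicodedata.normalize("NFD", w))

def f1(c, w, w_prev):  # [c = LOCATION ∧ w-1 = "in" ∧ isCapitalized(w)]
    return 1 if c == "LOCATION" and w_prev == "in" and is_capitalized(w) else 0

def f2(c, w, w_prev):  # [c = LOCATION ∧ hasAccentedLatinChar(w)]
    return 1 if c == "LOCATION" and has_accented_latin_char(w) else 0

def f3(c, w, w_prev):  # [c = DRUG ∧ ends(w, "c")]
    return 1 if c == "DRUG" and w.endswith("c") else 0

print(f1("LOCATION", "Québec", "in"))    # 1
print(f2("LOCATION", "Québec", "in"))    # 1
print(f3("DRUG", "Zantac", "taking"))    # 1
```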
Feature-Based Models
• The decision about a data point is based only on the features active at that point.
• Text Categorization
  • Data: BUSINESS: Stocks hit a yearly low …
  • Features: {…, stocks, hit, a, yearly, low, …}
  • Label: BUSINESS
• Word-Sense Disambiguation
  • Data: … to restructure bank:MONEY debt.
  • Features: {…, w-1=restructure, w+1=debt, …}
  • Label: MONEY
• POS Tagging
  • Data: DT JJ NN … The previous fall …
  • Features: {w=fall, t-1=JJ, w-1=previous}
  • Label: NN
Example: Text Categorization
(Zhang and Oles 2001)
• Features are presence of each word in a document and the document class (they do feature selection to use reliable indicator words)
• Tests on classic Reuters data set (and others)
  • Naïve Bayes: 77.0% F1
  • Linear regression: 86.0%
  • Logistic regression: 86.4%
  • Support vector machine: 86.5%
• Paper emphasizes the importance of regularization (smoothing) for successful use of discriminative methods (not used in much early NLP/IR work)
Other Maxent Classifier Examples
• You can use a maxent classifier whenever you want to assign data points to one of a number of classes:
  • Sentence boundary detection (Mikheev 2000)
    • Is a period end of sentence or abbreviation?
  • Sentiment analysis (Pang and Lee 2002)
    • Word unigrams, bigrams, POS counts, …
  • PP attachment (Ratnaparkhi 1998)
    • Attach to verb or noun? Features of head noun, preposition, etc.
  • Parsing decisions in general (Ratnaparkhi 1997; Johnson et al. 1999, etc.)
Discriminative Model Features
Making features from text for discriminative NLP models
Feature-based Linear Classifiers
How to put features into a classifier
Feature-Based Linear Classifiers
• Linear classifiers at classification time:
  • Linear function from feature sets {fi} to classes {c}.
  • Assign a weight λi to each feature fi.
  • We consider each class for an observed datum d
  • For a pair (c, d), features vote with their weights:
    • vote(c) = Σ λi fi(c, d)
  • Choose the class c which maximizes Σ λi fi(c, d)
[Candidate labelings: LOCATION in Québec · DRUG in Québec · PERSON in Québec]
Feature-Based Linear Classifiers
• Linear classifiers at classification time:
  • Linear function from feature sets {fi} to classes {c}.
  • Assign a weight λi to each feature fi.
  • We consider each class for an observed datum d
  • For a pair (c, d), features vote with their weights:
    • vote(c) = Σ λi fi(c, d)
  • Choose the class c which maximizes Σ λi fi(c, d) = LOCATION
[Worked example with weights λ1 = 1.8, λ2 = −0.6, λ3 = 0.3 for "in Québec": vote(LOCATION) = 1.8 − 0.6 = 1.2, vote(DRUG) = 0.3, vote(PERSON) = 0, so LOCATION wins; a code sketch follows]
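A minimal sketch of the voting computation, reusing the feature functions from the earlier sketch with the example weights λ1 = 1.8, λ2 = −0.6, λ3 = 0.3:

```python
# Linear classification by weighted feature votes: vote(c) = Σ_i λ_i f_i(c, d).
CLASSES = ["LOCATION", "DRUG", "PERSON"]
WEIGHTS = [1.8, -0.6, 0.3]
FEATURES = [f1, f2, f3]  # from the earlier sketch

def vote(c, w, w_prev):
    return sum(lam * f(c, w, w_prev) for lam, f in zip(WEIGHTS, FEATURES))

def classify(w, w_prev):
    # Choose the class with the highest total vote
    return max(CLASSES, key=lambda c: vote(c, w, w_prev))

print({c: vote(c, "Québec", "in") for c in CLASSES})
# ≈ {'LOCATION': 1.2, 'DRUG': 0.3, 'PERSON': 0.0} → LOCATION wins
print(classify("Québec", "in"))  # LOCATION
```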
Feature-Based Linear Classifiers
• There are many ways to choose weights for features, with different loss functions as the optimization goal:
  • Perceptron: find a currently misclassified example, and nudge weights in the direction of its correct classification
  • Margin-based methods (Support Vector Machines)
Feature-Based Linear Classifiers
• Exponential (log-linear, maxent, logistic, Gibbs) models:
  • Make a probabilistic model from the linear combination Σ λi fi(c, d):

$$P(c \mid d, \lambda) = \frac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$$

  (exp makes the votes positive; the denominator normalizes the votes)
  • P(LOCATION | in Québec) = e^1.8 e^−0.6 / (e^1.8 e^−0.6 + e^0.3 + e^0) = 0.586
  • P(DRUG | in Québec) = e^0.3 / (e^1.8 e^−0.6 + e^0.3 + e^0) = 0.238
  • P(PERSON | in Québec) = e^0 / (e^1.8 e^−0.6 + e^0.3 + e^0) = 0.176
  • The weights are the parameters of the probability model, combined via a "soft max" function (sketched below)
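A minimal sketch of the soft max step, reusing vote() from the previous sketch; it reproduces the probabilities on this slide:

```python
# Turn votes into probabilities: P(c|d) = exp(vote(c)) / Σ_c' exp(vote(c')).
from math import exp

def maxent_probs(w, w_prev):
    scores = {c: exp(vote(c, w, w_prev)) for c in CLASSES}
    z = sum(scores.values())  # the normalizer over all classes
    return {c: s / z for c, s in scores.items()}

print(maxent_probs("Québec", "in"))
# ≈ {'LOCATION': 0.586, 'DRUG': 0.238, 'PERSON': 0.176}
```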
Aside: logistic regression
• Maxent models in NLP are essentially the same as multiclass logistic regression models in statistics (or machine learning)
• The key role of feature functions in NLP and in this presentation:
  • The features are more general, with f also being a function of the class
Quiz Question
• Assuming exactly the same setup (3-class decision: LOCATION, PERSON, or DRUG; 3 features as before; maxent), what are:
  • P(PERSON | by Goéric) =
  • P(LOCATION | by Goéric) =
  • P(DRUG | by Goéric) =
• Weights and features:
  • 1.8  f1(c, d) ≡ [c = LOCATION ∧ w-1 = "in" ∧ isCapitalized(w)]
  • −0.6 f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
  • 0.3  f3(c, d) ≡ [c = DRUG ∧ ends(w, "c")]

$$P(c \mid d, \lambda) = \frac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$$

[Candidate labelings: PERSON by Goéric · LOCATION by Goéric · DRUG by Goéric]
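One way to check your answer, reusing the earlier sketches (note that f1 never fires here since w-1 is "by", f2 can fire only for LOCATION, and f3 only for DRUG):

```python
# Quiz check: votes are LOCATION = −0.6, DRUG = 0.3, PERSON = 0 for "by Goéric".
print(maxent_probs("Goéric", "by"))
# ≈ {'LOCATION': 0.19, 'DRUG': 0.47, 'PERSON': 0.34}
```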
Feature-based Linear Classifiers
How to put features into a classifier
Building a Maxent Model
The nuts and bolts
Building a Maxent Model
• We define features (indicator functions) over data points
  • Features represent sets of data points which are distinctive enough to deserve model parameters.
  • Words, but also "word contains number", "word ends with ing", etc.
• We will simply encode each Φ feature as a unique String (index)
  • A datum will give rise to a set of Strings: the active Φ features
  • Each feature fi(c, d) ≡ [Φ(d) ∧ c = cj] gets a real number weight
• We concentrate on Φ features but the math uses i indices of fi
Building a Maxent Model
• Features are often added during model development to target errors
  • Often, the easiest thing to think of are features that mark bad combinations
• Then, for any given feature weights, we want to be able to calculate:
  • Data conditional likelihood
  • Derivative of the likelihood wrt each feature weight
  • Uses expectations of each feature according to the model
• We can then find the optimum feature weights (discussed later).
Building a Maxent Model
The nuts and bolts
Naive Bayes vs. Maxent models
Generative vs. Discriminative models: The problem of overcounting evidence
Text classification: Asia or Europe
NB FACTORS:
• P(A) = P(E) =
• P(M|A) =
• P(M|E) =
• P(H|A) = P(K|A) =
• P(H|E) = P(K|E) =
NB MODEL PREDICTIONS:
• P(A, H, K, M) =
• P(E, H, K, M) =
• P(A | H, K, M) =
• P(E | H, K, M) =
[Figure: Naive Bayes network with a Class node (Europe vs. Asia) and word children H, K, M; the training data are documents containing "Monaco", "Hong Kong", etc. in each class]
Naive Bayes vs. Maxent Models
• Naive Bayes models multi-count correlated evidence
  • Each feature is multiplied in, even when you have multiple features telling you the same thing
• Maximum Entropy models (pretty much) solve this problem
  • As we will see, this is done by weighting features so that model expectations match the observed (empirical) expectations
Naive Bayes vs. Maxent models
Generative vs. Discriminative models: The problem of overcounting evidence
Maxent Models and Discriminative Estimation
Maximizing the likelihood
Feature Expectations
• We will crucially make use of two expectations: actual (empirical) and predicted (model) counts of a feature firing
• Empirical count (expectation) of a feature:

$$E_{\text{empirical}}(f_i) = \sum_{(c,d) \in \text{observed}(C,D)} f_i(c,d)$$

• Model expectation of a feature:

$$E(f_i) = \sum_{(c,d) \in (C,D)} P(c,d)\, f_i(c,d)$$

• Goal: fit the data well (a small sketch of both follows)
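A minimal sketch of both expectations on a tiny invented data set of (word, previous word, gold class) triples; the model expectation is computed in the "predicted count" form Σd Σc P(c|d) fi(c,d) that appears in the derivative below:

```python
# Toy labeled data (hypothetical, for illustration only).
DATA = [("Québec", "in", "LOCATION"),
        ("Zantac", "taking", "DRUG"),
        ("Sue", "saw", "PERSON")]

def empirical_count(f):
    # How often f fires on the observed (class, datum) pairs
    return sum(f(c, w, w_prev) for (w, w_prev, c) in DATA)

def predicted_count(f):
    # Expected number of firings under the model's P(c|d), summed over data
    return sum(p * f(c, w, w_prev)
               for (w, w_prev, _gold) in DATA
               for c, p in maxent_probs(w, w_prev).items())

print(empirical_count(f1), round(predicted_count(f1), 3))  # 1 0.586
```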
Exponential Model Likelihood
• Maximum (Conditional) Likelihood Models:
  • Given a model form, choose values of parameters to maximize the (conditional) likelihood of the data.

$$\log P(C \mid D, \lambda) = \sum_{(c,d) \in (C,D)} \log P(c \mid d, \lambda) = \sum_{(c,d) \in (C,D)} \log \frac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$$
The Likelihood Value
• The (log) conditional likelihood of iid data (C, D) according to a maxent model is a function of the data and the parameters λ:

$$\log P(C \mid D, \lambda) = \log \prod_{(c,d) \in (C,D)} P(c \mid d, \lambda) = \sum_{(c,d) \in (C,D)} \log P(c \mid d, \lambda)$$

• If there aren't many values of c, it's easy to calculate:

$$\log P(C \mid D, \lambda) = \sum_{(c,d) \in (C,D)} \log \frac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$$
The Likelihood Value
• We can separate this into two components:

$$\log P(C \mid D, \lambda) = \sum_{(c,d) \in (C,D)} \log \exp \sum_i \lambda_i f_i(c,d) \;-\; \sum_{(c,d) \in (C,D)} \log \sum_{c'} \exp \sum_i \lambda_i f_i(c',d)$$

$$\log P(C \mid D, \lambda) = N(\lambda) - M(\lambda)$$

• The derivative is the difference between the derivatives of each component (a numerical sketch follows)
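A minimal sketch of the N(λ) − M(λ) decomposition on the toy data from the expectations sketch:

```python
# Conditional log-likelihood as N(λ) − M(λ), reusing vote() and DATA.
from math import exp, log

def cond_log_likelihood():
    n = sum(vote(c, w, w_prev) for (w, w_prev, c) in DATA)        # N(λ)
    m = sum(log(sum(exp(vote(c, w, w_prev)) for c in CLASSES))    # M(λ)
            for (w, w_prev, _c) in DATA)
    return n - m

print(round(cond_log_likelihood(), 3))  # ≈ -2.543 with the example weights
```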
The Derivative I: Numerator

$$\frac{\partial N(\lambda)}{\partial \lambda_i} = \frac{\partial}{\partial \lambda_i} \sum_{(c,d) \in (C,D)} \log \exp \sum_i \lambda_i f_i(c,d) = \sum_{(c,d) \in (C,D)} \frac{\partial \sum_i \lambda_i f_i(c,d)}{\partial \lambda_i} = \sum_{(c,d) \in (C,D)} f_i(c,d)$$

• Derivative of the numerator is: the empirical count(fi, C)
TheDerivativeII:Denominator
i
DCdc c iii
i
dcfM
λ
λ
λλ
∂
∂
=∂
∂∑ ∑ ∑∈ ),(),( '
),'(explog)(
∑∑ ∑
∑ ∑∈ ∂
∂=
),(),(
'
''
),'(exp
),''(exp1
DCdc i
c iii
c iii
dcf
dcf λ
λ
λ
∑ ∑∑∑
∑ ∑∈ ∂
∂=
),(),( '''
),'(
1
),'(exp
),''(exp1
DCdc c i
iii
iii
c iii
dcfdcf
dcf λ
λλ
λ
i
iii
DCdc cc i
ii
iii dcf
dcf
dcf
λ
λ
λ
λ
∂
∂=
∑∑ ∑∑ ∑
∑∈
),'(
),''(exp
),'(exp
),(),( '''
∑ ∑∈
=),(),( '
),'(),|'(DCdc
ic
dcfdcP λ =predictedcount(fi,l)
The Derivative III

$$\frac{\partial \log P(C \mid D, \lambda)}{\partial \lambda_i} = \text{actual count}(f_i, C) - \text{predicted count}(f_i, \lambda)$$

• The optimum parameters are the ones for which each feature's predicted expectation equals its empirical expectation (a toy-data sketch follows). The optimum distribution is:
  • Always unique (but parameters may not be unique)
  • Always exists (if feature counts are from actual data).
• These models are also called maximum entropy models because we find the model having maximum entropy and satisfying the constraints:

$$E_p(f_j) = E_{\tilde{p}}(f_j),\ \forall j$$
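A minimal sketch of the full gradient on the toy data, one (actual − predicted) entry per feature, reusing the counting helpers above:

```python
# ∂ log P(C|D,λ) / ∂λ_i = actual count(f_i, C) − predicted count(f_i, λ)
def gradient():
    return [empirical_count(f) - predicted_count(f) for f in FEATURES]

print([round(g, 3) for g in gradient()])
# At the optimum every entry is 0: model expectations match empirical ones.
```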
Finding the optimal parameters
• We want to choose parameters λ1, λ2, λ3, … that maximize the conditional log-likelihood of the training data:

$$\text{LogLik}(C, D) = \sum_{i=1}^{n} \log P(c_i \mid d_i)$$

• To be able to do that, we've worked out how to calculate the function value and its partial derivatives (its gradient)
A likelihood surface
[Figure: a likelihood surface over two parameters; the conditional likelihood is concave, with a single global optimum]
Finding the optimal parameters
• Use your favorite numerical optimization package…
• Commonly, you minimize the negative of CLogLik (see the L-BFGS sketch below)
  1. Gradient descent (GD); Stochastic gradient descent (SGD)
  2. Iterative proportional fitting methods: Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS)
  3. Conjugate gradient (CG), perhaps with preconditioning
  4. Quasi-Newton methods – limited memory variable metric (LMVM) methods, in particular, L-BFGS
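As an illustrative sketch (an assumption, not from the slides), here is the toy problem fit with SciPy's L-BFGS by minimizing the negative conditional log-likelihood:

```python
# Fit the three weights by minimizing −log P(C|D, λ) with L-BFGS.
import numpy as np
from scipy.optimize import minimize

def neg_cll(lams):
    total = 0.0
    for (w, w_prev, gold) in DATA:
        votes = np.array([sum(l * f(c, w, w_prev)
                              for l, f in zip(lams, FEATURES))
                          for c in CLASSES])
        log_z = np.log(np.exp(votes).sum())           # per-datum normalizer
        total -= votes[CLASSES.index(gold)] - log_z   # −log P(gold | d, λ)
    return total

result = minimize(neg_cll, x0=np.zeros(len(FEATURES)), method="L-BFGS-B")
print(result.x)  # fitted weights
```

Note that on tiny, separable data like this the unregularized weights can grow very large (the "infinite weights" issue above), which is exactly what the regularization slides address next.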
Gradient Descent (GD)
Maxent Models and Discriminative Estimation
Maximizing the likelihood
Feature Sparsity Regularization
Combating overfitting
Smoothing: Issues of Scale
• Lots of features:
  • NLP maxent models can have well over a million features.
  • Even storing a single array of parameter values can have a substantial memory cost.
• Lots of sparsity:
  • Overfitting very easy – we need smoothing!
  • Many features seen in training will never occur again at test time.
• Optimization problems:
  • Feature weights can be infinite, and iterative solvers can take a long time to get to those infinities.
Smoothing / Priors / Regularization
Standard vs. Regularized Updates
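The original slide shows the update formulas graphically; as one common concrete choice (an assumption here, not read off the slide), an L2 / Gaussian-prior penalty can be added to the objective from the L-BFGS sketch:

```python
# L2 regularization: maximize log P(C|D,λ) − Σ_i λ_i² / (2σ²), i.e. add the
# penalty to the negative log-likelihood. σ² is a hyperparameter to tune.
SIGMA2 = 1.0

def neg_cll_regularized(lams):
    return neg_cll(lams) + sum(l * l for l in lams) / (2 * SIGMA2)

result = minimize(neg_cll_regularized, x0=np.zeros(len(FEATURES)),
                  method="L-BFGS-B")
print(result.x)  # weights now stay finite even on separable data
```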
Feature Sparsity Regularization
Combating overfitting
Batch vs. Online Learning
GD vs. SGD
Stochastic Gradient Descent (SGD)
Batch vs. Online learning:
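A minimal SGD sketch for the maxent objective (learning rate and epoch count are arbitrary illustrative choices): visit examples one at a time and update with that example's actual − predicted feature counts:

```python
# Stochastic gradient descent on the (negative) conditional log-likelihood.
import random
from math import exp

def probs_with(lams, w, w_prev):
    # P(c|d) under an arbitrary weight vector (not the global WEIGHTS)
    scores = {c: exp(sum(l * f(c, w, w_prev)
                         for l, f in zip(lams, FEATURES)))
              for c in CLASSES}
    z = sum(scores.values())
    return {c: s / z for c, s in scores.items()}

def sgd(epochs=50, eta=0.1):
    lams = [0.0] * len(FEATURES)
    examples = list(DATA)
    for _ in range(epochs):
        random.shuffle(examples)              # online: one example at a time
        for (w, w_prev, gold) in examples:
            p = probs_with(lams, w, w_prev)
            for i, f in enumerate(FEATURES):
                actual = f(gold, w, w_prev)
                predicted = sum(p[c] * f(c, w, w_prev) for c in CLASSES)
                lams[i] += eta * (actual - predicted)  # per-example step
    return lams

print([round(l, 2) for l in sgd()])
```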
Batch vs. Online Learning
GD vs. SGD
Perceptron
Another online learning algorithm
Perceptron Algorithm
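A minimal sketch of the multiclass perceptron on the running example, reusing the toy data and feature functions from the earlier sketches:

```python
# Perceptron: predict with current weights; on a mistake, add the gold
# class's feature vector and subtract the predicted class's.
def perceptron(epochs=10):
    lams = [0.0] * len(FEATURES)
    for _ in range(epochs):
        for (w, w_prev, gold) in DATA:
            pred = max(CLASSES,
                       key=lambda c: sum(l * f(c, w, w_prev)
                                         for l, f in zip(lams, FEATURES)))
            if pred != gold:                  # update only on mistakes
                for i, f in enumerate(FEATURES):
                    lams[i] += f(gold, w, w_prev) - f(pred, w, w_prev)
    return lams

print(perceptron())
```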
MaxEnt vs. Perceptron
• The perceptron doesn't always make updates: it updates only on mistakes, while the maxent gradient is typically nonzero everywhere
• Probabilities vs. scores
Regularization in the Perceptron Algorithm
• No gradient is computed, so we can't directly include a regularizer in an objective function.
• Instead, run different numbers of iterations
• Use parameter averaging, for instance the average of all parameters after seeing each data point (see the sketch below)
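A minimal averaged-perceptron sketch: keep a running sum of the weight vector after seeing each data point and return the average:

```python
# Averaged perceptron: the returned weights are the average of the weight
# vector over every data-point visit, which smooths late oscillations.
def averaged_perceptron(epochs=10):
    lams = [0.0] * len(FEATURES)
    totals = [0.0] * len(FEATURES)
    visits = 0
    for _ in range(epochs):
        for (w, w_prev, gold) in DATA:
            pred = max(CLASSES,
                       key=lambda c: sum(l * f(c, w, w_prev)
                                         for l, f in zip(lams, FEATURES)))
            if pred != gold:
                for i, f in enumerate(FEATURES):
                    lams[i] += f(gold, w, w_prev) - f(pred, w, w_prev)
            for i in range(len(FEATURES)):    # accumulate after each data point
                totals[i] += lams[i]
            visits += 1
    return [t / visits for t in totals]

print([round(l, 2) for l in averaged_perceptron()])
```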