Statistical Language Modeling
• Intuition: by looking at large quantities of text we can find statistical regularities
  – Distinguish between correct and incorrect sentences
• Language models define a probability distribution over strings (e.g., sentences) in a language.
• We can use a language model to score and rank sentences
“I don’t know {whether,weather} to laugh or cry”
P(“I don’t .. weather to laugh ..”) < P(“I don’t .. whether to laugh ..”)
Language Modeling with N-grams
• Unigram model: P(w1) · P(w2) · ... · P(wi)
• Bigram model: P(w1) · P(w2 | w1) · ... · P(wi | wi−1)
• Trigram model: P(w1) · P(w2 | w1) · ... · P(wi | wi−2, wi−1)
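To make this concrete, here is a minimal sketch of unigram and bigram scoring with raw MLE counts (the toy corpus and whitespace tokenizer are illustrative, not from the slides):

```python
from collections import Counter

# Toy corpus; a real LM would be estimated from large quantities of text.
corpus = "i don't know whether to laugh or cry . i don't know whether to stay .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def p_unigram(w):
    return unigrams[w] / total

def p_bigram(w, prev):
    # MLE: count(prev, w) / count(prev); zero for unseen bigrams (no smoothing)
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0

def score_bigram(sentence):
    # P(w1) * prod_i P(wi | wi-1), as in the bigram model above
    words = sentence.split()
    p = p_unigram(words[0])
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(w, prev)
    return p

print(score_bigram("i don't know whether to laugh"))  # > 0: the model has evidence
print(score_bigram("i don't know weather to laugh"))  # 0.0 without smoothing
```

This is exactly the "score and rank" use above: the model prefers "whether" over "weather" in this context.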
Evaluating Language Models
• Assuming that we have a language model, how can we tell if it's good?
• Option 1: try to generate Shakespeare...
  – This is known as qualitative evaluation
• Option 2: quantitative evaluation
  – Option 2.1: see how well you do on spelling correction
    • This is known as extrinsic evaluation
  – Option 2.2: find an independent measure of LM quality
    • This is known as intrinsic evaluation
When are LMs applicable?
Deep Visual-Semantic Alignments for Generating Image Descriptions
Finding regularity in language is surprisingly useful! Easy example: weather/whether
But also:
– Translation (can you produce “legal” French from source English?)
– Caption generation (combine output of visual sensors into a grammatical sentence)
Classification
• A fundamental machine learning tool
  – Widely applicable in NLP
• Supervised learning: the learner is given a collection of labeled documents
  – Emails: spam/not spam; Reviews: pos/neg
• Build a function mapping documents to labels
  – Key property: generalization
    • The function should work well on new data
Sentiment Analysis
Dude, I just watched this horror flick! Selling points: nightmares scenes, torture scenes, terrible monsters that was so bad a##!
Don’t buy the popcorn it was terrible, the monsters selling it must have wanted to torture me, it was so bad it gave me nightmares!
What should your learning algorithm look at?
Deceptive Reviews
Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Ott et al. ACL 2011.
What should your learning algorithm look at?
Power Relations
Echoes of Power: Language Effects and Power Differences in Social Interaction. Danescu-Niculescu-Mizil et al. WWW 2012.
“Blah unacceptable blah”
“Your honor, I agree blah blah blah”
What should your learning algorithm look at?
Power Relations
Communicative behaviors are “patterned and coordinated, like a dance” [Niederhoffer and Pennebaker 2002]
Echoes of Power: Language Effects and Power Differences in Social Interaction. Danescu-Niculescu-Mizil et al. WWW 2012.
Classification
• We assume we have a labeled dataset. How can we build a classifier?
• Decide on a representation and a learning algorithm
  – Essentially: function approximation
    • Representation: what is the domain of the function
    • Learning: how to find a good approximation
  – We will look into several simple examples
    • Naïve Bayes, Perceptron
• Let's start with some definitions...
Basic Definitions
• Given: D, a set of labeled examples {<x, y>}
• Goal: learn a function f(x) s.t. f(x) = y
  – Note: y can be binary or categorical
  – Typically the input x is represented as a vector of features
• Break D into three parts:
  – Training set (used by the learning algorithm)
  – Test set (evaluate the learned model)
  – Development set (tuning the learning algorithm)
• Evaluation: a performance measure over the test set (see the sketch after this list)
  – Accuracy: proportion of correct predictions (test data)
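A minimal sketch of this protocol (the 80/10/10 split below is a common convention, not something the slides mandate):

```python
# Split labeled data D into train / development / test portions.
def split(data, train_frac=0.8, dev_frac=0.1):
    n = len(data)
    a, b = int(n * train_frac), int(n * (train_frac + dev_frac))
    return data[:a], data[a:b], data[b:]   # train, dev, test

# Accuracy: proportion of correct predictions on the held-out test set.
def accuracy(predict, test):
    return sum(1 for x, y in test if predict(x) == y) / len(test)
```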
Precision and Recall
• Given a dataset, we train a classifier that gets 99% accuracy
  – Did we do a good job?
• Build a classifier for brain tumors:
  – 99.9% of brain scans do not show signs of a tumor
  – Did we do a good job?
• By simply saying “NO” to all examples we reduce the error by a factor of 10!
  – Clearly, accuracy is not the best way to evaluate the learning system when the data is heavily skewed!
• Intuition: we need a measure that captures the (rare) class we care about!
Precision and Recall
• The learner can make two kinds of mistakes:
  – False Positive
  – False Negative
• Precision: “when we predicted the rare class, how often are we right?”
• Recall: “out of all the instances of the rare class, how many did we catch?”
                  True label: 1      True label: 0
Predicted: 1      True Positive      False Positive
Predicted: 0      False Negative     True Negative

Precision = True Pos / Predicted Pos = True Pos / (True Pos + False Pos)
Recall    = True Pos / Actual Pos    = True Pos / (True Pos + False Neg)
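As a sketch, precision and recall can be computed directly from these counts (the toy labels are illustrative; 1 marks the rare positive class):

```python
def precision_recall(gold, pred):
    tp = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in zip(gold, pred) if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in zip(gold, pred) if g == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

gold = [1, 0, 0, 1, 1, 0, 0, 0]
pred = [1, 0, 1, 0, 1, 0, 0, 0]
print(precision_recall(gold, pred))  # (0.666..., 0.666...)
```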
F-Score
• Precision and recall give us two reference points to compare learning performance
• Which algorithm is better?
  • Option 1: Average
  • Option 2: F-Score
              Precision   Recall   Average   F-Score
Algorithm 1     0.5        0.4      0.45     0.444
Algorithm 2     0.7        0.1      0.4      0.175
Algorithm 3     0.02       1        0.51     0.0392
F-Score = 2PR / (P + R)

We need a single score. Properties of the F-score:
• Ranges between 0 and 1
• Prefers precision and recall with similar values
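A short sketch reproducing the table above; note how the harmonic mean punishes the imbalanced Algorithm 2 and Algorithm 3, while the plain average does not:

```python
def f_score(p, r):
    # Harmonic mean of precision and recall
    return 2 * p * r / (p + r) if p + r else 0.0

for name, p, r in [("Algorithm 1", 0.5, 0.4),
                   ("Algorithm 2", 0.7, 0.1),
                   ("Algorithm 3", 0.02, 1.0)]:
    print(name, (p + r) / 2, round(f_score(p, r), 4))
# Algorithm 1: average 0.45, F 0.4444
# Algorithm 2: average 0.4,  F 0.175
# Algorithm 3: average 0.51, F 0.0392
```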
Simple Example: Naïve Bayes
• Naïve Bayes: a simple probabilistic classifier
  – Given a set of labeled data:
    • Documents D, each associated with a label v
    • Simple feature representation: BoW
  – Learning: construct a probability distribution P(v|d)
  – Prediction: assign the label with the highest probability
• Relies on strong simplifying assumptions
Simple Representation: BoW
• Basic idea (sentiment analysis):
  – “I loved this movie, it's awesome! I couldn't stop laughing for two hours!”
  – Mapping input to label can be done by representing the frequencies of individual words
  – Document = word counts
• Simple, yet surprisingly powerful representation!
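A minimal sketch of the BoW mapping (the regex tokenizer is an illustrative choice):

```python
import re
from collections import Counter

doc = "I loved this movie, it's awesome! I couldn't stop laughing for two hours!"

tokens = re.findall(r"[a-z']+", doc.lower())
bow = Counter(tokens)          # document = word counts; order is discarded
print(bow["i"], bow["loved"])  # 2 1
```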
Bayes Rule
• Naïve Bayes is a simple probabilistic classification method, based on Bayes rule:

P(v | d) = P(d | v) · P(v) / P(d)
Basics of Naïve Bayes
• P(v): the prior probability of a label v. Reflects background knowledge, before the data is observed. If there is no information: a uniform distribution.
• P(D): the probability that this sample of the data is observed (with no knowledge of the label).
• P(D|v): the probability of observing the sample D, given that the label v is the target (likelihood).
• P(v|D): the posterior probability of v. The probability that v is the target, given that D has been observed.
Bayes Rule
• Naïve Bayes is a simple classification method, based on Bayes rule:

P(v | d) = P(d | v) · P(v) / P(d)

Check your intuition:
• P(v|d) increases with P(v) and with P(d|v)
• P(v|d) decreases with P(d)
Naïve Bayes
• The learner considers a set of candidate labels, and attempts to find the most probable one, v ∈ V, given the observed data.
• Such a maximally probable assignment is called the maximum a posteriori (MAP) assignment; Bayes theorem is used to compute it:

v_MAP = argmax_{v ∈ V} P(v|D) = argmax_{v ∈ V} P(D|v) · P(v) / P(D)
      = argmax_{v ∈ V} P(D|v) · P(v)       since P(D) is the same for all v ∈ V
Naïve Bayes
• How can we compute P(v|D)?
  – Basic idea: represent the document as a set of features, such as BoW features

v_MAP = argmax_{vj ∈ V} P(x1, x2, ..., xn | vj) · P(vj) / P(x1, x2, ..., xn)
      = argmax_{vj ∈ V} P(x1, x2, ..., xn | vj) · P(vj)
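A sketch of this decision rule, assuming the tables prior[v] and likelihood[v][word] have already been estimated (those names and the 1e-10 floor for unseen words are illustrative; proper smoothing is discussed later):

```python
import math

def predict(words, prior, likelihood):
    def log_score(v):
        # log P(v) + sum_i log P(x_i | v); logs avoid numeric underflow
        return math.log(prior[v]) + sum(math.log(likelihood[v].get(w, 1e-10))
                                        for w in words)
    return max(prior, key=log_score)   # argmax over candidate labels v

prior = {"like": 0.5, "dislike": 0.5}
likelihood = {"like":    {"awesome": 0.05,  "terrible": 0.001},
              "dislike": {"awesome": 0.001, "terrible": 0.05}}
print(predict(["awesome", "awesome"], prior, likelihood))  # like
```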
NB: Parameter Estimation
• Given training data, we can estimate the two terms.
• Estimating P(v) is easy: for each value v, count how many times it appears in the training data.
• However, it is not feasible to estimate P(x1, ..., xn | v) directly
  – We would have to estimate, for each target value, the probability of each instance (most of which will never occur)
• In order to use a Bayesian classifier in practice, we need to make assumptions that will allow us to estimate these quantities.

v_MAP = argmax_v P(x1, x2, ..., xn | v) · P(v)
Question: assume binary xi's. How many parameters does the model require?
NB: Independence Assumption
• Bag-of-words representation:
  – Word position can be ignored
• Conditional independence: assume feature probabilities are independent given the label
  – P(xi | x1, ..., xi−1, vj) = P(xi | vj)
• Both assumptions are not true
  – They help simplify the model
  – Simple models work well
Naive Bayes
By the chain rule:

P(x1, x2, ..., xn | vj)
  = P(x1 | x2, ..., xn, vj) · P(x2, ..., xn | vj)
  = P(x1 | x2, ..., xn, vj) · P(x2 | x3, ..., xn, vj) · P(x3, ..., xn | vj)
  = ...
  = P(x1 | x2, ..., xn, vj) · P(x2 | x3, ..., xn, vj) · ... · P(xn | vj)

Assumption: feature values are independent given the target value, so

P(x1, x2, ..., xn | vj) = ∏_{i=1..n} P(xi | vj)

and v_MAP = argmax_v P(x1, x2, ..., xn | v) · P(v).
Estimating Probabilities (MLE)
Assume a document classification problem, using word features:

v_NB = argmax_{v ∈ {like, dislike}} P(v) · ∏i P(xi = wordi | v)

How do we estimate P(wordk | v)? The MLE estimate is:

P(wordk | v) = nk / n

where nk = #(times wordk appears in training documents labeled v) and n = #(documents labeled v).

Sparsity of data is a problem:
– if n is small, the estimate is not accurate
– if nk is 0, it will dominate the estimate: we will never predict v if a word that never appeared in training (with v) appears in the test data
Robust Estimation of Probabilities
• This process is called smoothing.
• There are many ways to do it, some better justified than others; it is an empirical issue.
• Here:
  – nk is #(occurrences of the word in the presence of v)
  – n is #(occurrences of the label v)
  – p is a prior estimate of the probability (e.g., uniform)
  – m is the equivalent sample size (# of labels)
• Laplace rule: for the Boolean case, p = 1/2, m = 2
v_NB = argmax_{v ∈ {like, dislike}} P(v) · ∏i P(xi = wordi | v)

P(xk | v) = (nk + m·p) / (n + m)

Laplace rule (p = 1/2, m = 2): P(xk | v) = (nk + 1) / (n + 2)
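A sketch of the m-estimate (with p = 1/2 and m = 2 it reduces to the Laplace rule for the Boolean case):

```python
def m_estimate(n_k, n, p=0.5, m=2):
    # (n_k + m*p) / (n + m): smoothed version of the MLE n_k / n
    return (n_k + m * p) / (n + m)

print(m_estimate(0, 100))    # ~0.0098: an unseen word no longer gets probability 0
print(m_estimate(30, 100))   # ~0.304: close to the raw MLE 0.3
```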
Naïve Bayes
• Very easy to implement
• Converges very quickly
  – Learning is just counting
• Performs well in practice
  – Applied to many document classification tasks
  – If the dataset is small, NB can perform better than more sophisticated algorithms
• Strong independence assumptions
  – If the assumptions hold: NB is the optimal classifier
  – Even if not, it can perform well
• Next: from NB to learning linear threshold functions
Naïve Bayes: Two Classes
• Notice that the naïve Bayes method gives a method for predicting, rather than an explicit classifier.
• In the case of two classes, v ∈ {0,1}, we predict v = 1 iff:
P(v=1) · ∏_{i=1..n} P(xi | v=1)
------------------------------- > 1
P(v=0) · ∏_{i=1..n} P(xi | v=0)
Naïve Bayes: Two Classes
• In the case of two classes, v ∈ {0,1}, we predict v = 1 iff:

P(v=1) · ∏_{i=1..n} P(xi | v=1)
------------------------------- > 1
P(v=0) · ∏_{i=1..n} P(xi | v=0)

Denote: pi = P(xi = 1 | v = 1), qi = P(xi = 1 | v = 0). For binary features xi, P(xi | v=1) = pi^xi · (1−pi)^(1−xi) (and similarly with qi under v = 0), so the test becomes:

P(v=1) · ∏_{i=1..n} pi^xi · (1−pi)^(1−xi)
------------------------------------------ > 1
P(v=0) · ∏_{i=1..n} qi^xi · (1−qi)^(1−xi)
Naïve Bayes: Two Classes
In the case of two classes, v ∈ {0,1}, we predict v = 1 iff:

P(v=1) · ∏_{i=1..n} pi^xi · (1−pi)^(1−xi)
------------------------------------------
P(v=0) · ∏_{i=1..n} qi^xi · (1−qi)^(1−xi)

  =

P(v=1) · ∏_{i=1..n} (1−pi) · (pi / (1−pi))^xi
---------------------------------------------- > 1
P(v=0) · ∏_{i=1..n} (1−qi) · (qi / (1−qi))^xi
Naïve Bayes: Two Classes
In the case of two classes, v ∈ {0,1}, we predict v = 1 iff:

P(v=1) · ∏_{i=1..n} (1−pi) · (pi / (1−pi))^xi
---------------------------------------------- > 1
P(v=0) · ∏_{i=1..n} (1−qi) · (qi / (1−qi))^xi

Take the logarithm; we predict v = 1 iff:

log[P(v=1) / P(v=0)] + Σi log[(1−pi) / (1−qi)] + Σi (log[pi / (1−pi)] − log[qi / (1−qi)]) · xi > 0
Naïve Bayes: Two Classes
In the case of two classes, v ∈ {0,1}, we predict v = 1 iff:

log[P(v=1) / P(v=0)] + Σi log[(1−pi) / (1−qi)] + Σi (log[pi / (1−pi)] − log[qi / (1−qi)]) · xi > 0

• We get that naive Bayes is a linear separator, with:

wi = log[pi / (1−pi)] − log[qi / (1−qi)] = log[ pi (1−qi) / (qi (1−pi)) ]

If pi = qi, then wi = 0 and the feature is irrelevant.
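A small numeric sketch checking this equivalence (the values of pi, qi, and the priors are illustrative): the linear score b + w·x agrees in sign with the probabilistic ratio test.

```python
import math

def nb_as_linear(p, q, prior1, prior0):
    # w_i = log(p_i/(1-p_i)) - log(q_i/(1-q_i)); the x-independent terms form the bias b
    w = [math.log(pi / (1 - pi)) - math.log(qi / (1 - qi)) for pi, qi in zip(p, q)]
    b = math.log(prior1 / prior0) + sum(math.log((1 - pi) / (1 - qi))
                                        for pi, qi in zip(p, q))
    return w, b

w, b = nb_as_linear(p=[0.8, 0.3], q=[0.2, 0.4], prior1=0.5, prior0=0.5)
x = [1, 0]
score = b + sum(wi * xi for wi, xi in zip(w, x))
print(1 if score > 0 else 0)   # 1, matching the ratio test on the same parameters
```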
Linear Classifiers
• Linear threshold functions
  – Associate a weight (wi) with each feature (xi)
  – Prediction: sign(b + wTx) = sign(b + Σ wi xi)
    • If b + wTx ≥ 0, predict y = 1
    • Otherwise, predict y = −1
• NB is a linear threshold function
  – The weight vector (w) is assigned by computing conditional probabilities
• In fact, linear threshold functions are a very popular representation!
Linear Classifiers
Each point in this space is a document. The coordinates (e.g., x1, x2) are determined by feature activations.

[Figure: + and − documents in feature space, separated by the hyperplane sign(b + wTx)]
Expressivity
• Linear functions are quite expressive
  – There exists a linear function that is consistent with the data
• A famous negative example (XOR):
[Figure: the XOR configuration of + and − points, which no linear function can separate]
Expressivity
By transforming the feature space, these functions can be made linear.
Represent each point in 2D as (x, x²).
Expressivity
More realistic scenario: the data is almost linearly separable, except for some noise.

[Figure: data separated by sign(b + wTx), with a few noisy + and − points on the wrong side of the boundary]
Features
• So far we have discussed the BoW representation
  – In fact, you can use a much richer representation
• Broader definition:
  – Functions mapping attributes of the input to a Boolean/categorical/numeric value
• Question: assume that you have a lexicon containing positive and negative sentiment words. How can you use it to improve over BoW?
φ1(x) = 1 if x1 is capitalized, 0 otherwise
φk(x) = 1 if x contains “good” more than twice, 0 otherwise
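A sketch of such feature functions (the helper names and the toy lexicon are illustrative; the last one is one possible answer to the lexicon question above):

```python
positive_words = {"good", "great", "awesome", "loved"}   # assumed toy lexicon

def phi_1(x):
    # 1 if the first token of x is capitalized, 0 otherwise
    return 1 if x.split()[0][0].isupper() else 0

def phi_k(x):
    # 1 if x contains "good" more than twice, 0 otherwise
    return 1 if x.lower().split().count("good") > 2 else 0

def phi_lexicon(x):
    # numeric feature: count of positive-lexicon words (one way to go beyond BoW)
    return sum(1 for w in x.lower().split() if w in positive_words)
```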
Perceptron
• One of the earliest learning algorithms
  – Introduced by Rosenblatt in 1958 to model neural learning
• Goal: directly search for a separating hyperplane
  – If one exists, the perceptron will find it
  – If not, ...
• Online algorithm
  – Considers one example at a time (NB looks at the entire dataset)
• Error-driven algorithm
  – Updates the weights only when a mistake is made
Perceptron Intuition
Perceptron
• We learn f: X → {−1, +1}, represented as f = sgn(w·x)
• Where X = {0,1}ⁿ or X = Rⁿ, and w ∈ Rⁿ
• Given labeled examples: {(x1, y1), (x2, y2), ..., (xm, ym)}

1. Initialize w = 0 ∈ Rⁿ
2. Cycle through all examples:
   a. Predict the label of instance x to be y' = sgn(w·x)
   b. If y' ≠ y, update the weight vector: w = w + r·y·x (r is a constant, the learning rate).
      Otherwise, if y' = y, leave the weights unchanged.
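A direct sketch of this pseudocode (the epoch cap and the toy AND-style dataset, with a constant bias feature, are illustrative additions):

```python
def perceptron(examples, n_features, r=1.0, epochs=10):
    w = [0.0] * n_features                                   # 1. initialize w = 0
    for _ in range(epochs):                                  # 2. cycle through all examples
        for x, y in examples:
            y_pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1
            if y_pred != y:                                  # error-driven: update on mistakes only
                w = [wi + r * y * xi for wi, xi in zip(w, x)]
    return w

# Linearly separable AND-like data; the leading 1 acts as a bias feature.
data = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], 1)]
print(perceptron(data, 3))
```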
Margin
• The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.
Margin
• The margin of a hyperplane for a dataset is the distance between the hyperplane and the data point nearest to it.
• The margin of a dataset (𝛾) is the maximum margin possible for that dataset using any weight vector.
Mistake Bound for Perceptron
• Let D = {(xi, yi)} be a labeled dataset that is separable.
• Let ||xi|| < R for all examples.
• Let 𝛾 be the margin of the dataset D.
• Then, the perceptron algorithm will make at most R²/𝛾² mistakes on the data.
Practical Example
Source: Scaling to Very Very Large Corpora for Natural Language Disambiguation. Michele Banko, Eric Brill. Microsoft Research, Redmond, WA. 2001.
Task: context-sensitive spelling: {principle, principal}, {weather, whether}.
Deceptive Reviews
Finding Deceptive Opinion Spam by Any Stretch of the Imagination. Ott et al. ACL 2011.
What should your learning algorithm look at?
Deception Classification
Summary
• Classification is a basic tool for NLP
  – E.g., what is the topic of a document?
• Classifier: a mapping from input to label
  – Label: binary or categorical
• We saw two simple learning algorithms for finding the parameters of linear classification functions
  – Naïve Bayes and Perceptron
• Next:
  – More sophisticated algorithms
  – Applications (or: how to get it to work!)
Questions?