CS395T: Structured Models for NLP
Lecture 2: Binary Classification
Greg Durrett
Some slides adapted from Vivek Srikumar, University of Utah
Administrivia
‣ Course enrollment
‣ OHs this week: Jifan 1pm-2pm Tues (today) in GDC 1.304, TA desk #1; Greg 11am-12pm Weds + 10am-11am Fri in GDC 3.420
‣ Readings on course website
‣ Mini 1 is out, due September 11
‣ Feel free to extend the code as needed; optimizers, featurization, etc. aren't set in stone
This Lecture
‣ Linear classification fundamentals
‣ Three discriminative models: logistic regression, perceptron, SVM
‣ Naive Bayes, maximum likelihood in generative models
‣ Different motivations but very similar update rules / inference!
Classification
‣ Data point $x$ with label $y \in \{0, 1\}$
‣ Embed data point in a feature space $f(x) \in \mathbb{R}^n$, but in this lecture $x$ and $f(x)$ are interchangeable
[figure: positive (+) and negative (−) points separated by a hyperplane in feature space]
‣ Linear decision rule: $w^\top f(x) + b > 0$
‣ Can delete bias if we augment feature space: $w^\top f(x) > 0$, e.g. $f(x) = [0.5, 1.6, 0.3]$ becomes $[0.5, 1.6, 0.3, 1]$
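A minimal sketch of this decision rule in Python (the weights and feature values below are invented purely for illustration):

```python
import numpy as np

def augment(f_x):
    """Append a constant 1 so the bias b can live inside w."""
    return np.append(f_x, 1.0)

def predict(w, f_x):
    """Linear decision rule: 1 if w . f(x) > 0, else 0."""
    return 1 if np.dot(w, augment(f_x)) > 0 else 0

w = np.array([1.0, -2.0, 0.5, 0.1])  # last entry plays the role of b
print(predict(w, np.array([0.5, 1.6, 0.3])))
```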
Linear functions are powerful!
[figure: left, data plotted with $f(x) = [x_1, x_2]$, where no linear separator exists (???); right, the same data plotted against $x_1 x_2$ and $x_1$, where it is linearly separable]
‣ $f(x) = [x_1, x_2]$ vs. $f(x) = [x_1, x_2, x_1^2, x_2^2, x_1 x_2]$
‣ "Kernel trick" does this for "free," but is too expensive to use in NLP applications: training is $O(n^2)$ instead of $O(n \cdot (\text{num feats}))$
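A concrete toy version of the feature map above (my example, not from the slides): XOR-style points that no line separates in $(x_1, x_2)$ become separable once $x_1 x_2$ is a coordinate, e.g. by putting weight only on that coordinate.

```python
import numpy as np

def quadratic_features(x1, x2):
    """f(x) = [x1, x2, x1^2, x2^2, x1*x2] from the slide."""
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# XOR-style data: label 1 iff x1 and x2 have the same sign
points = [(-1, -1, 1), (-1, 1, 0), (1, -1, 0), (1, 1, 1)]
w = np.array([0, 0, 0, 0, 1.0])  # weight only on the x1*x2 coordinate
for x1, x2, label in points:
    pred = 1 if np.dot(w, quadratic_features(x1, x2)) > 0 else 0
    print((x1, x2), label, pred)  # predictions match labels
```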
Classification: Sentiment Analysis
this movie was great! would watch again → Positive
that film was awful, I'll never watch again → Negative
‣ Surface cues can basically tell you what's going on here: presence or absence of certain words (great, awful)
‣ Steps to classification:
‣ Turn examples like this into feature vectors
‣ Pick a model / learning algorithm
‣ Train weights on data to get our classifier
Feature Representation
this movie was great! would watch again → Positive
‣ Convert this example to a vector using bag-of-words features
‣ Requires indexing the features (mapping them to axes)
f(x) = [ [contains the]  [contains a]  [contains was]  [contains movie]  [contains film]  … ]
          position 0      position 1    position 2      position 3        position 4
     = [ 0, 0, 1, 1, 0, … ]
‣ More sophisticated feature mappings possible (tf-idf), as well as lots of other features: character n-grams, parts of speech, lemmas, …
‣ Very large vector space (size of vocabulary), sparse features
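A sketch of this featurization (the `Indexer` class here is a stand-in I wrote; Mini 1's code has its own utilities):

```python
from collections import defaultdict

class Indexer:
    """Maps feature names (words) to axes of the feature vector."""
    def __init__(self):
        self.index = {}
    def get(self, feat):
        if feat not in self.index:
            self.index[feat] = len(self.index)
        return self.index[feat]

def bag_of_words(sentence, indexer):
    """Sparse bag-of-words: dict from feature index to count."""
    feats = defaultdict(float)
    for word in sentence.lower().split():
        feats[indexer.get("contains " + word)] += 1.0
    return feats

indexer = Indexer()
print(bag_of_words("this movie was great ! would watch again", indexer))
```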
Naive Bayes
‣ Data point $x = (x_1, \ldots, x_n)$, label $y \in \{0, 1\}$
‣ Formulate a probabilistic model that places a distribution $P(x, y)$
[model structure: label $y$ generates each feature $x_i$, $i = 1 \ldots n$]
‣ Compute $P(y|x)$, predict $\mathrm{argmax}_y\, P(y|x)$ to classify

$P(y|x) = \dfrac{P(y) P(x|y)}{P(x)}$   (Bayes' Rule)
$\propto P(y) P(x|y)$   (constant: irrelevant for finding the max)
$= P(y) \prod_{i=1}^n P(x_i|y)$   ("naive" assumption)

$\mathrm{argmax}_y P(y|x) = \mathrm{argmax}_y \log P(y|x) = \mathrm{argmax}_y \left[ \log P(y) + \sum_{i=1}^n \log P(x_i|y) \right]$   — linear model!
Naive Bayes Example
"it was great"
$P(y|x) \propto P(y) \prod_{i=1}^n P(x_i|y)$
$\mathrm{argmax}_y P(y|x) = \mathrm{argmax}_y \log P(y|x) = \mathrm{argmax}_y \left[ \log P(y) + \sum_{i=1}^n \log P(x_i|y) \right]$
Maximum Likelihood Estimation
‣ Data points $(x_j, y_j)$ provided ($j$ indexes over examples)
‣ Find values of $P(y), P(x_i|y)$ that maximize data likelihood (generative):
$\prod_{j=1}^m P(y_j, x_j) = \prod_{j=1}^m P(y_j) \left[ \prod_{i=1}^n P(x_{ji}|y_j) \right]$
(data points $j$; features $i$; $x_{ji}$ is the $i$th feature of the $j$th example)
Maximum Likelihood Estimation
‣ Imagine a coin flip which is heads with probability $p$
‣ Observe (H, H, H, T) and maximize likelihood: $\prod_{j=1}^m P(y_j) = p^3(1-p)$
‣ Easier: maximize log likelihood: $\sum_{j=1}^m \log P(y_j) = 3 \log p + \log(1-p)$
[figure: log likelihood as a function of $p$ on $[0, 1]$, maximized at $P(H) = 0.75$]
‣ Maximum likelihood parameters for binomial/multinomial = read counts off of the data + normalize
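Filling in the one-line calculation behind $P(H) = 0.75$: set the derivative of the log likelihood to zero.

$$\frac{d}{dp}\Big[3 \log p + \log(1-p)\Big] = \frac{3}{p} - \frac{1}{1-p} = 0 \;\Longrightarrow\; 3(1-p) = p \;\Longrightarrow\; p = \frac{3}{4}$$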
Maximum Likelihood Estimation
‣ Data points $(x_j, y_j)$ provided ($j$ indexes over examples)
‣ Find values of $P(y), P(x_i|y)$ that maximize data likelihood (generative):
$\prod_{j=1}^m P(y_j, x_j) = \prod_{j=1}^m P(y_j) \left[ \prod_{i=1}^n P(x_{ji}|y_j) \right]$
‣ Equivalent to maximizing logarithm of data likelihood:
$\sum_{j=1}^m \log P(y_j, x_j) = \sum_{j=1}^m \left[ \log P(y_j) + \sum_{i=1}^n \log P(x_{ji}|y_j) \right]$
($x_{ji}$ is the $i$th feature of the $j$th example)
Maximum Likelihood for Naive Bayes
+ this movie was great! would watch again
− that film was awful, I'll never watch again
− I didn't really like that movie
− dry and a bit distasteful, it misses the mark
− great potential but ended up being a flop
+ I liked it well enough for an action flick
+ I expected a great film and left happy
+ brilliant directing and stunning visuals

$P(\text{great}|+) = \frac{1}{2}$   $P(\text{great}|-) = \frac{1}{4}$   $P(+) = \frac{1}{2}$   $P(-) = \frac{1}{2}$

"it was great":  $P(y|x) \propto \left[\, P(+)P(\text{great}|+),\; P(-)P(\text{great}|-) \,\right] = \left[\, 1/4,\; 1/8 \,\right]$, which normalizes to $\left[\, 2/3,\; 1/3 \,\right]$
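A sketch of "read counts off the data + normalize" on the eight examples above (unsmoothed, presence/absence features, so the fractions match the slide; the function and variable names are mine):

```python
from collections import Counter

def train_naive_bayes(examples):
    """MLE for Naive Bayes: read counts off the data and normalize.
    examples: list of (list_of_words, label) pairs, labels in {0, 1}."""
    label_counts = Counter()
    word_counts = {0: Counter(), 1: Counter()}
    for words, y in examples:
        label_counts[y] += 1
        for w in set(words):               # presence/absence of each word
            word_counts[y][w] += 1
    prior = {y: label_counts[y] / sum(label_counts.values()) for y in (0, 1)}
    likelihood = {y: {w: c / label_counts[y] for w, c in word_counts[y].items()}
                  for y in (0, 1)}
    return prior, likelihood

positives = ["this movie was great! would watch again",
             "I liked it well enough for an action flick",
             "I expected a great film and left happy",
             "brilliant directing and stunning visuals"]
negatives = ["that film was awful, I'll never watch again",
             "I didn't really like that movie",
             "dry and a bit distasteful, it misses the mark",
             "great potential but ended up being a flop"]
tokenize = lambda s: s.replace("!", " ").replace(",", " ").split()
data = ([(tokenize(s), 1) for s in positives] +
        [(tokenize(s), 0) for s in negatives])
prior, likelihood = train_naive_bayes(data)
print(prior[1], likelihood[1]["great"], likelihood[0]["great"])  # 0.5 0.5 0.25
```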
Naive Bayes: Summary
‣ Model: $P(x, y) = P(y) \prod_{i=1}^n P(x_i|y)$   [model structure: $y$ generates each $x_i$, $i = 1 \ldots n$]
‣ Learning: maximize $P(x, y)$ by reading counts off the data
‣ Inference: $\mathrm{argmax}_y \log P(y|x) = \mathrm{argmax}_y \left[ \log P(y) + \sum_{i=1}^n \log P(x_i|y) \right]$
‣ Alternatively: $\log P(y = +|x) - \log P(y = -|x) > 0$
$\Longleftrightarrow \log \dfrac{P(y = +)}{P(y = -)} + \sum_{i=1}^n \log \dfrac{P(x_i|y = +)}{P(x_i|y = -)} > 0$
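A matching inference sketch in log space, continuing the hypothetical `train_naive_bayes` output from above (the `floor` argument is my addition to keep the log defined for unseen words, since the slide's counts are unsmoothed):

```python
import math

def predict_naive_bayes(words, prior, likelihood, floor=1e-4):
    """argmax_y [log P(y) + sum_i log P(x_i|y)], computed in log space."""
    best_y, best_score = None, -math.inf
    for y in (0, 1):
        score = math.log(prior[y])
        for w in set(words):
            score += math.log(likelihood[y].get(w, floor))
        if score > best_score:
            best_y, best_score = y, score
    return best_y

print(predict_naive_bayes("it was great".split(), prior, likelihood))  # 1
```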
Problems with Naive Bayes
‣ Naive Bayes is naive, but another problem is that it's generative: spends capacity modeling P(x, y), when what we care about is P(y|x)
‣ Correlated features compound: beautiful and gorgeous are not independent!
the film was beautiful, stunning cinematography and gorgeous sets, but boring → −
$P(x_\text{beautiful}|+) = 0.1$   $P(x_\text{stunning}|+) = 0.1$   $P(x_\text{gorgeous}|+) = 0.1$   $P(x_\text{boring}|+) = 0.01$
$P(x_\text{beautiful}|-) = 0.01$   $P(x_\text{stunning}|-) = 0.01$   $P(x_\text{gorgeous}|-) = 0.01$   $P(x_\text{boring}|-) = 0.1$
(The three correlated positive words multiply to swamp boring: $0.1^3 \cdot 0.01 = 10^{-5}$ for + vs. $0.01^3 \cdot 0.1 = 10^{-7}$ for −, so Naive Bayes confidently predicts + even though the label is −.)
‣ Discriminative models model P(y|x) directly (SVMs, most neural networks, …)
Logistic Regression
$P(y = +|x) = \mathrm{logistic}(w^\top x) = \dfrac{\exp\left(\sum_{i=1}^n w_i x_i\right)}{1 + \exp\left(\sum_{i=1}^n w_i x_i\right)}$
‣ To learn weights: maximize discriminative log likelihood of data P(y|x)
$\mathcal{L}(x_j, y_j = +) = \log P(y_j = +|x_j) = \sum_{i=1}^n w_i x_{ji} - \log\left(1 + \exp\left(\sum_{i=1}^n w_i x_{ji}\right)\right)$   (sum over features)
Logistic Regression
$\mathcal{L}(x_j, y_j = +) = \log P(y_j = +|x_j) = \sum_{i=1}^n w_i x_{ji} - \log\left(1 + \exp\left(\sum_{i=1}^n w_i x_{ji}\right)\right)$

$\dfrac{\partial \mathcal{L}(x_j, y_j)}{\partial w_i} = x_{ji} - \dfrac{\partial}{\partial w_i} \log\left(1 + \exp\left(\sum_{i=1}^n w_i x_{ji}\right)\right)$
$= x_{ji} - \dfrac{1}{1 + \exp\left(\sum_{i=1}^n w_i x_{ji}\right)} \dfrac{\partial}{\partial w_i}\left(1 + \exp\left(\sum_{i=1}^n w_i x_{ji}\right)\right)$   (deriv of log)
$= x_{ji} - \dfrac{1}{1 + \exp\left(\sum_{i=1}^n w_i x_{ji}\right)}\, x_{ji} \exp\left(\sum_{i=1}^n w_i x_{ji}\right)$   (deriv of exp)
$= x_{ji} - x_{ji}\dfrac{\exp\left(\sum_{i=1}^n w_i x_{ji}\right)}{1 + \exp\left(\sum_{i=1}^n w_i x_{ji}\right)} = x_{ji}(1 - P(y_j = +|x_j))$
Logistic Regression
‣ Gradient of $w_i$ on positive example: $x_{ji}(1 - P(y_j = +|x_j))$
If P(+) is close to 1, make very little update; otherwise make $w_i$ look more like $x_{ji}$, which will increase P(+)
‣ Gradient of $w_i$ on negative example: $x_{ji}(-P(y_j = +|x_j))$
If P(+) is close to 0, make very little update; otherwise make $w_i$ look less like $x_{ji}$, which will decrease P(+)
‣ Can combine these gradients as $x_j(y_j - P(y_j = 1|x_j))$
‣ Recall that $y_j = 1$ for positive instances, $y_j = 0$ for negative instances.
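A sketch of one epoch of stochastic gradient ascent with this combined update (dense NumPy vectors and an arbitrary step size for clarity; real bag-of-words features would be sparse):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_sgd_epoch(w, X, y, step_size=0.1):
    """One pass of stochastic gradient ascent on the LR log likelihood.
    Per-example update: w += step_size * x * (y - P(y=1|x))."""
    for x_j, y_j in zip(X, y):
        p = logistic(np.dot(w, x_j))
        w += step_size * x_j * (y_j - p)
    return w
```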
Regularization
‣ Regularizing an objective can mean many things, including an L2-norm penalty to the weights:
$\sum_{j=1}^m \mathcal{L}(x_j, y_j) - \lambda \|w\|_2^2$
‣ Keeping weights small can prevent overfitting
‣ For most of the NLP models we build, explicit regularization isn't necessary:
‣ Early stopping
‣ Large numbers of sparse features are hard to overfit in a really bad way
‣ For neural networks: dropout and gradient clipping
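For reference (this step is mine, but it follows directly from the derivation above), the L2 penalty just adds a weight-shrinking term to the per-feature gradient; implementations differ in how they apportion the penalty across examples:

$$\frac{\partial}{\partial w_i}\left[\mathcal{L}(x_j, y_j) - \lambda\|w\|_2^2\right] = x_{ji}\,(y_j - P(y_j = 1|x_j)) - 2\lambda w_i$$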
Logistic Regression: Summary
‣ Model: $P(y = +|x) = \dfrac{\exp\left(\sum_{i=1}^n w_i x_i\right)}{1 + \exp\left(\sum_{i=1}^n w_i x_i\right)}$
‣ Learning: gradient ascent on the (regularized) discriminative log-likelihood
‣ Inference: $\mathrm{argmax}_y P(y|x)$, fundamentally same as Naive Bayes
$P(y = 1|x) \geq 0.5 \Longleftrightarrow w^\top x \geq 0$
Perceptron/SVM

Perceptron
‣ Simple error-driven learning approach similar to logistic regression
‣ Decision rule: $w^\top x > 0$
‣ If incorrect: if positive, $w \leftarrow w + x$; if negative, $w \leftarrow w - x$
‣ Guaranteed to eventually separate the data if the data are separable
‣ Compare logistic regression: $w \leftarrow w + x(1 - P(y = 1|x))$ on positives, $w \leftarrow w - x\,P(y = 1|x)$ on negatives
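A sketch of the perceptron loop (the fixed epoch count is arbitrary; on separable data the updates eventually stop):

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Error-driven perceptron: update only on misclassified examples.
    X: 2D array of feature vectors; y: labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_j, y_j in zip(X, y):
            pred = 1 if np.dot(w, x_j) > 0 else 0
            if pred != y_j:
                w += x_j if y_j == 1 else -x_j
    return w
```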
Support Vector Machines
‣ Many separating hyperplanes: is there a best one?
[figure: linearly separable + and − points with a candidate hyperplane]

Support Vector Machines
‣ Many separating hyperplanes: is there a best one?
[figure: the max-margin hyperplane, with the margin drawn to the nearest points]
Support Vector Machines
‣ Constraint formulation: find $w$ via the following quadratic program:
Minimize $\|w\|_2^2$
s.t. $\forall j \;\; w^\top x_j \geq 1$ if $y_j = 1$; $\;w^\top x_j \leq -1$ if $y_j = 0$
(minimizing norm with fixed margin $\Leftrightarrow$ maximizing margin)
‣ As a single constraint: $\forall j \;\; (2y_j - 1)(w^\top x_j) \geq 1$
‣ Generally no solution (data is generally non-separable), so we need slack!
N-Slack SVMs
Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^m \xi_j$
s.t. $\forall j \;\; (2y_j - 1)(w^\top x_j) \geq 1 - \xi_j$, $\;\forall j \;\; \xi_j \geq 0$
‣ The $\xi_j$ are a "fudge factor" to make all constraints satisfied
‣ Take the gradient of the objective:
$\frac{\partial}{\partial w_i} \xi_j = 0$ if $\xi_j = 0$
$\frac{\partial}{\partial w_i} \xi_j = (2y_j - 1) x_{ji}$ if $\xi_j > 0$   ($= x_{ji}$ if $y_j = 1$, $-x_{ji}$ if $y_j = 0$)
‣ Looks like the perceptron! But updates more frequently
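A sketch of the resulting subgradient step (step size and $\lambda$ are placeholder values). Note the update fires whenever the margin-1 condition is violated, not just on outright mistakes, which is the "updates more frequently" point:

```python
import numpy as np

def svm_sgd_epoch(w, X, y, step_size=0.1, lam=1e-3):
    """One pass of subgradient descent on the slack-SVM objective."""
    for x_j, y_j in zip(X, y):
        sign = 2 * y_j - 1                 # maps {0, 1} -> {-1, +1}
        if sign * np.dot(w, x_j) < 1:      # margin violated: xi_j > 0
            w += step_size * sign * x_j
        w -= step_size * 2 * lam * w       # gradient of lambda * ||w||^2
    return w
```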
Gradients on Positive Examples
Logistic regression: $x(1 - P(y = 1|x)) = x(1 - \mathrm{logistic}(w^\top x))$
Perceptron: $x$ if $w^\top x < 0$, else $0$
SVM (ignoring regularizer): $x$ if $w^\top x < 1$, else $0$
[figure: hinge (SVM), logistic, perceptron, and 0-1 losses as functions of $w^\top x$]
*gradients are for maximizing things, which is why they are flipped
Comparing Gradient Updates (Reference)
Logistic regression (unregularized): $x(y - P(y = 1|x)) = x(y - \mathrm{logistic}(w^\top x))$
Perceptron: $(2y - 1)x$ if classified incorrectly, $0$ else
SVM: $(2y - 1)x$ if not classified correctly with margin of 1, $0$ else
($y = 1$ for pos, $0$ for neg)
Optimization (next time…)
‣ Range of techniques from simple gradient descent (works pretty well) to more complex methods (can work better)
‣ Most methods boil down to: take a gradient and a step size, apply the gradient update times the step size; incorporate estimated curvature information to make the update more effective
Sentiment Analysis
this movie was great! would watch again → +
the movie was gross and overwrought, but I liked it → +
this movie was not really very enjoyable → −
‣ Bag-of-words doesn't seem sufficient (discourse structure, negation)
‣ There are some ways around this: extract bigram feature for "not X" for all X following the not
Bo Pang, Lillian Lee, Shivakumar Vaithyanathan (2002)
Sentiment Analysis
‣ Simple feature sets can do pretty well!
Bo Pang, Lillian Lee, Shivakumar Vaithyanathan (2002)

Sentiment Analysis
[table: results from Wang and Manning (2012)]
‣ Before neural nets had taken off, results weren't that great
‣ Naive Bayes is doing well! Ng and Jordan (2002): NB can be better for small data
‣ 81.5 → 89.5: Kim (2014), CNNs
Recap
‣ Logistic regression: $P(y = 1|x) = \dfrac{\exp\left(\sum_{i=1}^n w_i x_i\right)}{1 + \exp\left(\sum_{i=1}^n w_i x_i\right)}$
Decision rule: $P(y = 1|x) \geq 0.5 \Longleftrightarrow w^\top x \geq 0$
Gradient (unregularized): $x(y - P(y = 1|x))$
‣ SVM:
Decision rule: $w^\top x \geq 0$
(Sub)gradient (unregularized): $0$ if correct with margin of 1, else $x(2y - 1)$
Recap
‣ Logistic regression, SVM, and perceptron are closely related
‣ SVM and perceptron inference require taking maxes; logistic regression has a similar update but is "softer" due to its probabilistic nature
‣ All gradient updates: "make it look more like the right thing and less like the wrong thing"