CS395T: Structured Models for NLP
Lecture 2: Binary Classification
Greg Durrett
Some slides adapted from Vivek Srikumar, University of Utah
Administrivia
‣ Course enrollment
‣ OHs this week: Jifan 1pm-2pm Tues (today) in GDC 1.304, TA desk #1; Greg 11am-12pm Weds + 10am-11am Fri in GDC 3.420
‣ Readings on course website
‣ Mini 1 is out, due September 11
‣ Feel free to extend the code as needed; optimizers, featurization, etc. aren't set in stone
This Lecture
‣ Linear classification fundamentals
‣ Three discriminative models: logistic regression, perceptron, SVM
‣ Naive Bayes, maximum likelihood in generative models
‣ Different motivations but very similar update rules / inference!
Classification
‣ Data point $x$ with label $y \in \{0, 1\}$
‣ Embed data point in a feature space $f(x) \in \mathbb{R}^n$, but in this lecture $x$ and $f(x)$ are interchangeable
[figure: positive (+) and negative (−) points separated by a hyperplane in feature space]
‣ Linear decision rule: $w^\top f(x) + b > 0$
‣ Can delete bias if we augment feature space: $w^\top f(x) > 0$, e.g. $f(x) = [0.5, 1.6, 0.3]$ becomes $[0.5, 1.6, 0.3, 1]$
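A minimal sketch of this decision rule in Python (the weights and feature values below are invented purely for illustration):

```python
import numpy as np

def augment(f_x):
    """Append a constant 1 so the bias b can live inside w."""
    return np.append(f_x, 1.0)

def predict(w, f_x):
    """Linear decision rule: 1 if w . f(x) > 0, else 0."""
    return 1 if np.dot(w, augment(f_x)) > 0 else 0

w = np.array([1.0, -2.0, 0.5, 0.1])  # last entry plays the role of b
print(predict(w, np.array([0.5, 1.6, 0.3])))
```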
Linear functions are powerful!
[figure: left, data plotted with $f(x) = [x_1, x_2]$, where no linear separator exists (???); right, the same data plotted against $x_1 x_2$ and $x_1$, where it is linearly separable]
‣ $f(x) = [x_1, x_2]$ vs. $f(x) = [x_1, x_2, x_1^2, x_2^2, x_1 x_2]$
‣ "Kernel trick" does this for "free," but is too expensive to use in NLP applications: training is $O(n^2)$ instead of $O(n \cdot (\text{num feats}))$
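A concrete toy version of the feature map above (my example, not from the slides): XOR-style points that no line separates in $(x_1, x_2)$ become separable once $x_1 x_2$ is a coordinate, e.g. by putting weight only on that coordinate.

```python
import numpy as np

def quadratic_features(x1, x2):
    """f(x) = [x1, x2, x1^2, x2^2, x1*x2] from the slide."""
    return np.array([x1, x2, x1**2, x2**2, x1 * x2])

# XOR-style data: label 1 iff x1 and x2 have the same sign
points = [(-1, -1, 1), (-1, 1, 0), (1, -1, 0), (1, 1, 1)]
w = np.array([0, 0, 0, 0, 1.0])  # weight only on the x1*x2 coordinate
for x1, x2, label in points:
    pred = 1 if np.dot(w, quadratic_features(x1, x2)) > 0 else 0
    print((x1, x2), label, pred)  # predictions match labels
```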
Classification: Sentiment Analysis
this movie was great! would watch again → Positive
that film was awful, I'll never watch again → Negative
‣ Surface cues can basically tell you what's going on here: presence or absence of certain words (great, awful)
‣ Steps to classification:
‣ Turn examples like this into feature vectors
‣ Pick a model / learning algorithm
‣ Train weights on data to get our classifier
Feature Representation
this movie was great! would watch again → Positive
‣ Convert this example to a vector using bag-of-words features
‣ Requires indexing the features (mapping them to axes)
f(x) = [ [contains the]  [contains a]  [contains was]  [contains movie]  [contains film]  … ]
          position 0      position 1    position 2      position 3        position 4
     = [ 0, 0, 1, 1, 0, … ]
‣ More sophisticated feature mappings possible (tf-idf), as well as lots of other features: character n-grams, parts of speech, lemmas, …
‣ Very large vector space (size of vocabulary), sparse features
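A sketch of this featurization (the `Indexer` class here is a stand-in I wrote; Mini 1's code has its own utilities):

```python
from collections import defaultdict

class Indexer:
    """Maps feature names (words) to axes of the feature vector."""
    def __init__(self):
        self.index = {}
    def get(self, feat):
        if feat not in self.index:
            self.index[feat] = len(self.index)
        return self.index[feat]

def bag_of_words(sentence, indexer):
    """Sparse bag-of-words: dict from feature index to count."""
    feats = defaultdict(float)
    for word in sentence.lower().split():
        feats[indexer.get("contains " + word)] += 1.0
    return feats

indexer = Indexer()
print(bag_of_words("this movie was great ! would watch again", indexer))
```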
Naive Bayes
‣ Data point $x = (x_1, \ldots, x_n)$, label $y \in \{0, 1\}$
‣ Formulate a probabilistic model that places a distribution $P(x, y)$
[model structure: label $y$ generates each feature $x_i$, $i = 1 \ldots n$]
‣ Compute $P(y|x)$, predict $\mathrm{argmax}_y\, P(y|x)$ to classify

$P(y|x) = \dfrac{P(y) P(x|y)}{P(x)}$   (Bayes' Rule)
$\propto P(y) P(x|y)$   (constant: irrelevant for finding the max)
$= P(y) \prod_{i=1}^n P(x_i|y)$   ("naive" assumption)

$\mathrm{argmax}_y P(y|x) = \mathrm{argmax}_y \log P(y|x) = \mathrm{argmax}_y \left[ \log P(y) + \sum_{i=1}^n \log P(x_i|y) \right]$   — linear model!
Naive Bayes Example
"it was great"
$P(y|x) \propto P(y) \prod_{i=1}^n P(x_i|y)$
$\mathrm{argmax}_y P(y|x) = \mathrm{argmax}_y \log P(y|x) = \mathrm{argmax}_y \left[ \log P(y) + \sum_{i=1}^n \log P(x_i|y) \right]$
Maximum Likelihood Estimation
‣ Data points $(x_j, y_j)$ provided ($j$ indexes over examples)
‣ Find values of $P(y), P(x_i|y)$ that maximize data likelihood (generative):
$\prod_{j=1}^m P(y_j, x_j) = \prod_{j=1}^m P(y_j) \left[ \prod_{i=1}^n P(x_{ji}|y_j) \right]$
(data points $j$; features $i$; $x_{ji}$ is the $i$th feature of the $j$th example)
Maximum Likelihood Estimation
‣ Imagine a coin flip which is heads with probability $p$
‣ Observe (H, H, H, T) and maximize likelihood: $\prod_{j=1}^m P(y_j) = p^3(1-p)$
‣ Easier: maximize log likelihood: $\sum_{j=1}^m \log P(y_j) = 3 \log p + \log(1-p)$
[figure: log likelihood as a function of $p$ on $[0, 1]$, maximized at $P(H) = 0.75$]
‣ Maximum likelihood parameters for binomial/multinomial = read counts off of the data + normalize
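Filling in the one-line calculation behind $P(H) = 0.75$: set the derivative of the log likelihood to zero.

$$\frac{d}{dp}\Big[3 \log p + \log(1-p)\Big] = \frac{3}{p} - \frac{1}{1-p} = 0 \;\Longrightarrow\; 3(1-p) = p \;\Longrightarrow\; p = \frac{3}{4}$$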
Maximum Likelihood Estimation
‣ Data points $(x_j, y_j)$ provided ($j$ indexes over examples)
‣ Find values of $P(y), P(x_i|y)$ that maximize data likelihood (generative):
$\prod_{j=1}^m P(y_j, x_j) = \prod_{j=1}^m P(y_j) \left[ \prod_{i=1}^n P(x_{ji}|y_j) \right]$
‣ Equivalent to maximizing logarithm of data likelihood:
$\sum_{j=1}^m \log P(y_j, x_j) = \sum_{j=1}^m \left[ \log P(y_j) + \sum_{i=1}^n \log P(x_{ji}|y_j) \right]$
($x_{ji}$ is the $i$th feature of the $j$th example)
Maximum Likelihood for Naive Bayes
+ this movie was great! would watch again
− that film was awful, I'll never watch again
− I didn't really like that movie
− dry and a bit distasteful, it misses the mark
− great potential but ended up being a flop
+ I liked it well enough for an action flick
+ I expected a great film and left happy
+ brilliant directing and stunning visuals

$P(\text{great}|+) = \frac{1}{2}$   $P(\text{great}|-) = \frac{1}{4}$   $P(+) = \frac{1}{2}$   $P(-) = \frac{1}{2}$

"it was great":  $P(y|x) \propto \left[\, P(+)P(\text{great}|+),\; P(-)P(\text{great}|-) \,\right] = \left[\, 1/4,\; 1/8 \,\right]$, which normalizes to $\left[\, 2/3,\; 1/3 \,\right]$
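A sketch of "read counts off the data + normalize" on the eight examples above (unsmoothed, presence/absence features, so the fractions match the slide; the function and variable names are mine):

```python
from collections import Counter

def train_naive_bayes(examples):
    """MLE for Naive Bayes: read counts off the data and normalize.
    examples: list of (list_of_words, label) pairs, labels in {0, 1}."""
    label_counts = Counter()
    word_counts = {0: Counter(), 1: Counter()}
    for words, y in examples:
        label_counts[y] += 1
        for w in set(words):               # presence/absence of each word
            word_counts[y][w] += 1
    prior = {y: label_counts[y] / sum(label_counts.values()) for y in (0, 1)}
    likelihood = {y: {w: c / label_counts[y] for w, c in word_counts[y].items()}
                  for y in (0, 1)}
    return prior, likelihood

positives = ["this movie was great! would watch again",
             "I liked it well enough for an action flick",
             "I expected a great film and left happy",
             "brilliant directing and stunning visuals"]
negatives = ["that film was awful, I'll never watch again",
             "I didn't really like that movie",
             "dry and a bit distasteful, it misses the mark",
             "great potential but ended up being a flop"]
tokenize = lambda s: s.replace("!", " ").replace(",", " ").split()
data = ([(tokenize(s), 1) for s in positives] +
        [(tokenize(s), 0) for s in negatives])
prior, likelihood = train_naive_bayes(data)
print(prior[1], likelihood[1]["great"], likelihood[0]["great"])  # 0.5 0.5 0.25
```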
Naive Bayes: Summary
‣ Model: $P(x, y) = P(y) \prod_{i=1}^n P(x_i|y)$   [model structure: $y$ generates each $x_i$, $i = 1 \ldots n$]
‣ Learning: maximize $P(x, y)$ by reading counts off the data
‣ Inference: $\mathrm{argmax}_y \log P(y|x) = \mathrm{argmax}_y \left[ \log P(y) + \sum_{i=1}^n \log P(x_i|y) \right]$
‣ Alternatively: $\log P(y = +|x) - \log P(y = -|x) > 0$
$\Longleftrightarrow \log \dfrac{P(y = +)}{P(y = -)} + \sum_{i=1}^n \log \dfrac{P(x_i|y = +)}{P(x_i|y = -)} > 0$
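A matching inference sketch in log space, continuing the hypothetical `train_naive_bayes` output from above (the `floor` argument is my addition to keep the log defined for unseen words, since the slide's counts are unsmoothed):

```python
import math

def predict_naive_bayes(words, prior, likelihood, floor=1e-4):
    """argmax_y [log P(y) + sum_i log P(x_i|y)], computed in log space."""
    best_y, best_score = None, -math.inf
    for y in (0, 1):
        score = math.log(prior[y])
        for w in set(words):
            score += math.log(likelihood[y].get(w, floor))
        if score > best_score:
            best_y, best_score = y, score
    return best_y

print(predict_naive_bayes("it was great".split(), prior, likelihood))  # 1
```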
Problems with Naive Bayes
‣ Naive Bayes is naive, but another problem is that it's generative: spends capacity modeling P(x, y), when what we care about is P(y|x)
‣ Correlated features compound: beautiful and gorgeous are not independent!
the film was beautiful, stunning cinematography and gorgeous sets, but boring → −
$P(x_\text{beautiful}|+) = 0.1$   $P(x_\text{stunning}|+) = 0.1$   $P(x_\text{gorgeous}|+) = 0.1$   $P(x_\text{boring}|+) = 0.01$
$P(x_\text{beautiful}|-) = 0.01$   $P(x_\text{stunning}|-) = 0.01$   $P(x_\text{gorgeous}|-) = 0.01$   $P(x_\text{boring}|-) = 0.1$
(The three correlated positive words multiply to swamp boring: $0.1^3 \cdot 0.01 = 10^{-5}$ for + vs. $0.01^3 \cdot 0.1 = 10^{-7}$ for −, so Naive Bayes confidently predicts + even though the label is −.)
‣ Discriminative models model P(y|x) directly (SVMs, most neural networks, …)
Logistic Regression
$P(y = +|x) = \mathrm{logistic}(w^\top x) = \dfrac{\exp\left(\sum_{i=1}^n w_i x_i\right)}{1 + \exp\left(\sum_{i=1}^n w_i x_i\right)}$
‣ To learn weights: maximize discriminative log likelihood of data P(y|x)
$\mathcal{L}(x_j, y_j = +) = \log P(y_j = +|x_j) = \sum_{i=1}^n w_i x_{ji} - \log\left(1 + \exp\left(\sum_{i=1}^n w_i x_{ji}\right)\right)$   (sum over features)
Logistic Regression
$\mathcal{L}(x_j, y_j = +) = \log P(y_j = +|x_j) = \sum_{i=1}^n w_i x_{ji} - \log\left(1 + \exp\left(\sum_{i=1}^n w_i x_{ji}\right)\right)$

$\dfrac{\partial \mathcal{L}(x_j, y_j)}{\partial w_i} = x_{ji} - \dfrac{\partial}{\partial w_i} \log\left(1 + \exp\left(\sum_{i=1}^n w_i x_{ji}\right)\right)$
$= x_{ji} - \dfrac{1}{1 + \exp\left(\sum_{i=1}^n w_i x_{ji}\right)} \dfrac{\partial}{\partial w_i}\left(1 + \exp\left(\sum_{i=1}^n w_i x_{ji}\right)\right)$   (deriv of log)
$= x_{ji} - \dfrac{1}{1 + \exp\left(\sum_{i=1}^n w_i x_{ji}\right)}\, x_{ji} \exp\left(\sum_{i=1}^n w_i x_{ji}\right)$   (deriv of exp)
$= x_{ji} - x_{ji}\dfrac{\exp\left(\sum_{i=1}^n w_i x_{ji}\right)}{1 + \exp\left(\sum_{i=1}^n w_i x_{ji}\right)} = x_{ji}(1 - P(y_j = +|x_j))$
Logistic Regression
‣ Gradient of $w_i$ on positive example: $x_{ji}(1 - P(y_j = +|x_j))$
If P(+) is close to 1, make very little update; otherwise make $w_i$ look more like $x_{ji}$, which will increase P(+)
‣ Gradient of $w_i$ on negative example: $x_{ji}(-P(y_j = +|x_j))$
If P(+) is close to 0, make very little update; otherwise make $w_i$ look less like $x_{ji}$, which will decrease P(+)
‣ Can combine these gradients as $x_j(y_j - P(y_j = 1|x_j))$
‣ Recall that $y_j = 1$ for positive instances, $y_j = 0$ for negative instances.
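A sketch of one epoch of stochastic gradient ascent with this combined update (dense NumPy vectors and an arbitrary step size for clarity; real bag-of-words features would be sparse):

```python
import numpy as np

def logistic(z):
    return 1.0 / (1.0 + np.exp(-z))

def lr_sgd_epoch(w, X, y, step_size=0.1):
    """One pass of stochastic gradient ascent on the LR log likelihood.
    Per-example update: w += step_size * x * (y - P(y=1|x))."""
    for x_j, y_j in zip(X, y):
        p = logistic(np.dot(w, x_j))
        w += step_size * x_j * (y_j - p)
    return w
```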
Regularization
‣ Regularizing an objective can mean many things, including an L2-norm penalty to the weights:
$\sum_{j=1}^m \mathcal{L}(x_j, y_j) - \lambda \|w\|_2^2$
‣ Keeping weights small can prevent overfitting
‣ For most of the NLP models we build, explicit regularization isn't necessary:
‣ Early stopping
‣ Large numbers of sparse features are hard to overfit in a really bad way
‣ For neural networks: dropout and gradient clipping
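For reference (this step is mine, but it follows directly from the derivation above), the L2 penalty just adds a weight-shrinking term to the per-feature gradient; implementations differ in how they apportion the penalty across examples:

$$\frac{\partial}{\partial w_i}\left[\mathcal{L}(x_j, y_j) - \lambda\|w\|_2^2\right] = x_{ji}\,(y_j - P(y_j = 1|x_j)) - 2\lambda w_i$$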
Logistic Regression: Summary
‣ Model: $P(y = +|x) = \dfrac{\exp\left(\sum_{i=1}^n w_i x_i\right)}{1 + \exp\left(\sum_{i=1}^n w_i x_i\right)}$
‣ Learning: gradient ascent on the (regularized) discriminative log-likelihood
‣ Inference: $\mathrm{argmax}_y P(y|x)$, fundamentally same as Naive Bayes
$P(y = 1|x) \geq 0.5 \Longleftrightarrow w^\top x \geq 0$
Perceptron/SVM

Perceptron
‣ Simple error-driven learning approach similar to logistic regression
‣ Decision rule: $w^\top x > 0$
‣ If incorrect: if positive, $w \leftarrow w + x$; if negative, $w \leftarrow w - x$
‣ Guaranteed to eventually separate the data if the data are separable
‣ Compare logistic regression: $w \leftarrow w + x(1 - P(y = 1|x))$ on positives, $w \leftarrow w - x\,P(y = 1|x)$ on negatives
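A sketch of the perceptron loop (the fixed epoch count is arbitrary; on separable data the updates eventually stop):

```python
import numpy as np

def train_perceptron(X, y, epochs=10):
    """Error-driven perceptron: update only on misclassified examples.
    X: 2D array of feature vectors; y: labels in {0, 1}."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_j, y_j in zip(X, y):
            pred = 1 if np.dot(w, x_j) > 0 else 0
            if pred != y_j:
                w += x_j if y_j == 1 else -x_j
    return w
```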
Support Vector Machines
‣ Many separating hyperplanes: is there a best one?
[figure: linearly separable + and − points with a candidate hyperplane]

Support Vector Machines
‣ Many separating hyperplanes: is there a best one?
[figure: the max-margin hyperplane, with the margin drawn to the nearest points]
Support Vector Machines
‣ Constraint formulation: find $w$ via the following quadratic program:
Minimize $\|w\|_2^2$
s.t. $\forall j \;\; w^\top x_j \geq 1$ if $y_j = 1$; $\;w^\top x_j \leq -1$ if $y_j = 0$
(minimizing norm with fixed margin $\Leftrightarrow$ maximizing margin)
‣ As a single constraint: $\forall j \;\; (2y_j - 1)(w^\top x_j) \geq 1$
‣ Generally no solution (data is generally non-separable), so we need slack!
N-Slack SVMs
Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^m \xi_j$
s.t. $\forall j \;\; (2y_j - 1)(w^\top x_j) \geq 1 - \xi_j$, $\;\forall j \;\; \xi_j \geq 0$
‣ The $\xi_j$ are a "fudge factor" to make all constraints satisfied
‣ Take the gradient of the objective:
$\frac{\partial}{\partial w_i} \xi_j = 0$ if $\xi_j = 0$
$\frac{\partial}{\partial w_i} \xi_j = (2y_j - 1) x_{ji}$ if $\xi_j > 0$   ($= x_{ji}$ if $y_j = 1$, $-x_{ji}$ if $y_j = 0$)
‣ Looks like the perceptron! But updates more frequently
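A sketch of the resulting subgradient step (step size and $\lambda$ are placeholder values). Note the update fires whenever the margin-1 condition is violated, not just on outright mistakes, which is the "updates more frequently" point:

```python
import numpy as np

def svm_sgd_epoch(w, X, y, step_size=0.1, lam=1e-3):
    """One pass of subgradient descent on the slack-SVM objective."""
    for x_j, y_j in zip(X, y):
        sign = 2 * y_j - 1                 # maps {0, 1} -> {-1, +1}
        if sign * np.dot(w, x_j) < 1:      # margin violated: xi_j > 0
            w += step_size * sign * x_j
        w -= step_size * 2 * lam * w       # gradient of lambda * ||w||^2
    return w
```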
Gradients on Positive Examples
Logistic regression: $x(1 - P(y = 1|x)) = x(1 - \mathrm{logistic}(w^\top x))$
Perceptron: $x$ if $w^\top x < 0$, else $0$
SVM (ignoring regularizer): $x$ if $w^\top x < 1$, else $0$
[figure: hinge (SVM), logistic, perceptron, and 0-1 losses as functions of $w^\top x$]
*gradients are for maximizing things, which is why they are flipped
Comparing Gradient Updates (Reference)
Logistic regression (unregularized): $x(y - P(y = 1|x)) = x(y - \mathrm{logistic}(w^\top x))$
Perceptron: $(2y - 1)x$ if classified incorrectly, $0$ else
SVM: $(2y - 1)x$ if not classified correctly with margin of 1, $0$ else
($y = 1$ for pos, $0$ for neg)
Optimization (next time…)
‣ Range of techniques from simple gradient descent (works pretty well) to more complex methods (can work better)
‣ Most methods boil down to: take a gradient and a step size, apply the gradient update times the step size; incorporate estimated curvature information to make the update more effective
Sentiment Analysis
this movie was great! would watch again → +
the movie was gross and overwrought, but I liked it → +
this movie was not really very enjoyable → −
‣ Bag-of-words doesn't seem sufficient (discourse structure, negation)
‣ There are some ways around this: extract bigram feature for "not X" for all X following the not
Bo Pang, Lillian Lee, Shivakumar Vaithyanathan (2002)
Sentiment Analysis
‣ Simple feature sets can do pretty well!
Bo Pang, Lillian Lee, Shivakumar Vaithyanathan (2002)

Sentiment Analysis
[table: results from Wang and Manning (2012)]
‣ Before neural nets had taken off, results weren't that great
‣ Naive Bayes is doing well! Ng and Jordan (2002): NB can be better for small data
‣ 81.5 → 89.5: Kim (2014), CNNs
Recap
‣ Logistic regression: $P(y = 1|x) = \dfrac{\exp\left(\sum_{i=1}^n w_i x_i\right)}{1 + \exp\left(\sum_{i=1}^n w_i x_i\right)}$
Decision rule: $P(y = 1|x) \geq 0.5 \Longleftrightarrow w^\top x \geq 0$
Gradient (unregularized): $x(y - P(y = 1|x))$
‣ SVM:
Decision rule: $w^\top x \geq 0$
(Sub)gradient (unregularized): $0$ if correct with margin of 1, else $x(2y - 1)$
Recap
‣ Logistic regression, SVM, and perceptron are closely related
‣ SVM and perceptron inference require taking maxes; logistic regression has a similar update but is "softer" due to its probabilistic nature
‣ All gradient updates: "make it look more like the right thing and less like the wrong thing"