Dan Jurafsky
Text Classification
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …
Text Classification: definition
• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
• Output: a predicted class c ∈ C
Classification Methods: Supervised Machine Learning
• Input:
  • a document d
  • a fixed set of classes C = {c1, c2, …, cJ}
  • a training set of m hand-labeled documents (d1,c1), …, (dm,cm)
• Output:
  • a learned classifier γ: d → c
Classification Methods: Supervised Machine Learning
• Any kind of classifier
  • Naïve Bayes
  • Logistic regression
  • Support-vector machines
  • k-Nearest Neighbors
  • …
Naïve Bayes Intuition
• Simple ("naïve") classification method based on Bayes rule
• Relies on a very simple representation of the document
  • Bag of words
The bag of words representation

γ( "I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet." ) = c
The bag of words representation

γ( { great: 2, love: 2, recommend: 1, laugh: 1, happy: 1, … } ) = c
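As a quick illustration, the bag-of-words view can be built in a few lines of Python; the crude tokenizer here is my own simplification, not part of the slides:

```python
from collections import Counter

review = ("I love this movie! It's sweet, but with satirical humor. "
          "I would recommend it to just about anyone.")

# Lowercase and strip surrounding punctuation -- a deliberately crude tokenizer.
tokens = [t.strip(".,!?").lower() for t in review.split()]
bag = Counter(tokens)

# Word order is discarded; only the counts survive.
print(bag["i"], bag["love"])  # 2 1
```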
Multinomial Naïve Bayes Independence Assumptions

P(x1, x2, …, xn | c)

• Bag of Words assumption: assume position doesn't matter
• Conditional Independence: assume the feature probabilities P(xi | cj) are independent given the class c:

P(x1, …, xn | c) = P(x1 | c) · P(x2 | c) · P(x3 | c) · … · P(xn | c)
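A toy numeric check of this factorization, with made-up conditional probabilities for one class:

```python
# Hypothetical per-word conditionals P(word | c) for a single class c.
p_word_given_c = {"chinese": 0.5, "tokyo": 0.25, "japan": 0.25}

doc = ["chinese", "chinese", "tokyo"]

# Under conditional independence, the document likelihood is just the
# product of the per-position conditionals.
likelihood = 1.0
for w in doc:
    likelihood *= p_word_given_c[w]

print(likelihood)  # 0.5 * 0.5 * 0.25 = 0.0625
```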
Learning the Multinomial Naïve Bayes Model                Sec. 13.3

• First attempt: maximum likelihood estimates
  • simply use the frequencies in the data

P̂(wi | cj) = count(wi, cj) / Σ_{w∈V} count(w, cj)

P̂(cj) = doccount(C = cj) / Ndoc
Multinomial Naïve Bayes: Learning

• From training corpus, extract Vocabulary
• Calculate P(cj) terms
  • For each cj in C do
    docsj ← all docs with class = cj
    P(cj) ← |docsj| / |total # documents|
• Calculate P(wk | cj) terms
  • Textj ← single doc containing all docsj
  • For each word wk in Vocabulary
    nk ← # of occurrences of wk in Textj
    P(wk | cj) ← (nk + α) / (n + α|Vocabulary|)
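The pseudocode above can be sketched in Python; `train_multinomial_nb` is a hypothetical name, and the add-α smoothing follows the P(wk | cj) update shown:

```python
from collections import Counter

def train_multinomial_nb(docs, alpha=1.0):
    """docs: list of (token_list, label) pairs. Returns class priors and
    add-alpha smoothed word likelihoods, following the pseudocode above."""
    labels = {label for _, label in docs}
    vocab = {w for toks, _ in docs for w in toks}
    priors, likelihood = {}, {}
    for c in labels:
        class_docs = [toks for toks, label in docs if label == c]
        priors[c] = len(class_docs) / len(docs)   # P(c) = |docs_c| / N
        # Text_j: concatenate all docs of class c and count the words.
        counts = Counter(w for toks in class_docs for w in toks)
        n = sum(counts.values())
        likelihood[c] = {w: (counts[w] + alpha) / (n + alpha * len(vocab))
                         for w in vocab}
    return priors, likelihood

# Training data from the worked example that follows.
train = [("Chinese Beijing Chinese".split(), "c"),
         ("Chinese Chinese Shanghai".split(), "c"),
         ("Chinese Macao".split(), "c"),
         ("Tokyo Japan Chinese".split(), "j")]
priors, likelihood = train_multinomial_nb(train)
```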
Choosing a class: P(c|d5)

Add-1 smoothed estimates:
P̂(w | c) = (count(w, c) + 1) / (count(c) + |V|)
P̂(c) = Nc / N

          Doc  Words                                Class
Training  1    Chinese Beijing Chinese              c
          2    Chinese Chinese Shanghai             c
          3    Chinese Macao                        c
          4    Tokyo Japan Chinese                  j
Test      5    Chinese Chinese Chinese Tokyo Japan  ?

Priors:
P(c) = 3/4
P(j) = 1/4

Conditional probabilities:
P(Chinese | c) = (5+1)/(8+6) = 6/14 = 3/7
P(Tokyo | c)   = (0+1)/(8+6) = 1/14
P(Japan | c)   = (0+1)/(8+6) = 1/14
P(Chinese | j) = (1+1)/(3+6) = 2/9
P(Tokyo | j)   = (1+1)/(3+6) = 2/9
P(Japan | j)   = (1+1)/(3+6) = 2/9

Choosing a class:
P(c | d5) ∝ 3/4 · (3/7)³ · 1/14 · 1/14 ≈ 0.0003
P(j | d5) ∝ 1/4 · (2/9)³ · 2/9 · 2/9 ≈ 0.0001
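The two scores can be verified with plain arithmetic (the variable names are mine):

```python
# Priors and add-1 smoothed conditionals from the worked example.
p_c, p_j = 3/4, 1/4
p_chinese_c, p_tokyo_c, p_japan_c = 3/7, 1/14, 1/14
p_chinese_j = p_tokyo_j = p_japan_j = 2/9

# d5 = "Chinese Chinese Chinese Tokyo Japan"
score_c = p_c * p_chinese_c**3 * p_tokyo_c * p_japan_c
score_j = p_j * p_chinese_j**3 * p_tokyo_j * p_japan_j

print(round(score_c, 4), round(score_j, 4))  # 0.0003 0.0001
```

So class c wins, despite Tokyo and Japan never appearing in a c document.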
Underflow Prevention: log space

• Multiplying lots of probabilities can result in floating-point underflow.
• Since log(xy) = log(x) + log(y)
  • Better to sum logs of probabilities instead of multiplying probabilities.
• Class with highest un-normalized log probability score is still most probable.
• Model is now just max of sum of weights:

cNB = argmax_{cj∈C} [ log P(cj) + Σ_{i∈positions} log P(xi | cj) ]
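A minimal log-space version of the decision rule, reusing the numbers from the worked example (`choose_class` is a hypothetical name):

```python
import math

def choose_class(doc, log_prior, log_likelihood):
    # Highest sum of log prior + log likelihoods wins; no underflow risk.
    return max(log_prior,
               key=lambda c: log_prior[c] +
                             sum(log_likelihood[c][w] for w in doc))

# Log-space parameters from the worked example (classes c and j).
log_prior = {"c": math.log(3/4), "j": math.log(1/4)}
log_likelihood = {
    "c": {"Chinese": math.log(3/7), "Tokyo": math.log(1/14), "Japan": math.log(1/14)},
    "j": {"Chinese": math.log(2/9), "Tokyo": math.log(2/9), "Japan": math.log(2/9)},
}
d5 = "Chinese Chinese Chinese Tokyo Japan".split()
print(choose_class(d5, log_prior, log_likelihood))  # c
```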
Summary: Naive Bayes is Not So Naive

• Very fast, low storage requirements
• Robust to irrelevant features
  • Irrelevant features cancel each other without affecting results
• Very good in domains with many equally important features
  • Decision trees suffer from fragmentation in such cases, especially with little data
• Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem
• A good, dependable baseline for text classification
  • But we will see other classifiers that give better accuracy
Text Classification: Evaluation
The 2-by-2 contingency table

              correct   not correct
selected      tp        fp
not selected  fn        tn
Precision and recall

• Precision: % of selected items that are correct
• Recall: % of correct items that are selected

              correct   not correct
selected      tp        fp
not selected  fn        tn
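The two definitions translate directly to code (`precision_recall` is a hypothetical helper name):

```python
def precision_recall(tp, fp, fn):
    # Precision: fraction of selected items that are correct = tp / (tp + fp).
    # Recall: fraction of correct items that are selected = tp / (tp + fn).
    return tp / (tp + fp), tp / (tp + fn)

p, r = precision_recall(tp=90, fp=10, fn=10)
print(p, r)  # 0.9 0.9
```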
A combined measure: F

• A combined measure that assesses the P/R tradeoff is F measure (weighted harmonic mean):

F = 1 / (α(1/P) + (1−α)(1/R)) = (β² + 1)PR / (β²P + R)

• The harmonic mean is a very conservative average; see IIR §8.3
• People usually use the balanced F1 measure
  • i.e., with β = 1 (that is, α = ½):  F = 2PR / (P + R)
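The weighted harmonic mean can be sketched as follows (`f_measure` is a hypothetical function name):

```python
def f_measure(p, r, beta=1.0):
    # Weighted harmonic mean of precision and recall;
    # beta > 1 favors recall, beta < 1 favors precision.
    return (beta**2 + 1) * p * r / (beta**2 * p + r)

# With beta = 1 this reduces to the balanced F1 = 2PR / (P + R).
print(f_measure(0.5, 0.5))  # 0.5
```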
More Than Two Classes: Sets of binary classifiers          Sec. 14.5

• Dealing with any-of or multivalue classification
  • A document can belong to 0, 1, or >1 classes.
• For each class c ∈ C
  • Build a classifier γc to distinguish c from all other classes c′ ∈ C
• Given test doc d,
  • Evaluate it for membership in each class using each γc
  • d belongs to any class for which γc returns true
More Than Two Classes: Sets of binary classifiers          Sec. 14.5

• One-of or multinomial classification
  • Classes are mutually exclusive: each document in exactly one class
• For each class c ∈ C
  • Build a classifier γc to distinguish c from all other classes c′ ∈ C
• Given test doc d,
  • Evaluate it for membership in each class using each γc
  • d belongs to the one class with maximum score
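A sketch of the one-of decision rule, with toy scoring functions standing in for trained binary classifiers γc (all names here are hypothetical):

```python
def one_of_classify(doc, scorers):
    # scorers: dict mapping class label -> scoring function (one per class).
    # One-of rule: pick the single class whose classifier scores highest.
    return max(scorers, key=lambda c: scorers[c](doc))

# Toy scorers standing in for trained binary classifiers.
scorers = {"wheat": lambda d: d.count("grain"),
           "coffee": lambda d: d.count("roast")}
print(one_of_classify(["grain", "grain", "roast"], scorers))  # wheat
```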
Confusion matrix c

• For each pair of classes <c1, c2>, how many documents from c1 were incorrectly assigned to c2?
  • c3,2: 90 wheat documents incorrectly assigned to poultry

Docs in test set  Assigned UK  Assigned poultry  Assigned wheat  Assigned coffee  Assigned interest  Assigned trade
True UK           95           1                 13              0                1                  0
True poultry      0            1                 0               0                0                  0
True wheat        10           90                0               1                0                  0
True coffee       0            0                 0               34               3                  7
True interest     -            1                 2               13               26                 5
True trade        0            0                 2               14               5                  10
Per-class evaluation measures                    Sec. 15.2.4

Recall: fraction of docs in class i classified correctly:
  c_ii / Σ_j c_ij

Precision: fraction of docs assigned class i that are actually about class i:
  c_ii / Σ_j c_ji

Accuracy: (1 − error rate) fraction of docs classified correctly:
  Σ_i c_ii / Σ_i Σ_j c_ij
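The formulas can be checked on a small confusion matrix; `per_class_measures` is a hypothetical helper, and C[true][assigned] indexing is assumed:

```python
def per_class_measures(C, i):
    """C: confusion matrix as a list of rows, C[true][assigned].
    Returns (recall, precision) for class index i."""
    recall = C[i][i] / sum(C[i][j] for j in range(len(C)))     # row sum
    precision = C[i][i] / sum(C[j][i] for j in range(len(C)))  # column sum
    return recall, precision

# A toy 2-class confusion matrix.
C = [[95, 5],
     [10, 90]]
r, p = per_class_measures(C, 0)
print(r)  # 0.95
```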
Micro- vs. Macro-Averaging                       Sec. 15.2.4

• If we have more than one class, how do we combine multiple performance measures into one quantity?
• Macroaveraging: compute performance for each class, then average.
• Microaveraging: collect decisions for all classes, compute contingency table, evaluate.
Micro- vs. Macro-Averaging: Example              Sec. 15.2.4

Class 1:
                 Truth: yes  Truth: no
Classifier: yes  10          10
Classifier: no   10          970

Class 2:
                 Truth: yes  Truth: no
Classifier: yes  90          10
Classifier: no   10          890

Micro Ave. Table:
                 Truth: yes  Truth: no
Classifier: yes  100         20
Classifier: no   20          1860

• Macroaveraged precision: (0.5 + 0.9)/2 = 0.7
• Microaveraged precision: 100/120 = .83
• Microaveraged score is dominated by score on common classes
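The two averages can be reproduced from the tables' tp/fp counts:

```python
# tp/fp counts read off the Class 1 and Class 2 tables above.
tp1, fp1 = 10, 10
tp2, fp2 = 90, 10

# Macroaveraging: average the per-class precisions.
macro_p = ((tp1 / (tp1 + fp1)) + (tp2 / (tp2 + fp2))) / 2
# Microaveraging: pool the counts, then compute precision once.
micro_p = (tp1 + tp2) / (tp1 + fp1 + tp2 + fp2)

print(round(macro_p, 2), round(micro_p, 2))  # 0.7 0.83
```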
Development Test Sets and Cross-validation

• Metric: P/R/F1 or Accuracy
• Unseen test set
  • avoid overfitting ('tuning to the test set')
  • more conservative estimate of performance
• Cross-validation over multiple splits
  • Handle sampling errors from different datasets
• Pool results over each split
  • Compute pooled dev set performance

[Figure: the corpus is divided into a Training Set, a Development Test Set, and a final Test Set; under cross-validation, multiple different Training/Dev Test splits are used before evaluating once on the Test Set.]
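A minimal sketch of producing cross-validation splits (`kfold_splits` is a hypothetical name, and the sketch assumes the number of documents divides evenly by k):

```python
def kfold_splits(n_docs, k):
    # Partition document indices into k folds; each fold serves once as the
    # dev/test split while the remaining folds form the training set.
    fold = n_docs // k            # assumes n_docs is divisible by k
    idx = list(range(n_docs))
    for i in range(k):
        test = idx[i * fold:(i + 1) * fold]
        train = idx[:i * fold] + idx[(i + 1) * fold:]
        yield train, test

splits = list(kfold_splits(10, k=5))
```

Performance is then computed on each held-out fold and pooled, as described above.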