Text classification: The Naive Bayes Classifier
Chapter 4 in Martin/Jurafsky
Is this spam?
Who wrote which Federalist papers?
• 1787–88: anonymous essays (written by Jay, Madison, and Hamilton) tried to convince New York to ratify the U.S. Constitution.
• Authorship of 12 of the letters in dispute
• 1963: solved by Mosteller and Wallace using Bayesian methods
[Portraits: James Madison and Alexander Hamilton]
[Image: first page of the paper]
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, June 1963
INFERENCE IN AN AUTHORSHIP PROBLEM
A comparative study of discrimination methods applied to the authorship of the disputed Federalist papers
FREDERICK MOSTELLER, Harvard University and Center for Advanced Study in the Behavioral Sciences, AND DAVID L. WALLACE, University of Chicago
"This study has four purposes: to provide a comparison of discrimination methods; to explore the problems presented by techniques based strongly on Bayes' theorem when they are used in a data analysis of large scale; to solve the authorship question of The Federalist papers; and to propose routine methods for solving other authorship problems. Word counts are the variables used for discrimination. Since the topic written about heavily influences the rate with which a word is used, care in selection of words is necessary. The filler words of the language such as an, of, and upon, and, more generally, articles, prepositions, and conjunctions provide fairly stable rates, whereas more meaningful words like war, executive, and legislature do not. After an investigation of the distribution of these counts, the authors execute an analysis employing the usual discriminant function and an analysis based on Bayesian methods. The conclusions about the authorship problem are that Madison rather than Hamilton wrote all 12 of the disputed papers."
Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.
What is the subject of this article?
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …
[Diagram: a MEDLINE article, its subject category unknown ("?"), to be mapped into the MeSH Subject Category Hierarchy]
Text classification
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …
Text classification: problem definition
• Input:
– a document d
– a fixed set of classes C = {c1, c2,…, cJ}
• Output: a predicted class c ∈ C (and a level of confidence in the prediction)
Classification methods: Hand-coded rules
• Rules based on combinations of words or other features
– spam: black-list-address OR (“dollars” AND “have been selected”)
• Accuracy can be high
– If rules carefully refined by expert
• Difficulty?
Classification methods: Hand-coded rules
• Rules based on combinations of words or other features
– spam: black-list-address OR (“dollars” AND “have been selected”)
• Accuracy can be high
– If rules carefully refined by expert
• Building and maintaining these rules is time consuming
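As a concrete illustration (my own sketch, with a hypothetical message format and black-list, not from the slides), the rule above is just a boolean expression over message features:

```python
# Hypothetical black-list for illustration.
BLACK_LIST = {"scam@example.com"}

def is_spam(msg: dict) -> bool:
    """Hand-coded rule from the slide: flag mail from a black-listed address,
    or mail whose body mentions both "dollars" and "have been selected"."""
    body = msg["body"].lower()
    return (msg["from"] in BLACK_LIST
            or ("dollars" in body and "have been selected" in body))

print(is_spam({"from": "a@b.com",
               "body": "You have been selected to win 1,000,000 dollars!"}))  # True
```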
Classification using supervised machine learning
• Input:
– a document d
– a fixed set of classes C = {c1, c2,…, cJ}
– A training set of N hand-labeled documents (d1,c1), …, (dN,cN)
• Output:
– a learned classifier, which is a mapping from the set of
documents to the set of labels
Classification methods: Supervised machine learning
• Any kind of classifier can be used for this task:
– Naïve Bayes
– Logistic regression
– Support-vector machines
– Neural networks
– …
Naïve Bayes Intuition
• Simple (“naïve”) classification method based on Bayes rule
• Relies on very simple representation of document
– Bag of words
The "bag of words" representation
[Figure: the review's words scattered as an unordered "bag"]

"I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!"

Word counts: it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, …
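As a minimal sketch in code (my own illustration, not from the chapter), the bag-of-words representation is just a word-to-count map; after tokenization, Python's collections.Counter does all the work:

```python
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase, tokenize on letter runs, and count word frequencies,
    discarding all positional information."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

review = ("I love this movie! It's sweet, but with satirical humor. "
          "The dialogue is great and the adventure scenes are fun...")
print(bag_of_words(review).most_common(3))
```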
Classification with probabilistic models
• To classify a document d, choose the class that has the highest probability:
4.1 Naive Bayes Classifiers
In this section we introduce the multinomial naive Bayes classifier, so called because it is a Bayesian classifier that makes a simplifying (naive) assumption about how the features interact.
The intuition of the classifier is shown in Fig. 4.1. We represent a text document as if it were a bag of words, that is, an unordered set of words with their position ignored, keeping only their frequency in the document. In the example in the figure, instead of representing the word order in all the phrases like "I love this movie" and "I would recommend it", we simply note that the word I occurred 5 times in the entire excerpt, the word it 6 times, the words love, recommend, and movie once, and so on.
[Figure 4.1: the same bag-of-words illustration shown above]
Figure 4.1 Intuition of the multinomial naive Bayes classifier applied to a movie review. The position of the words is ignored (the bag-of-words assumption) and we make use of the frequency of each word.
Naive Bayes is a probabilistic classifier, meaning that for a document d, out of all classes c ∈ C the classifier returns the class ĉ which has the maximum posterior probability given the document. In Eq. 4.1 we use the hat notation ˆ to mean "our estimate of the correct class".

ĉ = argmax_{c ∈ C} P(c|d)    (4.1)

This idea of Bayesian inference has been known since the work of Bayes (1763), and was first applied to text classification by Mosteller and Wallace (1964). The intuition of Bayesian classification is to use Bayes' rule to transform Eq. 4.1 into other probabilities that have some useful properties: Bayes' rule gives us a way to break down any conditional probability P(x|y) into three other probabilities (see the next slide).
Bayes’ rule
From the product rule:
P(x, y) = P(y|x) P(x)
and:
P(x, y) = P(x|y) P(y)
Therefore:
P(y|x) = P(x|y) P(y) / P(x)

This is known as Bayes’ rule.
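A quick numeric check of the rule (toy numbers of my own, not from the slides): given a prior and the two class-conditional likelihoods of a word, the posterior follows mechanically:

```python
# Toy numbers, assumed for illustration: prior and per-class likelihoods.
p_spam = 0.2
p_word_given_spam = 0.01   # P(word | spam)
p_word_given_ham = 0.0001  # P(word | ham)

# Marginal P(word) via the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' rule: P(spam | word) = P(word | spam) P(spam) / P(word).
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # 0.962
```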
Bayes’ rule for classification of documents
• For a document d and a class c:

P(c|d) = P(d|c) P(c) / P(d)
MAP classification
cMAP = argmax_{c ∈ C} P(c|d)
     = argmax_{c ∈ C} P(d|c) P(c) / P(d)    (Bayes’ rule)
     = argmax_{c ∈ C} P(d|c) P(c)           (dropping the denominator P(d), which is the same for every class)

MAP is “maximum a posteriori” = most likely class
MAP classification
cMAP = argmax_{c ∈ C} P(d|c) P(c)

Document d represented as features x1, …, xn:

     = argmax_{c ∈ C} P(x1, x2, …, xn | c) P(c)
MAP classification
cMAP = argmax_{c ∈ C} P(x1, x2, …, xn | c) P(c)

• P(c) — how often does this class occur? We can just count the relative frequencies in a corpus.
• P(x1, x2, …, xn | c) — O(|X|^n · |C|) parameters; could only be estimated if a very, very large number of training examples was available.
Naïve Bayes independence assumption
P(x1, x2, …, xn | c)
• Conditional Independence: Assume the feature probabilities P(xi|c) are independent given the class c:

P(x1, …, xn | c) = P(x1|c) · P(x2|c) · P(x3|c) · … · P(xn|c)
Multinomial Naïve Bayes
• Conditional Independence: Assume the feature probabilities P(xi|c) are independent given the class c.
• Bag of Words assumption: position doesn’t matter; the variables represent counts / presence-absence of a word in a document.

P(x1, …, xn | c) = P(x1|c) · P(x2|c) · P(x3|c) · … · P(xn|c)
Multinomial Naïve Bayes Classifier
cMAP = argmax_{c ∈ C} P(x1, x2, …, xn | c) P(c)

cNB = argmax_{c ∈ C} P(c) ∏i P(xi|c)
How many parameters in a model with a vocabulary V?
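With the independence assumption the model needs only |V| conditional probabilities per class, i.e. |V|·|C| parameters plus the |C| priors, instead of O(|X|^n · |C|). A minimal prediction sketch (hypothetical two-class parameters of my own; summing logs avoids floating-point underflow on long documents):

```python
import math

# Hypothetical toy parameters for a two-class spam model (not from the slides).
log_prior = {"spam": math.log(0.4), "ham": math.log(0.6)}
log_likelihood = {
    "spam": {"dollars": math.log(0.02), "selected": math.log(0.01)},
    "ham":  {"dollars": math.log(0.001), "selected": math.log(0.002)},
}

def classify_nb(words):
    """c_NB = argmax_c [ log P(c) + sum_i log P(x_i | c) ]."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c].get(w, 0.0) for w in words)
        for c in log_prior
    }
    # .get(w, 0.0) silently skips out-of-vocabulary words here; a trained
    # model smooths instead (see Laplace smoothing later in this deck).
    return max(scores, key=scores.get)

print(classify_nb(["dollars", "selected"]))  # -> spam
```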
Learning the Multinomial Naïve Bayes Model
• First attempt: maximum likelihood estimates
– simply use the frequencies in the data
P(c) = Nc / Ndoc — the fraction of documents in class c

P(wi|c) = count(wi, c) / Σ_{w ∈ V} count(w, c) — the fraction of times word wi appears among all words in documents of topic c (V is the vocabulary)
Maximum likelihood
• Fit a probabilistic model P(x|θ) to data
– Estimate θ
• Given independent identically distributed (i.i.d.) data X = (x1, x2, …, xn)
– Likelihood
– Log likelihood
• Maximum likelihood solution: parameters θ that maximize ln P(X|θ)
P(X|θ) = P(x1|θ) P(x2|θ) ⋯ P(xn|θ)

ln P(X|θ) = Σ_{i=1}^{n} ln P(xi|θ)
Example: coin toss
• Estimate the probability p that a coin lands “Heads” using the result of n coin tosses, h of which resulted in heads.
• The likelihood of the data:
• Log likelihood:
• Taking a derivative and setting to 0:
P(X|θ) = p^h (1 − p)^(n−h)

ln P(X|θ) = h ln p + (n − h) ln(1 − p)

∂ ln P(X|θ) / ∂p = h/p − (n − h)/(1 − p) = 0

⇒ p = h/n
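A quick numerical check of this derivation (toy data of my own): with h = 7 heads out of n = 10 tosses, the log likelihood should peak exactly at p = h/n = 0.7:

```python
import math

h, n = 7, 10  # assumed toy data: 7 heads in 10 tosses

def log_likelihood(p):
    return h * math.log(p) + (n - h) * math.log(1 - p)

# Grid search over p in (0, 1); the maximum lands at h/n = 0.7,
# matching the closed-form solution from the derivative above.
best_p = max((k / 1000 for k in range(1, 1000)), key=log_likelihood)
print(best_p)  # -> 0.7
```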
Parameter estimation

P(wi|c) = count(wi, c) / Σ_{w ∈ V} count(w, c)

• Create a mega-document for topic c by concatenating all docs in this topic
– Use the frequency of wi in the mega-document
Problem with zeros
• What if we have seen no training documents that contain the word fantastic and are labeled with the topic positive (thumbs-up)?
• Zero probabilities cannot be conditioned away, no matter the other evidence!
P("fantastic" positive) = count("fantastic", positive)count(w, positive
w∈V∑ )
= 0
cMAP = argmaxc P(c) P(xi | c)i∏
Laplace (add-1) smoothing for Naïve Bayes
P(wi|c) = (count(wi, c) + 1) / Σ_{w ∈ V} (count(w, c) + 1)
        = (count(wi, c) + 1) / ((Σ_{w ∈ V} count(w, c)) + |V|)
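A minimal sketch of the smoothed estimator (my own helper, assuming tokenized documents and a vocabulary shared across classes):

```python
from collections import Counter

def laplace_word_probs(class_docs, vocab):
    """Add-1 smoothed P(w|c), estimated from the tokenized documents of one
    class. Every word in the shared vocabulary gets count + 1, so no
    probability is ever zero."""
    counts = Counter(w for doc in class_docs for w in doc)
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

# "fantastic" never occurs in the positive docs but still gets mass:
pos_docs = [["great", "plot"], ["great", "fun"]]
probs = laplace_word_probs(pos_docs, vocab={"great", "plot", "fun", "fantastic"})
print(probs["fantastic"])  # (0 + 1) / (4 + 4) = 0.125
```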
Multinomial Naïve Bayes: Learning
• From the training corpus, extract the vocabulary V
• Calculate the P(c) terms:
  P(c) ← Nc / Ndoc
• Calculate the P(wk|c) terms. For each word wk in the vocabulary V:
  nk ← number of occurrences of wk in documents in class c
  n ← total number of words in documents in class c
  P(wk|c) ← (nk + α) / (n + α·|V|)
Worked example

          Doc  Words                                Class
Training:  1   Chinese Beijing Chinese              c
           2   Chinese Chinese Shanghai             c
           3   Chinese Macao                        c
           4   Tokyo Japan Chinese                  j
Test:      5   Chinese Chinese Chinese Tokyo Japan  ?

Priors:
P(c) = 3/4
P(j) = 1/4

Conditional probabilities (with add-1 smoothing):
P(Chinese|c) = (5+1) / (8+6) = 6/14 = 3/7
P(Tokyo|c)   = (0+1) / (8+6) = 1/14
P(Japan|c)   = (0+1) / (8+6) = 1/14
P(Chinese|j) = (1+1) / (3+6) = 2/9
P(Tokyo|j)   = (1+1) / (3+6) = 2/9
P(Japan|j)   = (1+1) / (3+6) = 2/9

Making a prediction:
P(c|d5) ∝ 3/4 × (3/7)³ × 1/14 × 1/14 ≈ 0.0003
P(j|d5) ∝ 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 0.0001
so d5 is assigned class c.
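The whole pipeline is short enough to check by hand; this sketch (my own code, not from the chapter) reproduces the numbers above:

```python
from collections import Counter

train = [("Chinese Beijing Chinese", "c"),
         ("Chinese Chinese Shanghai", "c"),
         ("Chinese Macao", "c"),
         ("Tokyo Japan Chinese", "j")]
test = "Chinese Chinese Chinese Tokyo Japan"

vocab = {w for text, _ in train for w in text.split()}
classes = {label for _, label in train}

for c in sorted(classes):
    class_words = [w for text, lbl in train if lbl == c for w in text.split()]
    counts = Counter(class_words)
    prior = sum(1 for _, lbl in train if lbl == c) / len(train)
    denom = len(class_words) + len(vocab)   # add-1 smoothing denominator
    score = prior
    for w in test.split():
        score *= (counts[w] + 1) / denom    # smoothed P(w|c)
    print(c, round(score, 4))               # c 0.0003, j 0.0001
```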
Multinomial Naïve Bayes as a generative model
[Diagram: the class node c = China generates the words X1 = Shanghai, X2 = and, X3 = Shenzhen, X4 = issue, X5 = bonds]
Naïve Bayes and Language Modeling
• Naïve Bayes classifiers can use any sort of feature – URL, email address, network features etc.
• But if, as in the previous slides – We use only word features
• Then – Naïve Bayes is related to language models.
Each class is a unigram language model
• Assign each word a probability: P(word | c)
• Assign each sentence a probability: P(s|c) = Π P(word|c)
Class pos
0.1   I
0.1   love
0.01  this
0.05  fun
0.1   film
…

I love this fun film
0.1  0.1  0.01  0.05  0.1

P(s|pos) = 0.0000005
Naïve Bayes as a Language Model
• Which class assigns the higher probability to s?
Model pos          Model neg
0.1   I            0.2    I
0.1   love         0.001  love
0.01  this         0.01   this
0.05  fun          0.005  fun
0.1   film         0.1    film

I love this fun film
pos: 0.1 × 0.1 × 0.01 × 0.05 × 0.1
neg: 0.2 × 0.001 × 0.01 × 0.005 × 0.1

P(s|pos) > P(s|neg)
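Checking the comparison in code (per-word probabilities taken from the slide):

```python
import math

models = {
    "pos": {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1},
    "neg": {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1},
}
sentence = "I love this fun film".split()

for c, probs in models.items():
    p = math.prod(probs[w] for w in sentence)  # P(s|c) = product of P(word|c)
    print(f"P(s|{c}) = {p:.1e}")
# P(s|pos) = 5.0e-07  >  P(s|neg) = 1.0e-09
```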
Naïve Bayes in spam filtering
• SpamAssassin has a version that uses Naive Bayes, drawing on a variety of features of email messages:
– Mentions Viagra (or other drugs)
– Online pharmacy
– Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)
– From: starts with many numbers
– Subject is all capitals
– Claims you can be removed from the list
– https://spamassassin.apache.org/old/tests_3_3_x.htm
Naïve Bayes Summary
• Very fast; low storage requirements
• Robust to irrelevant features
– Irrelevant features cancel each other out without affecting results
• Optimal if the independence assumptions hold
• A good dependable baseline for text classification
– But we will see other classifiers that give better accuracy