Text classification: The Naive Bayes Classifier
Chapter 4 in Martin/Jurafsky
Is this spam?
Who wrote which Federalist papers?
• 1787–88: anonymous essays (written by Jay, Madison, and Hamilton) tried to convince New York to ratify the U.S. Constitution.
• Authorship of 12 of the letters in dispute
• 1963: solved by Mosteller and Wallace using Bayesian methods
[Portraits: James Madison and Alexander Hamilton]
[Image: first page of the paper]
JOURNAL OF THE AMERICAN STATISTICAL ASSOCIATION, June 1963
INFERENCE IN AN AUTHORSHIP PROBLEM
A comparative study of discrimination methods applied to the authorship of the disputed Federalist papers
FREDERICK MOSTELLER, Harvard University and Center for Advanced Study in the Behavioral Sciences, AND DAVID L. WALLACE, University of Chicago
"This study has four purposes: to provide a comparison of discrimination methods; to explore the problems presented by techniques based strongly on Bayes' theorem when they are used in a data analysis of large scale; to solve the authorship question of The Federalist papers; and to propose routine methods for solving other authorship problems. Word counts are the variables used for discrimination. Since the topic written about heavily influences the rate with which a word is used, care in selection of words is necessary. The filler words of the language such as an, of, and upon, and, more generally, articles, prepositions, and conjunctions provide fairly stable rates, whereas more meaningful words like war, executive, and legislature do not. After an investigation of the distribution of these counts, the authors execute an analysis employing the usual discriminant function and an analysis based on Bayesian methods. The conclusions about the authorship problem are that Madison rather than Hamilton wrote all 12 of the disputed papers."
Positive or negative movie review?
• unbelievably disappointing
• Full of zany characters and richly applied satire, and some great plot twists
• this is the greatest screwball comedy ever filmed
• It was pathetic. The worst part about it was the boxing scenes.
What is the subject of this article?
• Antagonists and Inhibitors
• Blood Supply
• Chemistry
• Drug Therapy
• Embryology
• Epidemiology
• …
[Diagram: a MEDLINE article, its subject category unknown ("?"), to be mapped into the MeSH Subject Category Hierarchy]
Text classification
• Assigning subject categories, topics, or genres
• Spam detection
• Authorship identification
• Age/gender identification
• Language identification
• Sentiment analysis
• …
Text classification: problem definition
• Input:
– a document d
– a fixed set of classes C = {c1, c2,…, cJ}
• Output: a predicted class c ∈ C (and a level of confidence in the prediction)
Classification methods: Hand-coded rules
• Rules based on combinations of words or other features
– spam: black-list-address OR (“dollars” AND “have been selected”)
• Accuracy can be high
– If rules carefully refined by expert
• Difficulty?
Classification methods: Hand-coded rules
• Rules based on combinations of words or other features
– spam: black-list-address OR (“dollars” AND “have been selected”)
• Accuracy can be high
– If rules carefully refined by expert
• Building and maintaining these rules is time consuming
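As a concrete illustration (my own sketch, with a hypothetical message format and black-list, not from the slides), the rule above is just a boolean expression over message features:

```python
# Hypothetical black-list for illustration.
BLACK_LIST = {"scam@example.com"}

def is_spam(msg: dict) -> bool:
    """Hand-coded rule from the slide: flag mail from a black-listed address,
    or mail whose body mentions both "dollars" and "have been selected"."""
    body = msg["body"].lower()
    return (msg["from"] in BLACK_LIST
            or ("dollars" in body and "have been selected" in body))

print(is_spam({"from": "a@b.com",
               "body": "You have been selected to win 1,000,000 dollars!"}))  # True
```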
Classification using supervised machine learning
• Input:
– a document d
– a fixed set of classes C = {c1, c2,…, cJ}
– A training set of N hand-labeled documents (d1,c1), …, (dN,cN)
• Output:
– a learned classifier, which is a mapping from the set of
documents to the set of labels
Classification methods: Supervised machine learning
• Any kind of classifier can be used for this task:
– Naïve Bayes
– Logistic regression
– Support-vector machines
– Neural networks
– …
Naïve Bayes Intuition
• Simple (“naïve”) classification method based on Bayes rule
• Relies on very simple representation of document
– Bag of words
The "bag of words" representation
[Figure: the review's words scattered as an unordered "bag"]

"I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun... It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet!"

Word counts: it 6, I 5, the 4, to 3, and 3, seen 2, yet 1, would 1, whimsical 1, times 1, sweet 1, satirical 1, adventure 1, genre 1, fairy 1, humor 1, have 1, great 1, …
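As a minimal sketch in code (my own illustration, not from the chapter), the bag-of-words representation is just a word-to-count map; after tokenization, Python's collections.Counter does all the work:

```python
import re
from collections import Counter

def bag_of_words(text: str) -> Counter:
    """Lowercase, tokenize on letter runs, and count word frequencies,
    discarding all positional information."""
    return Counter(re.findall(r"[a-z']+", text.lower()))

review = ("I love this movie! It's sweet, but with satirical humor. "
          "The dialogue is great and the adventure scenes are fun...")
print(bag_of_words(review).most_common(3))
```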
Classification with probabilistic models
• To classify a document d, choose the class that has the highest probability:
4.1 Naive Bayes Classifiers
In this section we introduce the multinomial naive Bayes classifier, so called because it is a Bayesian classifier that makes a simplifying (naive) assumption about how the features interact.
The intuition of the classifier is shown in Fig. 4.1. We represent a text document as if it were a bag of words, that is, an unordered set of words with their position ignored, keeping only their frequency in the document. In the example in the figure, instead of representing the word order in all the phrases like "I love this movie" and "I would recommend it", we simply note that the word I occurred 5 times in the entire excerpt, the word it 6 times, the words love, recommend, and movie once, and so on.
[Figure 4.1: the same bag-of-words illustration shown above]
Figure 4.1 Intuition of the multinomial naive Bayes classifier applied to a movie review. The position of the words is ignored (the bag-of-words assumption) and we make use of the frequency of each word.
Naive Bayes is a probabilistic classifier, meaning that for a document d, out of all classes c ∈ C the classifier returns the class ĉ which has the maximum posterior probability given the document. In Eq. 4.1 we use the hat notation ˆ to mean "our estimate of the correct class".

ĉ = argmax_{c ∈ C} P(c|d)    (4.1)

This idea of Bayesian inference has been known since the work of Bayes (1763), and was first applied to text classification by Mosteller and Wallace (1964). The intuition of Bayesian classification is to use Bayes' rule to transform Eq. 4.1 into other probabilities that have some useful properties: Bayes' rule gives us a way to break down any conditional probability P(x|y) into three other probabilities (see the next slide).
Bayes’ rule
From the product rule:
P(x, y) = P(y|x) P(x)
and:
P(x, y) = P(x|y) P(y)
Therefore:
P(y|x) = P(x|y) P(y) / P(x)

This is known as Bayes’ rule.
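A quick numeric check of the rule (toy numbers of my own, not from the slides): given a prior and the two class-conditional likelihoods of a word, the posterior follows mechanically:

```python
# Toy numbers, assumed for illustration: prior and per-class likelihoods.
p_spam = 0.2
p_word_given_spam = 0.01   # P(word | spam)
p_word_given_ham = 0.0001  # P(word | ham)

# Marginal P(word) via the law of total probability.
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' rule: P(spam | word) = P(word | spam) P(spam) / P(word).
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))  # 0.962
```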
Bayes’ rule for classification of documents
• For a document d and a class c:

P(c|d) = P(d|c) P(c) / P(d)
MAP classification
cMAP = argmax_{c ∈ C} P(c|d)
     = argmax_{c ∈ C} P(d|c) P(c) / P(d)    (Bayes’ rule)
     = argmax_{c ∈ C} P(d|c) P(c)           (dropping the denominator P(d), which is the same for every class)

MAP is “maximum a posteriori” = most likely class
MAP classification
cMAP = argmax_{c ∈ C} P(d|c) P(c)

Document d represented as features x1, …, xn:

     = argmax_{c ∈ C} P(x1, x2, …, xn | c) P(c)
MAP classification
cMAP = argmax_{c ∈ C} P(x1, x2, …, xn | c) P(c)

• P(c) — how often does this class occur? We can just count the relative frequencies in a corpus.
• P(x1, x2, …, xn | c) — O(|X|^n · |C|) parameters; could only be estimated if a very, very large number of training examples was available.
Naïve Bayes independence assumption
P(x1, x2, …, xn | c)
• Conditional Independence: Assume the feature probabilities P(xi|c) are independent given the class c:

P(x1, …, xn | c) = P(x1|c) · P(x2|c) · P(x3|c) · … · P(xn|c)
Multinomial Naïve Bayes
• Conditional Independence: Assume the feature probabilities P(xi|c) are independent given the class c.
• Bag of Words assumption: position doesn’t matter; the variables represent counts / presence-absence of a word in a document.

P(x1, …, xn | c) = P(x1|c) · P(x2|c) · P(x3|c) · … · P(xn|c)
Multinomial Naïve Bayes Classifier
cMAP = argmax_{c ∈ C} P(x1, x2, …, xn | c) P(c)

cNB = argmax_{c ∈ C} P(c) ∏i P(xi|c)
How many parameters in a model with a vocabulary V?
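With the independence assumption the model needs only |V| conditional probabilities per class, i.e. |V|·|C| parameters plus the |C| priors, instead of O(|X|^n · |C|). A minimal prediction sketch (hypothetical two-class parameters of my own; summing logs avoids floating-point underflow on long documents):

```python
import math

# Hypothetical toy parameters for a two-class spam model (not from the slides).
log_prior = {"spam": math.log(0.4), "ham": math.log(0.6)}
log_likelihood = {
    "spam": {"dollars": math.log(0.02), "selected": math.log(0.01)},
    "ham":  {"dollars": math.log(0.001), "selected": math.log(0.002)},
}

def classify_nb(words):
    """c_NB = argmax_c [ log P(c) + sum_i log P(x_i | c) ]."""
    scores = {
        c: log_prior[c] + sum(log_likelihood[c].get(w, 0.0) for w in words)
        for c in log_prior
    }
    # .get(w, 0.0) silently skips out-of-vocabulary words here; a trained
    # model smooths instead (see Laplace smoothing later in this deck).
    return max(scores, key=scores.get)

print(classify_nb(["dollars", "selected"]))  # -> spam
```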
Learning the Multinomial Naïve Bayes Model
• First attempt: maximum likelihood estimates
– simply use the frequencies in the data
P(c) = Nc / Ndoc — the fraction of documents in class c

P(wi|c) = count(wi, c) / Σ_{w ∈ V} count(w, c) — the fraction of times word wi appears among all words in documents of topic c (V is the vocabulary)
Maximum likelihood
• Fit a probabilistic model P(x|θ) to data
– Estimate θ
• Given independent identically distributed (i.i.d.) data X = (x1, x2, …, xn)
– Likelihood
– Log likelihood
• Maximum likelihood solution: parameters θ that maximize ln P(X|θ)
P(X|θ) = P(x1|θ) P(x2|θ) ⋯ P(xn|θ)

ln P(X|θ) = Σ_{i=1}^{n} ln P(xi|θ)
Example: coin toss
• Estimate the probability p that a coin lands “Heads” using the result of n coin tosses, h of which resulted in heads.
• The likelihood of the data:
• Log likelihood:
• Taking a derivative and setting to 0:
P(X|θ) = p^h (1 − p)^(n−h)

ln P(X|θ) = h ln p + (n − h) ln(1 − p)

∂ ln P(X|θ) / ∂p = h/p − (n − h)/(1 − p) = 0

⇒ p = h/n
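A quick numerical check of this derivation (toy data of my own): with h = 7 heads out of n = 10 tosses, the log likelihood should peak exactly at p = h/n = 0.7:

```python
import math

h, n = 7, 10  # assumed toy data: 7 heads in 10 tosses

def log_likelihood(p):
    return h * math.log(p) + (n - h) * math.log(1 - p)

# Grid search over p in (0, 1); the maximum lands at h/n = 0.7,
# matching the closed-form solution from the derivative above.
best_p = max((k / 1000 for k in range(1, 1000)), key=log_likelihood)
print(best_p)  # -> 0.7
```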
Parameter estimation

P(wi|c) = count(wi, c) / Σ_{w ∈ V} count(w, c)

• Create a mega-document for topic c by concatenating all docs in this topic
– Use the frequency of wi in the mega-document
Problem with zeros
• What if we have seen no training documents that contain the word fantastic and are labeled with the topic positive (thumbs-up)?
• Zero probabilities cannot be conditioned away, no matter the other evidence!
P("fantastic" positive) = count("fantastic", positive)count(w, positive
w∈V∑ )
= 0
cMAP = argmaxc P(c) P(xi | c)i∏
Laplace (add-1) smoothing for Naïve Bayes
P(wi|c) = (count(wi, c) + 1) / Σ_{w ∈ V} (count(w, c) + 1)
        = (count(wi, c) + 1) / ((Σ_{w ∈ V} count(w, c)) + |V|)
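A minimal sketch of the smoothed estimator (my own helper, assuming tokenized documents and a vocabulary shared across classes):

```python
from collections import Counter

def laplace_word_probs(class_docs, vocab):
    """Add-1 smoothed P(w|c), estimated from the tokenized documents of one
    class. Every word in the shared vocabulary gets count + 1, so no
    probability is ever zero."""
    counts = Counter(w for doc in class_docs for w in doc)
    total = sum(counts.values()) + len(vocab)
    return {w: (counts[w] + 1) / total for w in vocab}

# "fantastic" never occurs in the positive docs but still gets mass:
pos_docs = [["great", "plot"], ["great", "fun"]]
probs = laplace_word_probs(pos_docs, vocab={"great", "plot", "fun", "fantastic"})
print(probs["fantastic"])  # (0 + 1) / (4 + 4) = 0.125
```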
Multinomial Naïve Bayes: Learning
• From the training corpus, extract the vocabulary V
• Calculate the P(c) terms:
  P(c) ← Nc / Ndoc
• Calculate the P(wk|c) terms. For each word wk in the vocabulary V:
  nk ← number of occurrences of wk in documents in class c
  n ← total number of words in documents in class c
  P(wk|c) ← (nk + α) / (n + α·|V|)
Worked example

          Doc  Words                                Class
Training:  1   Chinese Beijing Chinese              c
           2   Chinese Chinese Shanghai             c
           3   Chinese Macao                        c
           4   Tokyo Japan Chinese                  j
Test:      5   Chinese Chinese Chinese Tokyo Japan  ?

Priors:
P(c) = 3/4
P(j) = 1/4

Conditional probabilities (with add-1 smoothing):
P(Chinese|c) = (5+1) / (8+6) = 6/14 = 3/7
P(Tokyo|c)   = (0+1) / (8+6) = 1/14
P(Japan|c)   = (0+1) / (8+6) = 1/14
P(Chinese|j) = (1+1) / (3+6) = 2/9
P(Tokyo|j)   = (1+1) / (3+6) = 2/9
P(Japan|j)   = (1+1) / (3+6) = 2/9

Making a prediction:
P(c|d5) ∝ 3/4 × (3/7)³ × 1/14 × 1/14 ≈ 0.0003
P(j|d5) ∝ 1/4 × (2/9)³ × 2/9 × 2/9 ≈ 0.0001
so d5 is assigned class c.
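The whole pipeline is short enough to check by hand; this sketch (my own code, not from the chapter) reproduces the numbers above:

```python
from collections import Counter

train = [("Chinese Beijing Chinese", "c"),
         ("Chinese Chinese Shanghai", "c"),
         ("Chinese Macao", "c"),
         ("Tokyo Japan Chinese", "j")]
test = "Chinese Chinese Chinese Tokyo Japan"

vocab = {w for text, _ in train for w in text.split()}
classes = {label for _, label in train}

for c in sorted(classes):
    class_words = [w for text, lbl in train if lbl == c for w in text.split()]
    counts = Counter(class_words)
    prior = sum(1 for _, lbl in train if lbl == c) / len(train)
    denom = len(class_words) + len(vocab)   # add-1 smoothing denominator
    score = prior
    for w in test.split():
        score *= (counts[w] + 1) / denom    # smoothed P(w|c)
    print(c, round(score, 4))               # c 0.0003, j 0.0001
```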
Multinomial Naïve Bayes as a generative model
[Diagram: the class node c = China generates the words X1 = Shanghai, X2 = and, X3 = Shenzhen, X4 = issue, X5 = bonds]
Naïve Bayes and Language Modeling
• Naïve Bayes classifiers can use any sort of feature – URL, email address, network features etc.
• But if, as in the previous slides – We use only word features
• Then – Naïve Bayes is related to language models.
Each class is a unigram language model
• Assign each word a probability: P(word | c)
• Assign each sentence a probability: P(s|c) = Π P(word|c)
Class pos
0.1   I
0.1   love
0.01  this
0.05  fun
0.1   film
…

I love this fun film
0.1  0.1  0.01  0.05  0.1

P(s|pos) = 0.0000005
Naïve Bayes as a Language Model
• Which class assigns the higher probability to s?
Model pos          Model neg
0.1   I            0.2    I
0.1   love         0.001  love
0.01  this         0.01   this
0.05  fun          0.005  fun
0.1   film         0.1    film

I love this fun film
pos: 0.1 × 0.1 × 0.01 × 0.05 × 0.1
neg: 0.2 × 0.001 × 0.01 × 0.005 × 0.1

P(s|pos) > P(s|neg)
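Checking the comparison in code (per-word probabilities taken from the slide):

```python
import math

models = {
    "pos": {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1},
    "neg": {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1},
}
sentence = "I love this fun film".split()

for c, probs in models.items():
    p = math.prod(probs[w] for w in sentence)  # P(s|c) = product of P(word|c)
    print(f"P(s|{c}) = {p:.1e}")
# P(s|pos) = 5.0e-07  >  P(s|neg) = 1.0e-09
```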
Naïve Bayes in spam filtering
• SpamAssassin has a version that uses Naive Bayes, drawing on a variety of features of email messages:
– Mentions Viagra (or other drugs)
– Online pharmacy
– Mentions millions of (dollar) ((dollar) NN,NNN,NNN.NN)
– From: starts with many numbers
– Subject is all capitals
– Claims you can be removed from the list
– https://spamassassin.apache.org/old/tests_3_3_x.htm
Naïve Bayes Summary
• Very fast; low storage requirements
• Robust to irrelevant features
– Irrelevant features cancel each other out without affecting results
• Optimal if the independence assumptions hold
• A good dependable baseline for text classification
– But we will see other classifiers that give better accuracy