Page 1:


Text Classification: The Naïve Bayes algorithm

Adapted from lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford)

Page 2:

Relevance feedback revisited

In relevance feedback, the user marks a number of documents as relevant or nonrelevant; these judgments can then be used to improve search results.

Suppose we just tried to learn a filter for nonrelevant documents

This is an instance of a text classification problem:
Two "classes": relevant, nonrelevant
For each document, decide whether it is relevant or nonrelevant
The notion of classification is very general and has many applications within and beyond IR.

Page 3:

Standing queries

The path from information retrieval to text classification:
You have an information need, say: Unrest in the Niger delta region
You want to rerun an appropriate query periodically to find new news items on this topic
I.e., it's classification, not ranking
Such queries are called standing queries
Long used by "information professionals"
A modern mass instantiation is Google Alerts

Page 4:

Spam filtering: Another text classification task

From: "" <[email protected]>

Subject: real estate is the only way... gem oalvgkay

Anyone can buy real estate with no money down

Stop paying rent TODAY !

There is no need to spend hundreds or even thousands for similar courses

I am 22 years old and I have already purchased 6 properties using the

methods outlined in this truly INCREDIBLE ebook.

Change your life NOW !

=================================================

Click Below to order:

http://www.wholesaledaily.com/sales/nmd.htm

=================================================

Page 5:

Introduction to Information Retrieval

Page 6:

Text classification: Naïve Bayes Text Classification

Today: Introduction to Text Classification

Also widely known as “text categorization”.

Probabilistic Language Models

Naïve Bayes text classification: Multinomial, Bernoulli

Feature Selection

Page 7:

Formal definition of Classification: Training/Testing

Given:
A document space X: documents are represented in this space, typically some type of high-dimensional space.
A fixed set of classes C = {c1, c2, . . . , cJ}: the classes are human-defined for the needs of an application (e.g., relevant vs. nonrelevant).
A training set D of labeled documents, where each labeled document ⟨d, c⟩ ∈ X × C

Page 8:

(cont'd)

Using a learning method or learning algorithm, we wish to learn a classifier γ that maps documents to classes:

γ : X → C

Given: a description d ∈ X of a document
Determine: γ(d) ∈ C, that is, the class that is most appropriate for d

Page 9:

Topic classification

Page 10:

Document Classification

Classes: Multimedia, GUI, Garb.Coll., Semantics, ML, Planning (grouped under (AI), (Programming), (HCI), ...)

Training data (characteristic terms per class):
ML: learning, intelligence, algorithm, reinforcement, network, ...
Planning: planning, temporal, reasoning, plan, language, ...
Semantics: programming, semantics, language, proof, ...
Garb.Coll.: garbage, collection, memory, optimization, region, ...

Test data: "planning language proof intelligence"

(Note: in real life there is often a hierarchy, not present in the above problem statement; and also, you get papers on ML approaches to Garb. Coll.)

Page 11:

Examples of how search engines use classification

Language identification (classes: English vs. French, etc.)
The automatic detection of spam pages (spam vs. nonspam)
Topic-specific or vertical search: restrict search to a "vertical" like "related to health" (relevant to vertical vs. not)
Standing queries (e.g., Google Alerts)
Sentiment detection: is a movie or product review positive or negative (positive vs. negative)
The automatic detection of sexually explicit content (sexually explicit vs. not)

Page 12:

Classification Methods (1)

Manual classification
Used by Yahoo! (originally; now downplayed), Looksmart, about.com, ODP, PubMed
Very accurate when the job is done by experts
Consistent when the problem size and the team are small
Difficult and expensive to scale
Means we need automatic classification methods for big problems

Page 13:

Classification Methods (2)

Automatic document classification
Hand-coded rule-based systems
Used by CS departments' spam filters, Reuters, CIA, etc.
Companies (e.g., Verity) provide an "IDE" for writing such rules
E.g., assign a category if the document contains a given boolean combination of words
Standing queries: commercial systems have complex query languages (everything in IR query languages + accumulators)
Accuracy is often very high if a rule has been carefully refined over time by a subject expert
Building and maintaining these rules is expensive

Page 14:

A Verity topic (a complex classification rule)

Note: maintenance issues (author, etc.), hand-weighting of terms

Page 15:

Classification Methods (3)

Supervised learning of a document-label assignment function
Many systems partly rely on machine learning (Autonomy, MSN, Verity, Enkata, Yahoo!, ...)
k-Nearest Neighbors (simple, powerful)
Naive Bayes (simple, common method)
Support-vector machines (new, more powerful)
... plus many other methods
No free lunch: requires hand-classified training data
But data can be built up (and refined) by amateurs
Note that many commercial systems use a mixture of methods

Page 16:

Recall a few probability basics

For events a and b, Bayes' Rule:

p(a | b) = p(b | a) p(a) / p(b)

where p(a | b) is the posterior and p(a) is the prior.

Odds:

O(a) = p(a) / p(¬a) = p(a) / (1 − p(a))

Page 17:

Example of Diagnosis

p(disease | symptom) = p(symptom | disease) * p(disease) / p(symptom)

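A quick worked illustration with purely hypothetical numbers (not from the lecture): suppose p(disease) = 0.01, p(symptom | disease) = 0.9, and p(symptom) = 0.1. Then p(disease | symptom) = 0.9 * 0.01 / 0.1 = 0.09, i.e., observing the symptom raises the probability of the disease from 1% to 9%.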

Page 18:

Quick Summary: Conditioning illustrated using a Venn diagram

Page 19:

Bayesian Methods

Learning and classification methods based on probability theory.
Bayes' theorem plays a critical role in probabilistic learning and classification.
Build a generative model that approximates how data is produced.
Uses prior probability of each category given no information about an item.
Categorization produces a posterior probability distribution over the possible categories, given a description of an item (and prior probabilities).

Page 20:

The bag of words representation

I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.

γ(d) = c

Page 21:

The bag of words representation

γ(d) = c

great 2

love 2

recommend 1

laugh 1

happy 1

... ...
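A minimal sketch of how such a bag-of-words count vector can be produced (the tokenizer below is deliberately crude and illustrative; the counts on the slide come from the full review, not the excerpt shown above):

    import re
    from collections import Counter

    def bag_of_words(text):
        # Lowercase and keep only alphabetic tokens; a real system would use a proper tokenizer.
        return Counter(re.findall(r"[a-z]+", text.lower()))

    review = ("I love this movie! It's sweet, but with satirical humor. "
              "The dialogue is great and the adventure scenes are fun...")
    print(bag_of_words(review).most_common(3))   # e.g. [('the', 2), ('i', 1), ('love', 1)]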

Page 22:

Overall Approach Summary

Bayesian (robustness due to qualitative nature)
multivariate Bernoulli (counts of documents)
multinomial (counts of terms)

Simplifying assumptions for computational tractability:
conditional independence
positional independence

Page 23:

Overall Approach Summary

Dealing with idiosyncrasies of training data:
smoothing technique
avoiding underflow using logs

Improving efficiency and generalizability (noise reduction):
feature selection

Page 24:

Bayes’ Rule

P(C, D) = P(C | D) P(D) = P(D | C) P(C)

P(C | D) = P(D | C) P(C) / P(D)

Page 25:

Naive Bayes Classifiers

Task: Classify a new instance D based on a tuple of attribute values D = ⟨x1, x2, …, xn⟩ into one of the classes cj ∈ C

c_MAP = argmax_{cj ∈ C} P(cj | x1, x2, …, xn)
      = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj) / P(x1, x2, …, xn)
      = argmax_{cj ∈ C} P(x1, x2, …, xn | cj) P(cj)


MAP = Maximum Aposteriori Probability

Page 26:

Naïve Bayes Classifier: Naïve Bayes Assumption

P(cj): can be estimated from the frequency of classes in the training examples.
P(x1, x2, …, xn | cj): O(|X|^n · |C|) parameters; could only be estimated if a very, very large number of training examples was available.
Naïve Bayes Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi | cj).

Page 27:

The Naïve Bayes Classifier

[Figure: class variable Flu with feature nodes X1 … X5 = fever, sinus, cough, runny nose, muscle-ache]

Conditional Independence Assumption: features (term presence) are independent of each other given the class:

P(X1, …, X5 | C) = P(X1 | C) P(X2 | C) … P(X5 | C)

This model is appropriate for binary variables: the multivariate Bernoulli model

Page 28:

Learning the Model

First attempt: maximum likelihood estimates (simply use the frequencies in the data)

P̂(cj) = N(C = cj) / N

P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)

[Figure: class variable C with feature nodes X1 … X6]

Page 29:

The problem with maximum likelihood estimates: Zeros

P(China|d) ∝ P(China) ・ P(BEIJING|China) ・ P(AND|China) ・ P(TAIPEI|China) ・P(JOIN|China) ・ P(WTO|China)

If WTO never occurs in class China in the training set, we get a zero estimate for P(WTO | China).

Page 30:

The problem with maximum likelihood estimates: Zeros (cont'd)

If there were no occurrences of WTO in documents in class China, we’d get a zero estimate:

→ We will get P(China|d) = 0 for any document that contains WTO!

Zero probabilities cannot be conditioned away.

Page 31:

Smoothing to Avoid Overfitting

P̂(xi | cj) = ( N(Xi = xi, C = cj) + 1 ) / ( N(C = cj) + k )

where k = the number of values of Xi

Page 32:

To avoid zeros: Add-one smoothing

Before:

Now: Add one to each count to avoid zeros:

B is the number of different words (in this case the size of the vocabulary: |V| = M)
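The two estimates being contrasted here, in the standard multinomial notation (T_ct is the number of occurrences of term t in training documents of class c; these formulas are the usual IIR Ch. 13 ones, restated here):

Before:  P̂(t | c) = T_ct / Σ_{t′ ∈ V} T_ct′

Now:     P̂(t | c) = ( T_ct + 1 ) / ( Σ_{t′ ∈ V} T_ct′ + B )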

Page 33:

Stochastic Language Models

Models probability of generating strings (each word in turn) in the language (commonly all strings over ∑). E.g., unigram model:

Model M:
0.2    the
0.1    a
0.01   man
0.01   woman
0.03   said
0.02   likes
...

the    man    likes   the    woman
0.2    0.01   0.02    0.2    0.01     (multiply)

P(s | M) = 0.00000008

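A minimal sketch of this computation (the probability table is model M above; words outside the table are simply assumed not to occur):

    # Unigram probabilities from model M on the slide.
    model_m = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

    def sentence_probability(sentence, model):
        # Multiply the unigram probabilities of the words in the sentence.
        p = 1.0
        for word in sentence.split():
            p *= model[word]
        return p

    print(sentence_probability("the man likes the woman", model_m))   # ≈ 8e-08 (= 0.00000008)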

Page 34:

Stochastic Language Models

Model probability of generating any string

Model M1              Model M2
0.2     the           0.2     the
0.01    class         0.0001  class
0.0001  sayst         0.03    sayst
0.0001  pleaseth      0.02    pleaseth
0.0001  yon           0.1     yon
0.0005  maiden        0.01    maiden
0.01    woman         0.0001  woman

s = "the class pleaseth yon maiden"

       the    class    pleaseth   yon      maiden
M1:    0.2    0.01     0.0001     0.0001   0.0005
M2:    0.2    0.0001   0.02       0.1      0.01

P(s | M2) > P(s | M1)

Page 35:

Unigram and higher-order models

Unigram Language Models:
P(w1 w2 w3 w4) = P(w1) P(w2) P(w3) P(w4)

Bigram (generally, n-gram) Language Models:
P(w1 w2 w3 w4) = P(w1) P(w2 | w1) P(w3 | w2) P(w4 | w3)

Other Language Models:
Grammar-based models (PCFGs), etc.
Probably not the first thing to try in IR

Easy. Effective!

Page 36:

Using Multinomial Naive Bayes Classifiers to Classify Text: Basic method

Attributes are text positions, values are words.

c_NB = argmax_{cj ∈ C} P(cj) ∏_i P(xi | cj)
     = argmax_{cj ∈ C} P(cj) P(x1 = "our" | cj) … P(xn = "text" | cj)

Still too many possibilities
Assume that classification is independent of the positions of the words:
Use same parameters for each position
Result is bag of words model (over tokens, not types)

Page 37:

Underflow Prevention: log space

Multiplying lots of probabilities, which are between 0 and 1, can result in floating-point underflow.

Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.

Class with highest final un-normalized log probability score is still the most probable.

Note that model is now just max of sum of weights…

c_NB = argmax_{cj ∈ C} [ log P(cj) + Σ_{i ∈ positions} log P(xi | cj) ]

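A small illustration of why the log transform matters; the probabilities below are hypothetical, chosen only to show the product underflowing while the log sum stays usable:

    import math

    # Hypothetical per-position probabilities for one class: 2000 terms, each with P = 1e-4.
    probs = [1e-4] * 2000

    product = 1.0
    for p in probs:
        product *= p          # underflows to 0.0 long before the loop ends

    log_score = sum(math.log(p) for p in probs)   # stays finite: 2000 * log(1e-4)

    print(product)    # 0.0
    print(log_score)  # about -18420.7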

Page 38:

Note: Two Models

Model 1: Multivariate Bernoulli
One feature Xw for each word in the dictionary
Xw = true in document d if w appears in d
Naive Bayes assumption: given the document's topic, appearance of one word in the document tells us nothing about the chances that another word appears

Page 39:

Two Models

Model 2: Multinomial = Class conditional unigram
One feature Xi for each word position in the document
The feature's values are all the words in the dictionary
The value of Xi is the word in position i
Naïve Bayes assumption: given the document's topic, the word in one position in the document tells us nothing about words in other positions
Second assumption: word appearance does not depend on position:

P(Xi = w | c) = P(Xj = w | c)   for all positions i, j, words w, and classes c

Page 40:

Parameter estimation (2 approaches)

Multivariate Bernoulli model:
P̂(Xw = true | cj) = fraction of documents of topic cj in which word w appears

Multinomial model:
P̂(Xi = w | cj) = fraction of times in which word w appears across all documents of topic cj
Can create a mega-document for topic j by concatenating all documents on this topic
Use frequency of w in the mega-document

Page 41:

Classification

Multinomial vs Multivariate Bernoulli?
The multinomial model is almost always more effective in text applications!
Casual occurrence is distinguished from emphatic occurrence in a long document
See IIR sections 13.2 and 13.3 for worked examples with each model

Page 42:

Quick Summary: Conditioning illustrated using a Venn diagram

Page 43:

Quick Summary: Learning Problem

Learn a (one-of, multi-class) classifier by generalizing (interpolating and extrapolating) from a labeled training dataset to be used on a test set with similar distribution (representativeness)

Page 44:

Quick Summary: Naïve Bayes Approach

FOR TRACTABILITY, use the "Bag of Words" model:
Conditional Independence: no mutual dependence among words (cf. phrases such as San Francisco and Tahrir Square)
Positional Independence: no dependence on the position of occurrence in a sentence or paragraph (cf. articles and prepositions such as "The …", "… of …", "… and", and "… at")

Page 45:

Quick Summary: Naïve Bayes Approach

FOR PRACTICALITY, use (Laplace) smoothing to overcome problems with a sparse/incomplete training set.

E.g., P("Britain is a member of WTO" | UK) >> 0

in spite of the fact that

P("WTO" | UK) ~ 0

Page 46:

Quick Summary: Naïve Bayes Approach

FOR PRACTICALITY, use logarithms to overcome underflow problems caused by small numbers. E.g.:
Loglikelihood(hello) = -4.27
Loglikelihood(dear) = -4.57
Loglikelihood(dolly) = -5.49
Loglikelihood(stem) = -4.76
Loglikelihood(hello dear) = -7.16
Loglikelihood(hello dolly) = -6.83
Loglikelihood(hello stem) = -10.33

Page 47:

Quick Summary: Naïve Bayes Approach

ESTIMATING PROBABILITY
Multinomial Naïve Bayes Model: one count per term for each term occurrence (cf. term frequency)
Multivariate Bernoulli Model: one count per term for each document it occurs in (cf. document frequency); also takes into account non-occurrence

FEATURE SELECTION
χ² statistic approach
Information-theoretic approach

Page 48:

Overall Approach

Estimate parameters from the training corpus using add-one smoothing

For a new document, for each class, compute sum of (i) log of prior and (ii) logs of conditional probabilities of the terms

Assign the document to the class with the largest score

Page 49:

Naive Bayes: Training

Page 50:

Naive Bayes: Testing

Page 51:

Exercise

Estimate parameters of Naive Bayes classifier Classify test document

Page 52:

Example: Parameter estimates

The denominators are (8 + 6) and (3 + 6) because the lengths of text_c and text_c̄ are 8 and 3, respectively, and because the constant B is 6, as the vocabulary consists of six terms.
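For reference, the smoothed estimates the slide displays (from the IIR Ch. 13 worked example, with c = China and c̄ = not-China) are:

P̂(Chinese | c) = (5 + 1) / (8 + 6) = 6/14 = 3/7
P̂(Tokyo | c) = P̂(Japan | c) = (0 + 1) / (8 + 6) = 1/14
P̂(Chinese | c̄) = (1 + 1) / (3 + 6) = 2/9
P̂(Tokyo | c̄) = P̂(Japan | c̄) = (1 + 1) / (3 + 6) = 2/9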

Page 53:

Example: Classification

Thus, the classifier assigns the test document to c = China. The reason for this classification decision is that the three occurrences of the positive indicator CHINESE in d5 outweigh the occurrences of the two negative indicators JAPAN and TOKYO.

Page 54:

Time complexity of Naive Bayes

Lave: average length of a training doc; La: length of the test doc; Ma: number of distinct terms in the test doc; D: training set; V: vocabulary; C: set of classes

Θ(|D| Lave) is the time it takes to compute all counts.
Θ(|C| |V|) is the time it takes to compute the parameters from the counts.
Generally |D| Lave > |C| |V|, so training time is Θ(|D| Lave + |C| |V|).
Test time is also linear (in the length of the test document): Θ(|C| Ma).
Thus: Naive Bayes is linear in the size of the training set (training) and the test document (testing). This is optimal.

Page 55:

Bernoulli: Training

Page 56:

Bernoulli: Testing

Page 57:

Exercise

Estimate parameters of Bernoulli classifier Classify test document

Page 58:

Example: Parameter estimates

The denominators are (3 + 2) and (1 + 2) because there are three documents in c and one document in c̄, and because the constant B is 2, as there are two cases to consider for each term: occurrence and non-occurrence. (With no information, the probability is 0.5.)

Page 59:

Example: Classification

Thus, the classifier assigns the test document to c̄ = not-China. When looking only at binary occurrence and not at term frequency, JAPAN and TOKYO are indicators for c̄ (2/3 > 1/5), and the conditional probabilities of CHINESE for c and c̄ are not different enough (4/5 vs. 2/3) to affect the classification decision.

Page 60:

Why does Naive Bayes work?

Naive Bayes can work well even though conditional independence assumptions are badly violated.

Example:

Classification is about predicting the correct class and not about accurately estimating probabilities.

Correct estimation ⇒ accurate prediction. But not vice versa!

Page 61:

Naive Bayes is not so naive

Naive Bayes has won some bakeoffs (e.g., KDD-CUP 97)
More robust to nonrelevant features than some more complex learning methods
More robust to concept drift (changing of the definition of a class over time) than some more complex learning methods
Better than methods like decision trees when we have many equally important features
A good, dependable baseline for text classification
Optimal if independence assumptions hold (never true for text, but true for some domains)
Very fast
Low storage requirements

Page 62:

Feature Selection: Why?

Text collections have a large number of features: 10,000 – 1,000,000 unique words … and more

Feature Selection
Makes using a particular classifier feasible: some classifiers can't deal with hundreds of thousands of features
Reduces training time: training time for some methods is quadratic or worse in the number of features
Can improve generalization (performance): eliminates noise features, avoids overfitting

Page 63:

Feature selection: how?

Two ideas:
1) Hypothesis testing statistics: are we confident that the value of one categorical variable is associated with the value of another? Chi-square (χ²) test
2) Information theory: how much information does the value of one categorical variable give you about the value of another? Mutual information (MI)

They're similar, but χ² measures confidence in an association (based on available statistics), while MI measures the extent of an association (assuming perfect knowledge of probabilities)

Page 64:

χ² statistic (CHI)

The χ² test is used to test the independence of two events, which for us are the occurrence of a term and the occurrence of a class.

N.. = observed frequency, E.. = expected frequency

Page 65:

Example from Reuters-RCV1 (term = export, class = poultry)

Page 66:

Computing expected frequency


where, N = 49 + 141 + 27,652 + 774,106 = 801,948

Page 67:

Observed frequencies and computed expected frequencies


where, N = 49 + 141 + 27,652 + 774,106 = 801,948

Page 68:

Interpretation of CHI-squared results


A χ² value of 284 >> 10.83 (the critical value at the 0.001 significance level) implies that poultry and export are dependent with 99.9% certainty, as the probability of them being independent is below 0.1%.
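A small sketch that reproduces this example from the observed counts above (the names N11, N10, N01, N00 follow IIR's notation: the first index is occurrence of the term export, the second is membership in the class poultry):

    # Observed counts from the Reuters-RCV1 example: term = export, class = poultry.
    n11, n10 = 49, 141          # term present:  in class poultry / not in poultry
    n01, n00 = 27_652, 774_106  # term absent:   in class poultry / not in poultry
    n = n11 + n10 + n01 + n00   # 801,948

    observed = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    term_marginal = {1: n11 + n10, 0: n01 + n00}
    class_marginal = {1: n11 + n01, 0: n10 + n00}

    chi2 = 0.0
    for (et, ec), n_obs in observed.items():
        expected = n * (term_marginal[et] / n) * (class_marginal[ec] / n)
        chi2 += (n_obs - expected) ** 2 / expected

    print(round(chi2, 1))   # roughly 284, far above the 0.001 critical value of 10.83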

Page 69:

Feature selection via Mutual Information

In the training set, choose k words which best discriminate (give most information about) the categories.

The mutual information between a word w and a class c is:

I(w, c) = Σ_{e_w ∈ {0,1}} Σ_{e_c ∈ {0,1}} p(e_w, e_c) log [ p(e_w, e_c) / ( p(e_w) p(e_c) ) ]

computed for each word w and each category c.

Page 70:

Formula for computation (Equation 13.17)


N01 is the number of documents that contain the term t (et = 1) and are not in class c (ec = 0). N1. is the number of documents that contain the term t (et = 1).
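A sketch of the count-based form this equation takes (with marginals N1. = N11 + N10, N.1 = N11 + N01, and so on), reusing the export/poultry counts from the χ² example:

    import math

    def mutual_information(n11, n10, n01, n00):
        # Expected mutual information between term occurrence and class membership.
        n = n11 + n10 + n01 + n00
        n1_, n0_ = n11 + n10, n01 + n00   # term present / absent marginals
        n_1, n_0 = n11 + n01, n10 + n00   # in class / not in class marginals
        mi = 0.0
        for n_joint, n_term, n_class in [(n11, n1_, n_1), (n10, n1_, n_0),
                                         (n01, n0_, n_1), (n00, n0_, n_0)]:
            if n_joint > 0:
                mi += (n_joint / n) * math.log2(n * n_joint / (n_term * n_class))
        return mi

    print(mutual_information(49, 141, 27_652, 774_106))   # a small positive value for export/poultry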

Page 71:

Example

Page 72:

Feature selection via MI (contd.)

For each category we build a list of the k most discriminating terms.
For example (on 20 Newsgroups):
sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, …
rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, …
Greedy: does not account for correlations between terms

Page 73:

Features with High MI scores for 6 Reuters-RCV1 classes

Page 74:

Feature Selection

Mutual Information: clear information-theoretic interpretation; may select rare uninformative terms
Chi-square: statistical foundation; may select very slightly informative frequent terms that are not very useful for classification
Just use the commonest terms? No particular foundation, but in practice this is often 90% as good

Page 75:

Feature selection for NB: Overall Significance

In general, feature selection is necessary for the multivariate Bernoulli NB model. Otherwise, you suffer from noise and multi-counting.
"Feature selection" really means something different for multinomial NB: it means dictionary truncation. This "feature selection" normally isn't needed for multinomial NB.

Page 76:

SpamAssassin

Naïve Bayes has found a home in spam filtering
Paul Graham's A Plan for Spam
A mutant with more mutant offspring...
A Naive Bayes-like classifier with weird parameter estimation
Widely used in spam filters
Classic Naive Bayes is superior when appropriately used (according to David D. Lewis)
But also many other things: black hole lists, etc.
Many email topic filters also use NB classifiers

