
Text Classification: The Naïve Bayes Algorithm

Adapted from lectures by Prabhakar Raghavan (Google) and Christopher Manning (Stanford)

Relevance feedback revisited

In relevance feedback, the user marks a number of documents as relevant or nonrelevant; these judgments can be used to improve search results.

Suppose we just tried to learn a filter for nonrelevant documents

This is an instance of a text classification problem:
Two "classes": relevant and nonrelevant
For each document, decide whether it is relevant or nonrelevant

The notion of classification is very general and has many applications within and beyond IR.


Standing queries

The path from information retrieval to text classification: you have an information need, say:
"Unrest in the Niger delta region"
You want to rerun an appropriate query periodically to find new news items on this topic, i.e., it's classification, not ranking.

Such queries are called standing queries. They have long been used by "information professionals"; a modern mass instantiation is Google Alerts.


Spam filtering: Another text classification task

From: "" <takworlld@hotmail.com>

Subject: real estate is the only way... gem oalvgkay

Anyone can buy real estate with no money down

Stop paying rent TODAY !

There is no need to spend hundreds or even thousands for similar courses

I am 22 years old and I have already purchased 6 properties using the

methods outlined in this truly INCREDIBLE ebook.

Change your life NOW !

=================================================

Click Below to order:

http://www.wholesaledaily.com/sales/nmd.htm

=================================================



Text classification: Naïve Bayes

Today: Introduction to Text Classification

Also widely known as “text categorization”.

Probabilistic Language Models

Naïve Bayes text classification: Multinomial and Bernoulli models

Feature Selection



Formal definition of Classification: Training/Testing


Given:
A document space X. Documents are represented in this space, typically some type of high-dimensional space.
A fixed set of classes C = {c_1, c_2, . . . , c_J}. The classes are human-defined for the needs of an application (e.g., relevant vs. nonrelevant).
A training set D of labeled documents, where each labeled document ⟨d, c⟩ ∈ X × C.


(cont’d)


Using a learning method or learning algorithm, we wish to learn a classifier γ that maps documents to classes:

γ : X → C

Given: a description d ∈ X of a document
Determine: γ(d) ∈ C, that is, the class that is most appropriate for d


Topic classification


[Figure: document classification example. Top-level classes (AI), (Programming), (HCI) contain subclasses such as ML, Planning, Semantics, Garbage Collection, Multimedia, GUI. The training data consists of documents with class-typical terms, e.g., Planning: planning, temporal, reasoning, plan, language; ML: learning, intelligence, algorithm, reinforcement, network; Semantics: programming, semantics, language, proof; Garbage Collection: garbage, collection, memory, optimization, region. The test document "planning language proof intelligence" must be assigned to one of these classes.]

(Note: in real life there is often a hierarchy, not present in the above problem statement; and also, you get papers on ML approaches to Garbage Collection.)



Examples of how search engines use classification


Language identification (classes: English vs. French, etc.)
The automatic detection of spam pages (spam vs. nonspam)
Topic-specific or vertical search: restrict search to a "vertical" like "related to health" (relevant to vertical vs. not)
Standing queries (e.g., Google Alerts)
Sentiment detection: is a movie or product review positive or negative (positive vs. negative)
The automatic detection of sexually explicit content (sexually explicit vs. not)

Classification Methods (1)

Manual classification
Used by Yahoo! (originally; now downplayed), Looksmart, about.com, ODP, PubMed
Very accurate when the job is done by experts
Consistent when the problem size and team are small
Difficult and expensive to scale
Means we need automatic classification methods for big problems


Classification Methods (2)

Automatic document classification
Hand-coded rule-based systems
Used by CS dept's spam filter, Reuters, CIA, etc.
Companies (e.g., Verity) provide an "IDE" for writing such rules
E.g., assign category if document contains a given boolean combination of words
Standing queries: commercial systems have complex query languages (everything in IR query languages + accumulators)
Accuracy is often very high if a rule has been carefully refined over time by a subject expert
Building and maintaining these rules is expensive


A Verity topic (a complex classification rule)

Note: maintenance issues (author, etc.); hand-weighting of terms.


Classification Methods (3)

Supervised learning of a document-label assignment function
Many systems partly rely on machine learning (Autonomy, MSN, Verity, Enkata, Yahoo!, …)
k-Nearest Neighbors (simple, powerful)
Naive Bayes (simple, common method)
Support-vector machines (new, more powerful)
… plus many other methods
No free lunch: requires hand-classified training data
But data can be built up (and refined) by amateurs

Note that many commercial systems use a mixture of methods


Recall a few probability basics

For events a and b:

Bayes' Rule: P(a \mid b) = \frac{P(b \mid a)\, P(a)}{P(b)}    (posterior \propto prior \times likelihood)

Odds: O(a) = \frac{P(a)}{P(\bar{a})} = \frac{P(a)}{1 - P(a)}


Example of Diagnosis

P(\text{disease} \mid \text{symptom}) = \frac{P(\text{symptom} \mid \text{disease}) \, P(\text{disease})}{P(\text{symptom})}


Quick Summary: Conditioning illustrated using a Venn diagram


Bayesian Methods

Learning and classification methods based on probability theory.
Bayes' theorem plays a critical role in probabilistic learning and classification.
Build a generative model that approximates how data is produced.
Uses the prior probability of each category given no information about an item.
Categorization produces a posterior probability distribution over the possible categories given a description of an item (and the prior probabilities).



The bag of words representation

I love this movie! It's sweet, but with satirical humor. The dialogue is great and the adventure scenes are fun… It manages to be whimsical and romantic while laughing at the conventions of the fairy tale genre. I would recommend it to just about anyone. I've seen it several times, and I'm always happy to see it again whenever I have a friend who hasn't seen it yet.

γ(d) = c   (where d is the review above)

The bag of words representation (continued)

The same decision γ(d) = c, but now the document d is reduced to a vector of term counts:

great 2

love 2

recommend 1

laugh 1

happy 1

... ...
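As a minimal illustration (a sketch in plain Python using only the standard library; the truncated review string and the simple tokenizer are assumptions for illustration, not part of the slides), the bag of words reduction looks like this:

```python
from collections import Counter
import re

review = ("I love this movie! It's sweet, but with satirical humor. The dialogue "
          "is great and the adventure scenes are fun... I would recommend it to "
          "just about anyone.")

# Tokenize: lowercase and keep alphabetic tokens (a deliberate simplification).
tokens = re.findall(r"[a-z']+", review.lower())

# The bag of words: order and position are discarded, only term counts remain.
bag = Counter(tokens)
print(bag.most_common(5))
```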

Overall Approach Summary

Bayesian (robustness due to qualitative nature)
multivariate Bernoulli (counts of documents)
multinomial (counts of terms)

Simplifying assumptions for computational tractability:
conditional independence
positional independence


Overall Approach Summary

Dealing with idiosyncrasies of the training data:
smoothing
avoiding underflow by using logs

Improving efficiency and generalizability (noise reduction):
feature selection


Bayes’ Rule

P(C, D) = P(C \mid D)\, P(D) = P(D \mid C)\, P(C)

P(C \mid D) = \frac{P(D \mid C)\, P(C)}{P(D)}


Naive Bayes Classifiers

Task: classify a new instance D, described by a tuple of attribute values D = ⟨x_1, x_2, \ldots, x_n⟩, into one of the classes c_j ∈ C.

c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n)

        = \arg\max_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \ldots, x_n)}

        = \arg\max_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)


MAP = Maximum A Posteriori Probability

Naïve Bayes Classifier: Naïve Bayes Assumption

P(c_j): can be estimated from the frequency of classes in the training examples.

P(x_1, x_2, …, x_n \mid c_j): O(|X|^n \cdot |C|) parameters; could only be estimated if a very, very large number of training examples was available.

Naïve Bayes Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(x_i \mid c_j).


[Figure: a naïve Bayes model with class Flu and features X1 … X5 (fever, sinus, cough, runny nose, muscle-ache), each conditionally independent given the class.]

The Naïve Bayes Classifier

Conditional Independence Assumption: features (term presence) are independent of each other given the class:

P(X_1, \ldots, X_5 \mid C) = P(X_1 \mid C) \cdot P(X_2 \mid C) \cdot P(X_3 \mid C) \cdot P(X_4 \mid C) \cdot P(X_5 \mid C)

This model is appropriate for binary variables: the multivariate Bernoulli model.


Learning the Model

First attempt: maximum likelihood estimates, i.e., simply use the frequencies in the data:

\hat{P}(c_j) = \frac{N(C = c_j)}{N}

\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j)}{N(C = c_j)}

[Figure: naïve Bayes network with class node C and feature nodes X1 … X6.]

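A minimal sketch of these relative-frequency (maximum likelihood) estimates in Python; the tiny labeled corpus and the function name are illustrative assumptions, not from the slides:

```python
from collections import Counter, defaultdict

# Toy labeled training data: (document tokens, class label).
training = [
    (["chinese", "beijing", "chinese"], "china"),
    (["chinese", "chinese", "shanghai"], "china"),
    (["chinese", "macao"], "china"),
    (["tokyo", "japan", "chinese"], "not-china"),
]

# Prior: P^(c_j) = N(C = c_j) / N
N = len(training)
prior = {c: n / N for c, n in Counter(label for _, label in training).items()}

# Conditional: P^(x_i | c_j) = N(X_i = x_i, C = c_j) / N(C = c_j),
# here counted over token occurrences within each class (multinomial flavor).
term_counts = defaultdict(Counter)
for tokens, label in training:
    term_counts[label].update(tokens)

def p_mle(term, label):
    return term_counts[label][term] / sum(term_counts[label].values())

print(prior)                      # {'china': 0.75, 'not-china': 0.25}
print(p_mle("chinese", "china"))  # 5/8 = 0.625
print(p_mle("tokyo", "china"))    # 0.0 -> the zero-probability problem (next slides)
```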


The problem with maximum likelihood estimates: Zeros


P(China | d) ∝ P(China) · P(BEIJING | China) · P(AND | China) · P(TAIPEI | China) · P(JOIN | China) · P(WTO | China)

If WTO never occurs in class China in the train set:


The problem with maximum likelihood estimates: Zeros (cont'd)


If there were no occurrences of WTO in documents in class China, we'd get a zero estimate: \hat{P}(\text{WTO} \mid \text{China}) = 0.

→ We will get P(China|d) = 0 for any document that contains WTO!

Zero probabilities cannot be conditioned away.

Smoothing to Avoid Overfitting

\hat{P}(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j) + 1}{N(C = c_j) + k}

where k is the number of values of X_i.



To avoid zeros: Add-one smoothing


Before:

\hat{P}(t \mid c) = \frac{T_{ct}}{\sum_{t' \in V} T_{ct'}}

Now: add one to each count to avoid zeros:

\hat{P}(t \mid c) = \frac{T_{ct} + 1}{\sum_{t' \in V} (T_{ct'} + 1)} = \frac{T_{ct} + 1}{\left(\sum_{t' \in V} T_{ct'}\right) + B}

where B is the number of different words (in this case the size of the vocabulary: |V| = M).
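A small sketch of the add-one estimate (the per-class token counts below are hypothetical, chosen so the denominators match the worked example later in these slides):

```python
from collections import Counter

# Hypothetical per-class token counts T_ct over a six-term vocabulary V (so B = 6).
T = {"china":     Counter({"chinese": 5, "beijing": 1, "shanghai": 1, "macao": 1}),
     "not-china": Counter({"chinese": 1, "tokyo": 1, "japan": 1})}
V = {"chinese", "beijing", "shanghai", "macao", "tokyo", "japan"}
B = len(V)

def p_hat(term, c):
    # (T_ct + 1) / (sum_t' T_ct' + B): no term in V gets probability zero.
    return (T[c][term] + 1) / (sum(T[c].values()) + B)

print(p_hat("chinese", "china"))  # (5 + 1) / (8 + 6)
print(p_hat("tokyo", "china"))    # (0 + 1) / (8 + 6), nonzero despite a zero count
```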

Stochastic Language Models

Models the probability of generating strings (each word in turn) in the language (commonly all strings over the alphabet Σ). E.g., a unigram model M:

the    0.2
a      0.1
man    0.01
woman  0.01
said   0.03
likes  0.02

For s = "the man likes the woman", multiply the word probabilities:

P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008

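The same computation as a small sketch (the table of probabilities is the one above; the function name is mine):

```python
# Unigram model M: P(word) for each word in a tiny vocabulary.
M = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01, "said": 0.03, "likes": 0.02}

def p_string(words, model):
    # Unigram assumption: each word is generated independently, so probabilities multiply.
    p = 1.0
    for w in words:
        p *= model[w]
    return p

print(p_string("the man likes the woman".split(), M))  # 8e-08, i.e., 0.00000008
```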

Stochastic Language Models

Model the probability of generating any string. Compare two unigram models:

Model M1:            Model M2:
the      0.2         the      0.2
class    0.01        class    0.0001
sayst    0.0001      sayst    0.03
pleaseth 0.0001      pleaseth 0.02
yon      0.0001      yon      0.1
maiden   0.0005      maiden   0.01
woman    0.01        woman    0.0001

For s = "the class pleaseth yon maiden":

P(s | M1) = 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005
P(s | M2) = 0.2 × 0.0001 × 0.02 × 0.1 × 0.01

P(s | M2) > P(s | M1)


Unigram and higher-order models

Chain rule (full model):
P(w_1 w_2 w_3 w_4) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_1 w_2)\, P(w_4 \mid w_1 w_2 w_3)

Unigram Language Models:
P(w_1 w_2 w_3 w_4) = P(w_1)\, P(w_2)\, P(w_3)\, P(w_4)
Easy. Effective!

Bigram (generally, n-gram) Language Models:
P(w_1 w_2 w_3 w_4) = P(w_1)\, P(w_2 \mid w_1)\, P(w_3 \mid w_2)\, P(w_4 \mid w_3)

Other Language Models:
Grammar-based models (PCFGs), etc.
Probably not the first thing to try in IR


Using Multinomial Naive Bayes Classifiers to Classify Text: Basic method

Attributes are text positions, values are words.

c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i} P(x_i \mid c_j)
       = \arg\max_{c_j \in C} P(c_j)\, P(x_1 = \text{``our''} \mid c_j) \cdots P(x_n = \text{``text''} \mid c_j)

Still too many possibilities. Assume that classification is independent of the positions of the words:
Use the same parameters for each position
Result is the bag of words model (over tokens, not types)


Underflow Prevention: log space

Multiplying lots of probabilities, which are between 0 and 1, can result in floating-point underflow.

Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.

Class with highest final un-normalized log probability score is still the most probable.

Note that model is now just max of sum of weights…

c_{NB} = \arg\max_{c_j \in C} \Big[ \log P(c_j) + \sum_{i \in \text{positions}} \log P(x_i \mid c_j) \Big]

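A minimal sketch of the log-space decision rule; the toy priors and conditional probabilities are assumptions for illustration (they happen to mirror the worked example later in these slides):

```python
import math

prior = {"china": 3/4, "not-china": 1/4}
cond = {"china":     {"chinese": 3/7, "tokyo": 1/14, "japan": 1/14},
        "not-china": {"chinese": 2/9, "tokyo": 2/9,  "japan": 2/9}}

def classify(tokens):
    scores = {}
    for c in prior:
        # Sum logs instead of multiplying probabilities: no underflow,
        # and the argmax is unchanged because log is monotonic.
        scores[c] = math.log(prior[c]) + sum(math.log(cond[c][t]) for t in tokens)
    return max(scores, key=scores.get), scores

print(classify(["chinese", "chinese", "chinese", "tokyo", "japan"]))
```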

Note: Two Models

Model 1: Multivariate Bernoulli
One feature X_w for each word in the dictionary
X_w = true in document d if w appears in d
Naive Bayes assumption: given the document's topic, appearance of one word in the document tells us nothing about the chances that another word appears


Two Models

Model 2: Multinomial = class-conditional unigram
One feature X_i for each word position in the document
The feature's values are all words in the dictionary
The value of X_i is the word in position i
Naïve Bayes assumption: given the document's topic, the word in one position in the document tells us nothing about words in other positions
Second assumption: word appearance does not depend on position:

P(X_i = w \mid c) = P(X_j = w \mid c)

for all positions i, j, words w, and classes c


Parameter estimation (2 approaches)

Multivariate Bernoulli model:

\hat{P}(X_w = \text{true} \mid c_j) = \text{fraction of documents of topic } c_j \text{ in which word } w \text{ appears}

Multinomial model:

\hat{P}(X_i = w \mid c_j) = \text{fraction of times in which word } w \text{ appears across all documents of topic } c_j

Can create a mega-document for topic j by concatenating all documents on this topic; use the frequency of w in the mega-document.

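A side-by-side sketch of the two (unsmoothed) estimators on a toy corpus; the corpus and function names are illustrative assumptions:

```python
from collections import Counter

docs = [
    (["chinese", "beijing", "chinese"], "china"),
    (["chinese", "chinese", "shanghai"], "china"),
    (["chinese", "macao"], "china"),
    (["tokyo", "japan", "chinese"], "not-china"),
]

def bernoulli_estimate(w, c):
    # Fraction of documents of topic c in which word w appears (document frequency).
    in_class = [tokens for tokens, label in docs if label == c]
    return sum(w in tokens for tokens in in_class) / len(in_class)

def multinomial_estimate(w, c):
    # Fraction of token positions equal to w in the concatenated "mega-document" of topic c.
    mega = [t for tokens, label in docs if label == c for t in tokens]
    return Counter(mega)[w] / len(mega)

print(bernoulli_estimate("chinese", "china"))    # 3/3: appears in every "china" document
print(multinomial_estimate("chinese", "china"))  # 5/8: 5 of the 8 "china" tokens
```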

Classification

Multinomial vs Multivariate Bernoulli?

The multinomial model is almost always more effective in text applications!
Casual occurrence is distinguished from emphatic occurrence in a long document.

See IIR sections 13.2 and 13.3 for worked examples with each model.


Quick Summary: Conditioning illustrated using a Venn diagram


Quick Summary: Learning Problem

Learn a (one-of, multi-class) classifier by generalizing (interpolating and extrapolating) from a labeled training dataset to be used on a test set with similar distribution (representativeness)


Quick Summary: Naïve Bayes Approach

FOR TRACTABILITY, use the "Bag of Words" model:
Conditional independence: no mutual dependence among words (cf. phrases such as San Francisco and Tahrir Square)
Positional independence: no dependence on the position of occurrence in a sentence or paragraph (cf. articles and prepositions such as "The …", "… of …", "… and", and "… at")


Quick Summary: Naïve Bayes Approach

FOR PRACTICALITY, use (Laplace) smoothing: to overcome problems with a sparse/incomplete training set.

E.g.,

P(“Britain is a member of WTO” | UK) >> 0

in spite of the fact that

P(“WTO”|UK) ~ 0


Quick Summary: Naïve Bayes Approach

FOR PRACTICALITY, use logarithms: to overcome underflow problems caused by small numbers.
E.g., Loglikelihood(hello) = -4.27
E.g., Loglikelihood(dear) = -4.57
E.g., Loglikelihood(dolly) = -5.49
E.g., Loglikelihood(stem) = -4.76
E.g., Loglikelihood(hello dear) = -7.16
E.g., Loglikelihood(hello dolly) = -6.83
E.g., Loglikelihood(hello stem) = -10.33


Quick Summary: Naïve Bayes Approach

ESTIMATING PROBABILITY
Multinomial Naïve Bayes Model: one count per term for each term occurrence (cf. term frequency)
Multivariate Bernoulli Model: one count per term for each document it occurs in (cf. document frequency); also takes into account non-occurrence

FEATURE SELECTION
χ² statistic approach
Information-theoretic approach



Overall Approach


Estimate parameters from the training corpus using add-one smoothing

For a new document, for each class, compute sum of (i) log of prior and (ii) logs of conditional probabilities of the terms

Assign the document to the class with the largest score


Naive Bayes: Training

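The training pseudocode itself (IIR Figure 13.2) is not reproduced in this transcript. Below is a sketch of multinomial NB training with add-one smoothing in that spirit; the function and variable names are my own, not from the slides:

```python
import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (tokens, class) pairs. Returns (vocabulary, log priors, log conditionals)."""
    vocab = sorted({t for tokens, _ in docs for t in tokens})
    n_docs = len(docs)
    log_prior, log_cond = {}, defaultdict(dict)
    for c in {label for _, label in docs}:
        class_docs = [tokens for tokens, label in docs if label == c]
        log_prior[c] = math.log(len(class_docs) / n_docs)
        # Concatenate all documents of class c and count term occurrences (T_ct).
        tct = Counter(t for tokens in class_docs for t in tokens)
        denom = sum(tct.values()) + len(vocab)  # add-one smoothing denominator
        for t in vocab:
            log_cond[c][t] = math.log((tct[t] + 1) / denom)
    return vocab, log_prior, log_cond
```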


Naive Bayes: Testing

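Similarly, the testing pseudocode (ApplyMultinomialNB in IIR Figure 13.2) is not reproduced; a sketch that consumes the model produced by the training sketch above (names are mine):

```python
def apply_multinomial_nb(log_prior, log_cond, tokens):
    """Score each class by log prior plus the summed log conditionals of the test
    document's tokens, ignoring out-of-vocabulary tokens; return the argmax class."""
    scores = {}
    for c in log_prior:
        scores[c] = log_prior[c]
        for t in tokens:
            if t in log_cond[c]:
                scores[c] += log_cond[c][t]
    return max(scores, key=scores.get)

# Usage, given a model from the training sketch on the previous slide:
#   vocab, log_prior, log_cond = train_multinomial_nb(training_docs)
#   apply_multinomial_nb(log_prior, log_cond, "chinese chinese chinese tokyo japan".split())
```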


Exercise


Estimate the parameters of a Naive Bayes classifier and classify the test document.


Example: Parameter estimates


The denominators are (8 + 6) and (3 + 6) because the lengths of text_c and text_c̄ are 8 and 3, respectively, and because the constant B is 6, as the vocabulary consists of six terms.
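The estimates themselves are not reproduced in this transcript; assuming the standard training set of IIR §13.2 (which is what the denominators above correspond to: |text_c| = 8, |text_c̄| = 3, B = 6), they would be:

\hat{P}(\text{Chinese} \mid c) = (5 + 1)/(8 + 6) = 3/7
\hat{P}(\text{Japan} \mid c) = \hat{P}(\text{Tokyo} \mid c) = (0 + 1)/(8 + 6) = 1/14
\hat{P}(\text{Chinese} \mid \bar{c}) = (1 + 1)/(3 + 6) = 2/9
\hat{P}(\text{Japan} \mid \bar{c}) = \hat{P}(\text{Tokyo} \mid \bar{c}) = (1 + 1)/(3 + 6) = 2/9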


Example: Classification


Thus, the classifier assigns the test document to c = China. The reason for this classification decision is that the three occurrences of the positive indicator CHINESE in d5 outweigh the occurrences of the two negative indicators JAPAN and TOKYO.


Time complexity of Naive Bayes


Notation: L_ave = average length of a training doc, L_a = length of the test doc, M_a = number of distinct terms in the test doc, D = training set, V = vocabulary, C = set of classes.

Θ(|D| L_ave) is the time it takes to compute all counts.
Θ(|C| |V|) is the time it takes to compute the parameters from the counts.
Generally: |C| |V| < |D| L_ave.
Test time is also linear (in the length of the test document).
Thus: Naive Bayes is linear in the size of the training set (training) and the test document (testing). This is optimal.


Bernoulli: Training



Bernoulli: Testing

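As with the multinomial case, the Bernoulli training/testing pseudocode (IIR Figure 13.5) is not reproduced here; the sketch below follows the same conventions as the earlier multinomial sketches (function names are mine):

```python
import math
from collections import defaultdict

def train_bernoulli_nb(docs):
    """docs: list of (tokens, class). P(t|c) = (# docs of c containing t + 1) / (# docs of c + 2)."""
    vocab = sorted({t for tokens, _ in docs for t in tokens})
    n_docs = len(docs)
    prior, cond = {}, defaultdict(dict)
    for c in {label for _, label in docs}:
        class_docs = [set(tokens) for tokens, label in docs if label == c]
        prior[c] = len(class_docs) / n_docs
        for t in vocab:
            df = sum(t in d for d in class_docs)           # document frequency within class c
            cond[c][t] = (df + 1) / (len(class_docs) + 2)  # B = 2: occurrence vs. non-occurrence
    return vocab, prior, cond

def apply_bernoulli_nb(vocab, prior, cond, tokens):
    """Unlike the multinomial model, vocabulary terms absent from the test document
    also contribute, via the factor (1 - P(t|c))."""
    present = set(tokens)
    scores = {}
    for c in prior:
        score = math.log(prior[c])
        for t in vocab:
            score += math.log(cond[c][t] if t in present else 1.0 - cond[c][t])
        scores[c] = score
    return max(scores, key=scores.get)
```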


Exercise


Estimate the parameters of a Bernoulli classifier and classify the test document.


Example: Parameter estimates


The denominators are (3 + 2) and (1 + 2) because there are three documents in c and one document in c̄, and because the constant B is 2, as there are two cases to consider for each term: occurrence and non-occurrence. (With no information, the estimated probability is 0.5.)
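For reference, assuming the same IIR §13.3 training set (three documents in c, one in c̄, B = 2, matching the fractions quoted in the classification step that follows), the estimates would be:

\hat{P}(\text{Chinese} \mid c) = (3 + 1)/(3 + 2) = 4/5
\hat{P}(\text{Japan} \mid c) = \hat{P}(\text{Tokyo} \mid c) = (0 + 1)/(3 + 2) = 1/5
\hat{P}(\text{Chinese} \mid \bar{c}) = (1 + 1)/(1 + 2) = 2/3
\hat{P}(\text{Japan} \mid \bar{c}) = \hat{P}(\text{Tokyo} \mid \bar{c}) = (1 + 1)/(1 + 2) = 2/3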


Example: Classification


Thus, the classifier assigns the test document to c̄ = not-China. When looking only at binary occurrence and not at term frequency, JAPAN and TOKYO are indicators for c̄ (2/3 > 1/5), and the conditional probabilities of CHINESE for c and c̄ are not different enough (4/5 vs. 2/3) to affect the classification decision.


Why does Naive Bayes work?


Naive Bayes can work well even though conditional independence assumptions are badly violated.

Example: the NB probability estimates can be far from the true class probabilities, yet the class with the largest estimate is still the correct one.

Classification is about predicting the correct class and not about accurately estimating probabilities.

Correct estimation ⇒ accurate prediction, but not vice versa!


Naive Bayes is not so naive


Naive Bayes has won some bakeoffs (e.g., KDD-CUP 97)
More robust to nonrelevant features than some more complex learning methods
More robust to concept drift (changing of the definition of a class over time) than some more complex learning methods
Better than methods like decision trees when we have many equally important features
A good dependable baseline for text classification
Optimal if independence assumptions hold (never true for text, but true for some domains)
Very fast
Low storage requirements

Feature Selection: Why?

Text collections have a large number of features: 10,000 – 1,000,000 unique words … and more.

Feature selection:
Makes using a particular classifier feasible (some classifiers can't deal with 100,000s of features)
Reduces training time (training time for some methods is quadratic or worse in the number of features)
Can improve generalization (performance): eliminates noise features, avoids overfitting


Feature selection: how?

Two ideas:

1) Hypothesis-testing statistics: are we confident that the value of one categorical variable is associated with the value of another? The chi-square (χ²) test.

2) Information theory: how much information does the value of one categorical variable give you about the value of another? Mutual information (MI).

They're similar, but χ² measures confidence in the association (based on available statistics), while MI measures the extent of the association (assuming perfect knowledge of the probabilities).


χ² statistic (CHI)

The χ² test is used to test the independence of two events, which for us are the occurrence of a term and the occurrence of a class:

\chi^2(D, t, c) = \sum_{e_t \in \{0,1\}} \sum_{e_c \in \{0,1\}} \frac{(N_{e_t e_c} - E_{e_t e_c})^2}{E_{e_t e_c}}

where N_{e_t e_c} = observed frequency and E_{e_t e_c} = expected frequency.


Example from Reuters-RCV1 (term: export, class: poultry)


Computing expected frequency


E_{e_t e_c} = N \cdot P(e_t) \cdot P(e_c), e.g., E_{11} = N \cdot \frac{N_{11} + N_{10}}{N} \cdot \frac{N_{11} + N_{01}}{N},

where N = 49 + 141 + 27,652 + 774,106 = 801,948

Observed frequencies and computed expected frequencies


[Table not reproduced. Following the standard IIR Reuters-RCV1 example, the observed counts are N_{11} = 49 (export present, class poultry), N_{01} = 141 (export absent, poultry), N_{10} = 27,652 (export present, not poultry), N_{00} = 774,106 (export absent, not poultry); N = 49 + 141 + 27,652 + 774,106 = 801,948.]

Interpretation of CHI-squared results


A χ² value of 284 ≫ 10.83 implies that poultry and export are dependent with 99.9% certainty, since the probability of them being independent is less than 0.1%.
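A small sketch that reproduces this χ² value from the observed counts; the cell layout follows the standard IIR export/poultry example, and the function name is mine:

```python
def chi_square(n11, n10, n01, n00):
    """Chi-square statistic for a 2x2 term/class contingency table.
    First subscript: term present (1) or absent (0); second: in class (1) or not (0)."""
    n = n11 + n10 + n01 + n00
    chi2 = 0.0
    # Expected count of each cell = N * P(term value) * P(class value) = row total * column total / N.
    for observed, row, col in [(n11, n11 + n10, n11 + n01),
                               (n10, n11 + n10, n10 + n00),
                               (n01, n01 + n00, n11 + n01),
                               (n00, n01 + n00, n10 + n00)]:
        expected = row * col / n
        chi2 += (observed - expected) ** 2 / expected
    return chi2

# export / poultry: 49 + 141 + 27,652 + 774,106 = 801,948 documents in total.
print(round(chi_square(n11=49, n10=27_652, n01=141, n00=774_106)))  # ~ 284
```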

Feature selection via Mutual Information

In the training set, choose the k words which best discriminate (give the most information on) the categories.

The mutual information between a word w and a class c is:

I(w, c) = \sum_{e_w \in \{0,1\}} \sum_{e_c \in \{0,1\}} p(e_w, e_c) \log \frac{p(e_w, e_c)}{p(e_w)\, p(e_c)}

computed for each word w and each category c.


Formula for computation (Equation 13.17)


I(U; C) = \frac{N_{11}}{N} \log_2 \frac{N N_{11}}{N_{1.} N_{.1}} + \frac{N_{01}}{N} \log_2 \frac{N N_{01}}{N_{0.} N_{.1}} + \frac{N_{10}}{N} \log_2 \frac{N N_{10}}{N_{1.} N_{.0}} + \frac{N_{00}}{N} \log_2 \frac{N N_{00}}{N_{0.} N_{.0}}

where the first subscript is e_t and the second is e_c; e.g., N_{10} is the number of documents that contain the term t (e_t = 1) and are not in class c (e_c = 0), and N_{1.} = N_{10} + N_{11} is the number of documents that contain the term t (e_t = 1).
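Equation 13.17 as a sketch in the same document-count notation (the function name is mine); it can be run on the export/poultry counts from the χ² example:

```python
import math

def mutual_information(n11, n10, n01, n00):
    """Expected mutual information of term occurrence (first subscript, e_t) and
    class membership (second subscript, e_c), computed from document counts."""
    n = n11 + n10 + n01 + n00
    n1_, n0_ = n11 + n10, n01 + n00   # term present / absent
    n_1, n_0 = n11 + n01, n10 + n00   # in class / not in class
    mi = 0.0
    for n_cell, n_t, n_c in [(n11, n1_, n_1), (n01, n0_, n_1),
                             (n10, n1_, n_0), (n00, n0_, n_0)]:
        if n_cell:  # treat 0 * log(0) as 0
            mi += (n_cell / n) * math.log2(n * n_cell / (n_t * n_c))
    return mi

print(mutual_information(n11=49, n10=27_652, n01=141, n00=774_106))
```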

Example


Feature selection via MI (contd.)

For each category we build a list of the k most discriminating terms. For example (on 20 Newsgroups):

sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, …

rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, …

Greedy: does not account for correlations between terms


Features with High MI scores for 6 Reuters-RCV1 classes


Feature Selection

Mutual Information: clear information-theoretic interpretation, but may select rare uninformative terms.
Chi-square: statistical foundation, but may select very slightly informative frequent terms that are not very useful for classification.
Just use the commonest terms? No particular foundation, but in practice this is often 90% as good.


Feature selection for NB: Overall Significance

In general, feature selection is necessary for multivariate Bernoulli NB; otherwise, you suffer from noise and multi-counting.

"Feature selection" really means something different for multinomial NB: it means dictionary truncation. This "feature selection" normally isn't needed for multinomial NB.


SpamAssassin

Naïve Bayes has found a home in spam filtering
Paul Graham's A Plan for Spam
A mutant with more mutant offspring...
A Naive Bayes-like classifier with weird parameter estimation
Widely used in spam filters
Classic Naive Bayes superior when appropriately used, according to David D. Lewis
But also many other things: black hole lists, etc.

Many email topic filters also use NB classifiers.