
Naïve Bayes, Maxent and Neural Models

CMSC 473/673

UMBC

Some slides adapted from 3SLP

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

Probabilistic Classification

Discriminatively trained classifier: directly model the posterior

Generatively trained classifier: model the posterior with Bayes rule

Posterior Classification/Decoding: maximum a posteriori

Noisy Channel Model Decoding

Posterior Decoding: Probabilistic Text Classification

Assigning subject categories, topics, or genres

Spam detection

Authorship identification

Age/gender identification

Language Identification

Sentiment analysis

p(class | observed data) = p(observed data | class) * p(class) / p(observed data)

p(observed data | class): class-based likelihood (language model)
p(class): prior probability of the class
p(observed data): observation likelihood (averaged over all classes)

Noisy Channel Model

what I want to tell you: "sports"
what you actually see: "The Os lost again…"

Decode ➔ Rerank

hypothesized intents: "sad stories", "sports"
reweight according to what's likely ➔ "sports"

Noisy Channel

Machine translation

Speech-to-text

Spelling correction

Text normalization

Part-of-speech tagging

Morphological analysis

Image captioning

p(possible (clean) output | observed (noisy) text) ∝ p(noisy text | clean output) * p(clean output)

p(clean output): (clean) language model
p(noisy text | clean output): observation (noisy) likelihood, i.e. the translation/decode model

Use Logarithms
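Why logarithms help, in a short Python sketch (my own illustration, not from the slides): multiplying many small probabilities underflows, while adding their logs stays stable.

import math

probs = [0.1, 0.001, 0.01, 0.005, 0.1]        # hypothetical per-word likelihoods

product = 1.0
for p in probs:
    product *= p            # 5e-10 here, but underflows to 0.0 for long texts

log_score = sum(math.log(p) for p in probs)   # stable sum of logs
print(product, math.exp(log_score))           # both ≈ 5e-10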

Accuracy, Precision, and Recall

Accuracy: % of items correct

Precision: % of selected items that are correct

Recall: % of correct items that are selected

                           Actually Correct      Actually Incorrect
Selected/guessed           True Positive (TP)    False Positive (FP)
Not selected/not guessed   False Negative (FN)   True Negative (TN)

A combined measure: F

Weighted (harmonic) mean of Precision & Recall:
F_β = (1 + β²) * P * R / (β² * P + R)

Balanced F1 measure (β = 1): F1 = 2 * P * R / (P + R)
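A quick illustration (my own sketch, not part of the slides) of precision, recall, and F1 from TP/FP/FN counts:

def precision_recall_f1(tp, fp, fn):
    # precision: fraction of selected items that are correct
    precision = tp / (tp + fp) if tp + fp else 0.0
    # recall: fraction of correct items that are selected
    recall = tp / (tp + fn) if tp + fn else 0.0
    # balanced F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1(tp=8, fp=2, fn=4))   # ≈ (0.8, 0.667, 0.727)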

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

The Bag of Words Representation

A document is reduced to the bag (multiset) of words it contains; order is ignored.

Bag of Words Representation

γ( document ) = c    (the classifier γ maps the bag of words to a class c)

Example word counts: seen 2, sweet 1, whimsical 1, recommend 1, happy 1, …

Naïve Bayes Classifier

Start with Bayes rule (for a label x and text y):
p(x | y) = p(y | x) * p(x) / p(y)

Q: Are we doing discriminative training or generative training?

A: generative training

Naïve Bayes Classifier

Adopt the naïve bag-of-words representation: label each word (token)

Assume position doesn't matter

Assume the feature probabilities are independent given the class:
p(w1, …, wn | c) = ∏_i p(wi | c)

Multinomial Naïve Bayes: Learning

From the training corpus, extract the Vocabulary

Calculate the P(cj) terms:
For each cj in C: docsj = all docs with class = cj

Brill and Banko (2001)

With enough data, the classifier may not matter

Multinomial Naïve Bayes: Learning

From the training corpus, extract the Vocabulary

Calculate the P(cj) terms:
For each cj in C: docsj = all docs with class = cj

Calculate the P(wk | cj) terms:
Textj = a single doc containing all of docsj
For each word wk in Vocabulary: nk = # of occurrences of wk in Textj

p(wk | cj) = nk / (total # of word tokens in Textj), a class unigram LM (in practice smoothed, e.g. with add-1 pseudocounts)
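A minimal Python sketch of this procedure (my own illustration, not from the slides; the data format and add-1 smoothing are assumptions):

import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (list_of_tokens, class_label) pairs."""
    vocab = {w for tokens, _ in docs for w in tokens}
    class_docs = defaultdict(list)
    for tokens, c in docs:
        class_docs[c].append(tokens)

    log_prior, log_likelihood = {}, {}
    for c, doc_list in class_docs.items():
        log_prior[c] = math.log(len(doc_list) / len(docs))        # P(c_j)
        text_j = Counter(w for tokens in doc_list for w in tokens)
        total = sum(text_j.values())
        # class unigram LM with add-1 smoothing over the vocabulary
        log_likelihood[c] = {w: math.log((text_j[w] + 1) / (total + len(vocab)))
                             for w in vocab}
    return vocab, log_prior, log_likelihood

def classify(tokens, vocab, log_prior, log_likelihood):
    # MAP decoding: argmax over classes of log P(c) + sum of log P(w | c)
    scores = {c: log_prior[c] + sum(log_likelihood[c][w] for w in tokens if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)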

Naïve Bayes and Language Modeling

Naïve Bayes classifiers can use any sort of feature

But if, as in the previous slides

We use only word features

we use all of the words in the text (not a subset)

Then

Naïve Bayes has an important similarity to language modeling

Naïve Bayes as a Language Model    (Sec. 13.2.1)

word   Positive Model   Negative Model
I      0.1              0.2
love   0.1              0.001
this   0.01             0.01
fun    0.05             0.005
film   0.1              0.1

Which class assigns the higher probability to s = "I love this fun film"?

P(s | pos) = 0.1 * 0.1 * 0.01 * 0.05 * 0.1
P(s | neg) = 0.2 * 0.001 * 0.01 * 0.005 * 0.1

5e-7 ≈ P(s|pos) > P(s|neg) ≈ 1e-9
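The same comparison as a short Python sketch (my own illustration of the slide's numbers):

import math

pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}
sentence = ["I", "love", "this", "fun", "film"]

# class-conditional probability of the sentence under each unigram LM,
# accumulated in log space to avoid underflow
log_p_pos = sum(math.log(pos[w]) for w in sentence)
log_p_neg = sum(math.log(neg[w]) for w in sentence)

print(math.exp(log_p_pos))   # ~5e-07
print(math.exp(log_p_neg))   # ~1e-09, so the positive model assigns s higher probability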

Summary: Naïve Bayes is Not So Naïve

Very Fast, low storage requirements

Robust to Irrelevant Features

Very good in domains with many equally important features

Optimal if the independence assumptions hold

Dependable baseline for text classification (but often not the best)

But: Naïve Bayes Isn’t Without Issue

Model the posterior in one go?

Are the features really uncorrelated?

Are plain counts always appropriate?

Are there “better” ways of handling missing/noisy data? (automated, more principled)

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

Connections to Other Techniques

Log-Linear Models go by many names, depending on how they are viewed:

(Multinomial) logistic regression / softmax regression (as statistical regression)
Maximum Entropy models (MaxEnt) (based in information theory)
a form of Generalized Linear Models
viewed as Discriminative Naïve Bayes
very shallow (sigmoidal) neural nets (to be cool today :) )

Maxent Models for Classification: Discriminatively or Generatively Trained

Discriminatively trained classifier: directly model the posterior

Generatively trained classifier: model the posterior with Bayes rule

Maximum Entropy (Log-linear) Models

discriminatively trained: classify in one go

Maximum Entropy (Log-linear) Models

generatively trained: learn to model language

Document Classification

p(ATTACK | doc), where doc = "Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region."

Template fields to fill for an ATTACK event:

• # killed:
• Type:
• Perp:

Individual cues in the text point to the label, e.g. "shot" ➔ ATTACK; phrases like "fatally shot", "seriously wounded", and "Shining Path" each provide evidence.

We need to score the different combinations.

Score and Combine Our Possibilities

score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)
…
scorek(department, ATTACK)

COMBINE ➔ posterior probability of ATTACK

are all of these uncorrelated?

Q: What are the score and combine functions for Naïve Bayes?

Scoring Our Possibilities

score(doc, ATTACK) =
  score1(fatally shot, ATTACK)
  score2(seriously wounded, ATTACK)
  score3(Shining Path, ATTACK)
  …

(for the example document above)

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/

https://goo.gl/BQCdH9

Lesson 1

SNAP(score(doc, ATTACK))

Maxent Modeling

p(ATTACK | doc) ∝ SNAP(score(doc, ATTACK))

What function…
operates on any real number?
is never less than 0?

f(x) = exp(x)

exp(score(doc, ATTACK))

Maxent Modeling

p(ATTACK | doc) ∝ exp( score1(fatally shot, ATTACK)
                     + score2(seriously wounded, ATTACK)
                     + score3(Shining Path, ATTACK)
                     + … )

Maxent Modeling

Learn the scores (but we'll declare what combinations should be looked at)

p(ATTACK | doc) ∝ exp( weight1 * occurs1(fatally shot, ATTACK)
                     + weight2 * occurs2(seriously wounded, ATTACK)
                     + weight3 * occurs3(Shining Path, ATTACK)
                     + … )

Maxent Modeling: Feature Functions

Feature functions help extract useful features (characteristics) of the data

Generally templated

Often binary-valued (0 or 1), but can be real-valued

occurs_{target,type}(fatally shot, ATTACK) =
  1, if target == fatally shot and type == ATTACK
  0, otherwise
(binary)

More on Feature Functions

Feature functions help extract useful features (characteristics) of the data; generally templated; often binary-valued (0 or 1), but can be real-valued

Templated binary:
occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, else 0

Templated real-valued:
occurs_{target,type}(fatally shot, ATTACK) = log p(fatally shot | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type)

Non-templated real-valued:
occurs(fatally shot, ATTACK) = log p(fatally shot | ATTACK)

Non-templated count-valued:
occurs(fatally shot, ATTACK) = count(fatally shot, ATTACK)
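A small Python sketch of a templated binary feature and a weighted score (my own illustration; the phrases, labels, and weights are hypothetical):

def make_binary_feature(target, label):
    """Template: fires (returns 1) only when both the phrase and the label match."""
    def occurs(phrase, candidate_label):
        return 1.0 if phrase == target and candidate_label == label else 0.0
    return occurs

features = [                                               # (feature function, weight)
    (make_binary_feature("fatally shot", "ATTACK"), 1.8),
    (make_binary_feature("seriously wounded", "ATTACK"), 1.1),
    (make_binary_feature("Shining Path", "ATTACK"), 2.3),
]

def score(phrases, candidate_label):
    # linear score: sum of weight * feature value over all phrase/feature pairs
    return sum(w * f(p, candidate_label) for p in phrases for f, w in features)

print(score(["fatally shot", "Shining Path"], "ATTACK"))   # 1.8 + 2.3 = 4.1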

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region .

p( | ) =ATTACK

exp( ))weight1 * applies1(fatally shot, ATTACK)

weight2 * applies2(seriously wounded, ATTACK)

weight3 * applies3(Shining Path, ATTACK)…

Maxent Modeling

1

Z

Q: How do we define Z?

exp( )…

Σlabel x

Z =Normalization for Classification

𝑝 𝑥 𝑦) ∝ exp(𝜃 ⋅ 𝑓 𝑥, 𝑦 ) classify doc y with label x in one go

weight1 * occurs1(fatally shot, ATTACK)

weight2 * occurs2(seriously wounded, ATTACK)

weight3 * occurs3(Shining Path, ATTACK)
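A compact Python sketch of this normalized classifier (my own illustration; the feature functions, weights, and label set are hypothetical):

import math

def maxent_probs(doc_phrases, labels, features):
    """features: list of (feature_fn, weight); feature_fn(phrase, label) -> float.
    Returns p(label | doc) = exp(theta . f) / Z over the given label set."""
    scores = {x: sum(w * f(p, x) for p in doc_phrases for f, w in features)
              for x in labels}
    Z = sum(math.exp(s) for s in scores.values())       # the normalizer
    return {x: math.exp(s) / Z for x, s in scores.items()}

# hypothetical binary features that fire on exact (phrase, label) matches
features = [
    (lambda p, x: 1.0 if (p, x) == ("fatally shot", "ATTACK") else 0.0, 1.8),
    (lambda p, x: 1.0 if (p, x) == ("Shining Path", "ATTACK") else 0.0, 2.3),
]
print(maxent_probs(["fatally shot", "Shining Path"], ["ATTACK", "OTHER"], features))
# ATTACK gets exp(4.1) / (exp(4.1) + exp(0)) ≈ 0.98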

Normalization for Language Model

a general class-based (X) language model of doc y

Computing Z can be significantly harder in the general case

Simplifying assumption: maxent n-grams!

Understanding Conditioning

Is this a good language model? (no)

Is this a good posterior classifier? (no)

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/

https://goo.gl/BQCdH9

Lesson 11

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

pθ(x | y): probabilistic model

objective (given observations)

Objective = Full Likelihood?

L(θ) = ∏_i pθ(xi | yi)

These values can have very small magnitude ➔ underflow

Differentiating this product could be a pain

Logarithms

(0, 1] ➔ (-∞, 0]

Products ➔ Sums

log(ab) = log(a) + log(b)

log(a/b) = log(a) – log(b)

Inverse of exp

log(exp(x)) = x

Log-Likelihood

log L(θ) = Σ_i log pθ(xi | yi) = F(θ)

Wide range of (negative) numbers; sums are more stable

Products ➔ Sums

log(ab) = log(a) + log(b)
log(a/b) = log(a) – log(b)

Inverse of exp: log(exp(x)) = x

Differentiating this becomes nicer (even though Z depends on θ)
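A minimal sketch (my own, not from the slides) of computing this log-likelihood for a maxent classifier, F(θ) = Σ_i [θ ⋅ f(xi, yi) - log Z(yi)], using the log-sum-exp trick for stability:

import math

def log_sum_exp(vals):
    m = max(vals)                      # subtract the max for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in vals))

def log_likelihood(theta, data, labels, feat):
    """data: list of (x_i, y_i) = (gold label, observation); feat(x, y) -> list of floats."""
    total = 0.0
    for x_i, y_i in data:
        score = lambda x: sum(t * f for t, f in zip(theta, feat(x, y_i)))
        # log p_theta(x_i | y_i) = theta.f(x_i, y_i) - log Z(y_i)
        total += score(x_i) - log_sum_exp([score(x) for x in labels])
    return total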

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

How will we optimize F(θ)?

Calculus

[Plot: F(θ) as a function of θ, with its maximum at θ*, where the derivative F'(θ) is zero]

Example

F(x) = -(x-2)²

differentiate: F'(x) = -2x + 4

Solve F'(x) = 0 ➔ x = 2

Common Derivative Rules

[Plot: F(θ), its derivative F'(θ) with respect to θ, and the maximizer θ*]

What if you can't find the roots? Follow the derivative.

Set t = 0; pick a starting value θt
Until converged:
1. Get value yt = F(θt)
2. Get derivative gt = F'(θt)
3. Get scaling factor ρt
4. Set θt+1 = θt + ρt * gt
5. Set t += 1

[Plot: successive iterates θ0, θ1, θ2, θ3 with values y0, y1, y2, y3 and derivatives g0, g1, g2 climbing F(θ) toward θ*]
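A direct Python sketch of this loop (my own illustration), applied to the earlier example F(θ) = -(θ-2)², whose maximizer is θ* = 2:

def F(theta):
    return -(theta - 2) ** 2

def F_prime(theta):
    return -2 * theta + 4

theta = 0.0          # starting value theta_0
rho = 0.1            # fixed scaling factor (step size)
for t in range(100):
    g = F_prime(theta)          # derivative at the current point
    theta = theta + rho * g     # step uphill along the derivative
    if abs(g) < 1e-6:           # converged: derivative is (almost) zero
        break

print(theta, F(theta))          # ~2.0, ~0.0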

Gradient = Multi-variable derivative

K-dimensional input

K-dimensional output

Gradient Ascent

[Figure sequence illustrating gradient ascent steps]

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

Expectations

number of pieces of candy: 1, 2, 3, 4, 5, 6

With probabilities 1/6 each:
1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5

With probabilities 1/2, 1/10, 1/10, 1/10, 1/10, 1/10:
1/2 * 1 + 1/10 * 2 + 1/10 * 3 + 1/10 * 4 + 1/10 * 5 + 1/10 * 6 = 2.5
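The same computation as a short Python sketch (my own illustration):

values = [1, 2, 3, 4, 5, 6]                      # number of pieces of candy
uniform = [1/6] * 6
skewed = [1/2, 1/10, 1/10, 1/10, 1/10, 1/10]

expectation = lambda probs: sum(p * v for p, v in zip(probs, values))
print(expectation(uniform))   # 3.5
print(expectation(skewed))    # 2.5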

Log-Likelihood

log L(θ) = Σ_i log pθ(xi | yi) = F(θ)

Wide range of (negative) numbers; sums are more stable

Differentiating this becomes nicer (even though Z depends on θ)

Log-Likelihood Gradient

Each component k is the difference between:

the total value of feature fk in the training data

and

the total value the current model pθ thinks it computes for feature fk

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/

https://goo.gl/BQCdH9

Lesson 6

Log-Likelihood Gradient Derivation

use the (calculus) chain rule:
∂/∂θ log g(h(θ)) = (∂ log g / ∂h) ⋅ (∂h / ∂θ)

When applied to the log-normalizer, the first factor works out to a scalar, p(x' | yi), and ∂h/∂θ is a vector of (feature) functions.

Log-Likelihood Gradient Derivation

Do we want these to fully match?

What does it mean if they do?

What if we have missing values in our data?

Gradient Optimization

Set t = 0; pick a starting value θt
Until converged:
1. Get value yt = F(θt)
2. Get derivative gt = F'(θt)
3. Get scaling factor ρt
4. Set θt+1 = θt + ρt * gt
5. Set t += 1

∂F/∂θk = Σ_i fk(xi, yi) − Σ_i Σ_{y'} fk(xi, y') p(y' | xi)

Do we want these to fully match?

What does it mean if they do?

What if we have missing values in our data?
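A small Python sketch of this gradient, written for the p(x | y) classification setting used earlier (label x, document y); the data format and helper names are my own illustration:

import math

def grad_log_likelihood(theta, data, labels, feat):
    """data: list of (x_i, y_i) = (gold label, document); feat(x, y) -> list of floats.
    Component k = observed value of f_k minus the model's expected value of f_k."""
    K = len(theta)
    grad = [0.0] * K
    for x_i, y_i in data:
        # p_theta(x' | y_i) for every candidate label x'
        scores = {x: sum(t * f for t, f in zip(theta, feat(x, y_i))) for x in labels}
        m = max(scores.values())
        Z = sum(math.exp(s - m) for s in scores.values())
        post = {x: math.exp(s - m) / Z for x, s in scores.items()}
        obs = feat(x_i, y_i)
        for k in range(K):
            grad[k] += obs[k]                                            # observed count
            grad[k] -= sum(post[x] * feat(x, y_i)[k] for x in labels)    # expected count
    return grad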

Preventing Extreme Values

Naïve Bayes: extreme values are 0 probabilities

Log-linear models: extreme values are large θ values ➔ regularization

(Squared) L2 Regularization: penalize large weights by subtracting a term proportional to ||θ||² = Σ_k θk² from the objective F(θ)
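A tiny sketch of what this adds to the gradient-based update (my own illustration; λ is a hypothetical regularization strength):

lam = 0.1   # hypothetical regularization strength (lambda)

def l2_penalty(theta):
    # subtracted from the objective: maximize F(theta) - lam * ||theta||^2
    return lam * sum(t * t for t in theta)

def l2_penalty_grad(theta):
    # so each gradient component becomes dF/dtheta_k - 2 * lam * theta_k
    return [2 * lam * t for t in theta]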

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/

https://goo.gl/BQCdH9

Lesson 8

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

Revisiting the SNAP Function

softmax: softmax(z)i = exp(zi) / Σj exp(zj)
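A numerically stable version in Python (my own sketch, not from the slides):

import math

def softmax(z):
    m = max(z)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))   # ≈ [0.66, 0.24, 0.10]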

N-gram Language Models

predict the next word wi given some context wi-3, wi-2, wi-1 …

compute beliefs about what is likely…

p(wi | wi-3, wi-2, wi-1) ∝ count(wi-3, wi-2, wi-1, wi)

Maxent Language Models

predict the next word wi given some context wi-3, wi-2, wi-1 …

compute beliefs about what is likely…

p(wi | wi-3, wi-2, wi-1) ∝ softmax(θ ⋅ f(wi-3, wi-2, wi-1, wi))

Neural Language Models

predict the next word wi given some context wi-3, wi-2, wi-1 …

compute beliefs about what is likely…

p(wi | wi-3, wi-2, wi-1) ∝ softmax(θ ⋅ f(wi-3, wi-2, wi-1, wi))
can we learn the feature function(s)?

p(wi | wi-3, wi-2, wi-1) ∝ softmax(θwi ⋅ f(wi-3, wi-2, wi-1))
can we learn the feature function(s) for just the context?
can we learn word-specific weights (by type)?

create/use "distributed representations": embeddings ei-3, ei-2, ei-1 for the context words, and ew (the output weights θwi) per word type

combine these representations via a matrix-vector product with C to get f
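A rough Python/numpy sketch of a forward pass for this kind of model (my own simplification of the Bengio et al. style architecture; the dimensions, the tanh nonlinearity, and all variable names are illustrative assumptions):

import numpy as np

V, d, h, n = 10_000, 30, 100, 3          # vocab size, embedding dim, hidden dim, context length
rng = np.random.default_rng(0)

E = rng.normal(size=(V, d)) * 0.01       # input embeddings ("distributed representations")
C = rng.normal(size=(h, n * d)) * 0.01   # combines the context embeddings (matrix-vector product)
theta = rng.normal(size=(V, h)) * 0.01   # per-word output weights theta_{w_i}

def next_word_probs(context_ids):
    e = np.concatenate([E[i] for i in context_ids])   # [e_{i-3}; e_{i-2}; e_{i-1}]
    f = np.tanh(C @ e)                                 # learned feature function of the context
    scores = theta @ f                                 # theta_w . f for every word w in the vocab
    scores -= scores.max()                             # numerical stability
    p = np.exp(scores)
    return p / p.sum()                                 # softmax over the vocabulary

p = next_word_probs([17, 4211, 905])                   # hypothetical word ids for w_{i-3..i-1}
print(p.shape, p.sum())                                # (10000,) 1.0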

“A Neural Probabilistic Language Model,” Bengio et al. (2003)

Baselines:

LM Name               N-gram   Params.       Test Ppl.
Interpolation         3        ---           336
Kneser-Ney backoff    3        ---           323
Kneser-Ney backoff    5        ---           321
Class-based backoff   3        500 classes   312
Class-based backoff   5        500 classes   312

NPLM:

N-gram   Word Vector Dim.   Hidden Dim.   Mix with non-neural LM   Test Ppl.
5        60                 50            No                       268
5        60                 50            Yes                      257
5        30                 100           No                       276
5        30                 100           Yes                      252

“we were not able to see signs of over-fitting (on the validation set), possibly because we ran only 5 epochs (over 3 weeks using 40 CPUs)” (Sect. 4.2)