
Naïve Bayes, Maxent and Neural Models

CMSC 473/673

UMBC

Some slides adapted from 3SLP

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

Probabilistic Classification

Discriminatively trained classifier: directly model the posterior

Generatively trained classifier: model the posterior with Bayes rule

Posterior Classification/Decoding: maximum a posteriori

Noisy Channel Model Decoding

Posterior Decoding: Probabilistic Text Classification

Assigning subject categories, topics, or genres

Spam detection

Authorship identification

Age/gender identification

Language Identification

Sentiment analysis

p(class | observed data) = p(observed data | class) * p(class) / p(observed data)

p(observed data | class): class-based likelihood (language model)
p(class): prior probability of the class
p(observed data): observation likelihood (averaged over all classes)

Noisy Channel Model

what I want to tell you: "sports"
what you actually see: "The Os lost again…"

Decode ➔ Rerank

hypothesized intents: "sad stories", "sports"
reweight according to what's likely ➔ "sports"

Noisy Channel

Machine translation

Speech-to-text

Spelling correction

Text normalization

Part-of-speech tagging

Morphological analysis

Image captioning

p(possible (clean) output | observed (noisy) text) ∝ p(noisy text | clean output) * p(clean output)

p(clean output): (clean) language model
p(noisy text | clean output): observation (noisy) likelihood, i.e. the translation/decode model

Use Logarithms
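Why logarithms help, in a short Python sketch (my own illustration, not from the slides): multiplying many small probabilities underflows, while adding their logs stays stable.

import math

probs = [0.1, 0.001, 0.01, 0.005, 0.1]        # hypothetical per-word likelihoods

product = 1.0
for p in probs:
    product *= p            # 5e-10 here, but underflows to 0.0 for long texts

log_score = sum(math.log(p) for p in probs)   # stable sum of logs
print(product, math.exp(log_score))           # both ≈ 5e-10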

Accuracy, Precision, and Recall

Accuracy: % of items correct

Precision: % of selected items that are correct

Recall: % of correct items that are selected

                           Actually Correct      Actually Incorrect
Selected/guessed           True Positive (TP)    False Positive (FP)
Not selected/not guessed   False Negative (FN)   True Negative (TN)

A combined measure: F

Weighted (harmonic) mean of Precision & Recall:
F_β = (1 + β²) * P * R / (β² * P + R)

Balanced F1 measure (β = 1): F1 = 2 * P * R / (P + R)
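A quick illustration (my own sketch, not part of the slides) of precision, recall, and F1 from TP/FP/FN counts:

def precision_recall_f1(tp, fp, fn):
    # precision: fraction of selected items that are correct
    precision = tp / (tp + fp) if tp + fp else 0.0
    # recall: fraction of correct items that are selected
    recall = tp / (tp + fn) if tp + fn else 0.0
    # balanced F1: harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

print(precision_recall_f1(tp=8, fp=2, fn=4))   # ≈ (0.8, 0.667, 0.727)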

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

The Bag of Words Representation

A document is reduced to the bag (multiset) of words it contains; order is ignored.

Bag of Words Representation

γ( document ) = c    (the classifier γ maps the bag of words to a class c)

Example word counts: seen 2, sweet 1, whimsical 1, recommend 1, happy 1, …

Naïve Bayes Classifier

Start with Bayes rule (for a label x and text y):
p(x | y) = p(y | x) * p(x) / p(y)

Q: Are we doing discriminative training or generative training?

A: generative training

Naïve Bayes Classifier

Adopt the naïve bag-of-words representation: label each word (token)

Assume position doesn't matter

Assume the feature probabilities are independent given the class:
p(w1, …, wn | c) = ∏_i p(wi | c)

Multinomial Naïve Bayes: Learning

From the training corpus, extract the Vocabulary

Calculate the P(cj) terms:
For each cj in C: docsj = all docs with class = cj

Brill and Banko (2001)

With enough data, the classifier may not matter

Multinomial Naïve Bayes: Learning

From the training corpus, extract the Vocabulary

Calculate the P(cj) terms:
For each cj in C: docsj = all docs with class = cj

Calculate the P(wk | cj) terms:
Textj = a single doc containing all of docsj
For each word wk in Vocabulary: nk = # of occurrences of wk in Textj

p(wk | cj) = nk / (total # of word tokens in Textj), a class unigram LM (in practice smoothed, e.g. with add-1 pseudocounts)
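A minimal Python sketch of this procedure (my own illustration, not from the slides; the data format and add-1 smoothing are assumptions):

import math
from collections import Counter, defaultdict

def train_multinomial_nb(docs):
    """docs: list of (list_of_tokens, class_label) pairs."""
    vocab = {w for tokens, _ in docs for w in tokens}
    class_docs = defaultdict(list)
    for tokens, c in docs:
        class_docs[c].append(tokens)

    log_prior, log_likelihood = {}, {}
    for c, doc_list in class_docs.items():
        log_prior[c] = math.log(len(doc_list) / len(docs))        # P(c_j)
        text_j = Counter(w for tokens in doc_list for w in tokens)
        total = sum(text_j.values())
        # class unigram LM with add-1 smoothing over the vocabulary
        log_likelihood[c] = {w: math.log((text_j[w] + 1) / (total + len(vocab)))
                             for w in vocab}
    return vocab, log_prior, log_likelihood

def classify(tokens, vocab, log_prior, log_likelihood):
    # MAP decoding: argmax over classes of log P(c) + sum of log P(w | c)
    scores = {c: log_prior[c] + sum(log_likelihood[c][w] for w in tokens if w in vocab)
              for c in log_prior}
    return max(scores, key=scores.get)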

Naïve Bayes and Language Modeling

Naïve Bayes classifiers can use any sort of feature

But if, as in the previous slides

We use only word features

we use all of the words in the text (not a subset)

Then

Naïve Bayes has an important similarity to language modeling

Naïve Bayes as a Language Model    (Sec. 13.2.1)

word   Positive Model   Negative Model
I      0.1              0.2
love   0.1              0.001
this   0.01             0.01
fun    0.05             0.005
film   0.1              0.1

Which class assigns the higher probability to s = "I love this fun film"?

P(s | pos) = 0.1 * 0.1 * 0.01 * 0.05 * 0.1
P(s | neg) = 0.2 * 0.001 * 0.01 * 0.005 * 0.1

5e-7 ≈ P(s|pos) > P(s|neg) ≈ 1e-9
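The same comparison as a short Python sketch (my own illustration of the slide's numbers):

import math

pos = {"I": 0.1, "love": 0.1, "this": 0.01, "fun": 0.05, "film": 0.1}
neg = {"I": 0.2, "love": 0.001, "this": 0.01, "fun": 0.005, "film": 0.1}
sentence = ["I", "love", "this", "fun", "film"]

# class-conditional probability of the sentence under each unigram LM,
# accumulated in log space to avoid underflow
log_p_pos = sum(math.log(pos[w]) for w in sentence)
log_p_neg = sum(math.log(neg[w]) for w in sentence)

print(math.exp(log_p_pos))   # ~5e-07
print(math.exp(log_p_neg))   # ~1e-09, so the positive model assigns s higher probability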

Summary: Naïve Bayes is Not So Naïve

Very Fast, low storage requirements

Robust to Irrelevant Features

Very good in domains with many equally important features

Optimal if the independence assumptions hold

Dependable baseline for text classification (but often not the best)

But: Naïve Bayes Isn’t Without Issue

Model the posterior in one go?

Are the features really uncorrelated?

Are plain counts always appropriate?

Are there “better” ways of handling missing/noisy data? (automated, more principled)

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

Connections to Other Techniques

Log-Linear Models go by many names, depending on how they are viewed:

(Multinomial) logistic regression / softmax regression (as statistical regression)
Maximum Entropy models (MaxEnt) (based in information theory)
a form of Generalized Linear Models
viewed as Discriminative Naïve Bayes
very shallow (sigmoidal) neural nets (to be cool today :) )

Maxent Models for Classification: Discriminatively or Generatively Trained

Discriminatively trained classifier: directly model the posterior

Generatively trained classifier: model the posterior with Bayes rule

Maximum Entropy (Log-linear) Models

discriminatively trained: classify in one go

Maximum Entropy (Log-linear) Models

generatively trained: learn to model language

Document Classification

p(ATTACK | doc), where doc = "Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junin department, central Peruvian mountain region."

Template fields to fill for an ATTACK event:

• # killed:
• Type:
• Perp:

Individual cues in the text point to the label, e.g. "shot" ➔ ATTACK; phrases like "fatally shot", "seriously wounded", and "Shining Path" each provide evidence.

We need to score the different combinations.

Score and Combine Our Possibilities

score1(fatally shot, ATTACK)
score2(seriously wounded, ATTACK)
score3(Shining Path, ATTACK)
…
scorek(department, ATTACK)

COMBINE ➔ posterior probability of ATTACK

are all of these uncorrelated?

Q: What are the score and combine functions for Naïve Bayes?

Scoring Our Possibilities

score(doc, ATTACK) =
  score1(fatally shot, ATTACK)
  score2(seriously wounded, ATTACK)
  score3(Shining Path, ATTACK)
  …

(for the example document above)

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/

https://goo.gl/BQCdH9

Lesson 1

SNAP(score(doc, ATTACK))

Maxent Modeling

p(ATTACK | doc) ∝ SNAP(score(doc, ATTACK))

What function…
operates on any real number?
is never less than 0?

f(x) = exp(x)

exp(score(doc, ATTACK))

Maxent Modeling

p(ATTACK | doc) ∝ exp( score1(fatally shot, ATTACK)
                     + score2(seriously wounded, ATTACK)
                     + score3(Shining Path, ATTACK)
                     + … )

Maxent Modeling

Learn the scores (but we'll declare what combinations should be looked at)

p(ATTACK | doc) ∝ exp( weight1 * occurs1(fatally shot, ATTACK)
                     + weight2 * occurs2(seriously wounded, ATTACK)
                     + weight3 * occurs3(Shining Path, ATTACK)
                     + … )

Maxent Modeling: Feature Functions

Feature functions help extract useful features (characteristics) of the data

Generally templated

Often binary-valued (0 or 1), but can be real-valued

occurs_{target,type}(fatally shot, ATTACK) =
  1, if target == fatally shot and type == ATTACK
  0, otherwise
(binary)

More on Feature Functions

Feature functions help extract useful features (characteristics) of the data; generally templated; often binary-valued (0 or 1), but can be real-valued

Templated binary:
occurs_{target,type}(fatally shot, ATTACK) = 1 if target == fatally shot and type == ATTACK, else 0

Templated real-valued:
occurs_{target,type}(fatally shot, ATTACK) = log p(fatally shot | ATTACK) + log p(type | ATTACK) + log p(ATTACK | type)

Non-templated real-valued:
occurs(fatally shot, ATTACK) = log p(fatally shot | ATTACK)

Non-templated count-valued:
occurs(fatally shot, ATTACK) = count(fatally shot, ATTACK)
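A small Python sketch of a templated binary feature and a weighted score (my own illustration; the phrases, labels, and weights are hypothetical):

def make_binary_feature(target, label):
    """Template: fires (returns 1) only when both the phrase and the label match."""
    def occurs(phrase, candidate_label):
        return 1.0 if phrase == target and candidate_label == label else 0.0
    return occurs

features = [                                               # (feature function, weight)
    (make_binary_feature("fatally shot", "ATTACK"), 1.8),
    (make_binary_feature("seriously wounded", "ATTACK"), 1.1),
    (make_binary_feature("Shining Path", "ATTACK"), 2.3),
]

def score(phrases, candidate_label):
    # linear score: sum of weight * feature value over all phrase/feature pairs
    return sum(w * f(p, candidate_label) for p in phrases for f, w in features)

print(score(["fatally shot", "Shining Path"], "ATTACK"))   # 1.8 + 2.3 = 4.1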

Three people have been fatally shot, and five people, including a mayor, were seriously wounded as a result of a Shining Path attack today against a community in Junindepartment, central Peruvian mountain region .

p( | ) =ATTACK

exp( ))weight1 * applies1(fatally shot, ATTACK)

weight2 * applies2(seriously wounded, ATTACK)

weight3 * applies3(Shining Path, ATTACK)…

Maxent Modeling

1

Z

Q: How do we define Z?

exp( )…

Σlabel x

Z =Normalization for Classification

𝑝 𝑥 𝑦) ∝ exp(𝜃 ⋅ 𝑓 𝑥, 𝑦 ) classify doc y with label x in one go

weight1 * occurs1(fatally shot, ATTACK)

weight2 * occurs2(seriously wounded, ATTACK)

weight3 * occurs3(Shining Path, ATTACK)
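A compact Python sketch of this normalized classifier (my own illustration; the feature functions, weights, and label set are hypothetical):

import math

def maxent_probs(doc_phrases, labels, features):
    """features: list of (feature_fn, weight); feature_fn(phrase, label) -> float.
    Returns p(label | doc) = exp(theta . f) / Z over the given label set."""
    scores = {x: sum(w * f(p, x) for p in doc_phrases for f, w in features)
              for x in labels}
    Z = sum(math.exp(s) for s in scores.values())       # the normalizer
    return {x: math.exp(s) / Z for x, s in scores.items()}

# hypothetical binary features that fire on exact (phrase, label) matches
features = [
    (lambda p, x: 1.0 if (p, x) == ("fatally shot", "ATTACK") else 0.0, 1.8),
    (lambda p, x: 1.0 if (p, x) == ("Shining Path", "ATTACK") else 0.0, 2.3),
]
print(maxent_probs(["fatally shot", "Shining Path"], ["ATTACK", "OTHER"], features))
# ATTACK gets exp(4.1) / (exp(4.1) + exp(0)) ≈ 0.98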

Normalization for Language Model

a general class-based (X) language model of doc y

Computing Z can be significantly harder in the general case

Simplifying assumption: maxent n-grams!

Understanding Conditioning

Is this a good language model? (no)

Is this a good posterior classifier? (no)

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/

https://goo.gl/BQCdH9

Lesson 11

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

pθ(x | y): probabilistic model

objective (given observations)

Objective = Full Likelihood?

L(θ) = ∏_i pθ(xi | yi)

These values can have very small magnitude ➔ underflow

Differentiating this product could be a pain

Logarithms

(0, 1] ➔ (-∞, 0]

Products ➔ Sums

log(ab) = log(a) + log(b)

log(a/b) = log(a) – log(b)

Inverse of exp

log(exp(x)) = x

Log-Likelihood

log L(θ) = Σ_i log pθ(xi | yi) = F(θ)

Wide range of (negative) numbers; sums are more stable

Products ➔ Sums

log(ab) = log(a) + log(b)
log(a/b) = log(a) – log(b)

Inverse of exp: log(exp(x)) = x

Differentiating this becomes nicer (even though Z depends on θ)
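A minimal sketch (my own, not from the slides) of computing this log-likelihood for a maxent classifier, F(θ) = Σ_i [θ ⋅ f(xi, yi) - log Z(yi)], using the log-sum-exp trick for stability:

import math

def log_sum_exp(vals):
    m = max(vals)                      # subtract the max for numerical stability
    return m + math.log(sum(math.exp(v - m) for v in vals))

def log_likelihood(theta, data, labels, feat):
    """data: list of (x_i, y_i) = (gold label, observation); feat(x, y) -> list of floats."""
    total = 0.0
    for x_i, y_i in data:
        score = lambda x: sum(t * f for t, f in zip(theta, feat(x, y_i)))
        # log p_theta(x_i | y_i) = theta.f(x_i, y_i) - log Z(y_i)
        total += score(x_i) - log_sum_exp([score(x) for x in labels])
    return total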

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

How will we optimize F(θ)?

Calculus

[Plot: F(θ) as a function of θ, with its maximum at θ*, where the derivative F'(θ) is zero]

Example

F(x) = -(x-2)²

differentiate: F'(x) = -2x + 4

Solve F'(x) = 0 ➔ x = 2

Common Derivative Rules

[Plot: F(θ), its derivative F'(θ) with respect to θ, and the maximizer θ*]

What if you can't find the roots? Follow the derivative.

Set t = 0; pick a starting value θt
Until converged:
1. Get value yt = F(θt)
2. Get derivative gt = F'(θt)
3. Get scaling factor ρt
4. Set θt+1 = θt + ρt * gt
5. Set t += 1

[Plot: successive iterates θ0, θ1, θ2, θ3 with values y0, y1, y2, y3 and derivatives g0, g1, g2 climbing F(θ) toward θ*]
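A direct Python sketch of this loop (my own illustration), applied to the earlier example F(θ) = -(θ-2)², whose maximizer is θ* = 2:

def F(theta):
    return -(theta - 2) ** 2

def F_prime(theta):
    return -2 * theta + 4

theta = 0.0          # starting value theta_0
rho = 0.1            # fixed scaling factor (step size)
for t in range(100):
    g = F_prime(theta)          # derivative at the current point
    theta = theta + rho * g     # step uphill along the derivative
    if abs(g) < 1e-6:           # converged: derivative is (almost) zero
        break

print(theta, F(theta))          # ~2.0, ~0.0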

Gradient = Multi-variable derivative

K-dimensional input

K-dimensional output

Gradient Ascent

[Figure sequence illustrating gradient ascent steps]

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

Expectations

number of pieces of candy: 1, 2, 3, 4, 5, 6

With probabilities 1/6 each:
1/6 * 1 + 1/6 * 2 + 1/6 * 3 + 1/6 * 4 + 1/6 * 5 + 1/6 * 6 = 3.5

With probabilities 1/2, 1/10, 1/10, 1/10, 1/10, 1/10:
1/2 * 1 + 1/10 * 2 + 1/10 * 3 + 1/10 * 4 + 1/10 * 5 + 1/10 * 6 = 2.5
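The same computation as a short Python sketch (my own illustration):

values = [1, 2, 3, 4, 5, 6]                      # number of pieces of candy
uniform = [1/6] * 6
skewed = [1/2, 1/10, 1/10, 1/10, 1/10, 1/10]

expectation = lambda probs: sum(p * v for p, v in zip(probs, values))
print(expectation(uniform))   # 3.5
print(expectation(skewed))    # 2.5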

Log-Likelihood

log L(θ) = Σ_i log pθ(xi | yi) = F(θ)

Wide range of (negative) numbers; sums are more stable

Differentiating this becomes nicer (even though Z depends on θ)

Log-Likelihood Gradient

Each component k is the difference between:

the total value of feature fk in the training data

and

the total value the current model pθ thinks it computes for feature fk

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/

https://goo.gl/BQCdH9

Lesson 6

Log-Likelihood Gradient Derivation

use the (calculus) chain rule:
∂/∂θ log g(h(θ)) = (∂ log g / ∂h) ⋅ (∂h / ∂θ)

When applied to the log-normalizer, the first factor works out to a scalar, p(x' | yi), and ∂h/∂θ is a vector of (feature) functions.

Log-Likelihood Gradient Derivation

Do we want these to fully match?

What does it mean if they do?

What if we have missing values in our data?

Gradient Optimization

Set t = 0; pick a starting value θt
Until converged:
1. Get value yt = F(θt)
2. Get derivative gt = F'(θt)
3. Get scaling factor ρt
4. Set θt+1 = θt + ρt * gt
5. Set t += 1

∂F/∂θk = Σ_i fk(xi, yi) − Σ_i Σ_{y'} fk(xi, y') p(y' | xi)

Do we want these to fully match?

What does it mean if they do?

What if we have missing values in our data?
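A small Python sketch of this gradient, written for the p(x | y) classification setting used earlier (label x, document y); the data format and helper names are my own illustration:

import math

def grad_log_likelihood(theta, data, labels, feat):
    """data: list of (x_i, y_i) = (gold label, document); feat(x, y) -> list of floats.
    Component k = observed value of f_k minus the model's expected value of f_k."""
    K = len(theta)
    grad = [0.0] * K
    for x_i, y_i in data:
        # p_theta(x' | y_i) for every candidate label x'
        scores = {x: sum(t * f for t, f in zip(theta, feat(x, y_i))) for x in labels}
        m = max(scores.values())
        Z = sum(math.exp(s - m) for s in scores.values())
        post = {x: math.exp(s - m) / Z for x, s in scores.items()}
        obs = feat(x_i, y_i)
        for k in range(K):
            grad[k] += obs[k]                                            # observed count
            grad[k] -= sum(post[x] * feat(x, y_i)[k] for x in labels)    # expected count
    return grad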

Preventing Extreme Values

Naïve Bayes: extreme values are 0 probabilities

Log-linear models: extreme values are large θ values ➔ regularization

(Squared) L2 Regularization: penalize large weights by subtracting a term proportional to ||θ||² = Σ_k θk² from the objective F(θ)
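A tiny sketch of what this adds to the gradient-based update (my own illustration; λ is a hypothetical regularization strength):

lam = 0.1   # hypothetical regularization strength (lambda)

def l2_penalty(theta):
    # subtracted from the objective: maximize F(theta) - lam * ||theta||^2
    return lam * sum(t * t for t in theta)

def l2_penalty_grad(theta):
    # so each gradient component becomes dF/dtheta_k - 2 * lam * theta_k
    return [2 * lam * t for t in theta]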

https://www.csee.umbc.edu/courses/undergraduate/473/f18/loglin-tutorial/

https://goo.gl/BQCdH9

Lesson 8

Outline

Recap: classification (MAP vs. noisy channel) & evaluation

Naïve Bayes (NB) classification: terminology (bag-of-words), the "naïve" assumption, training & performance, NB as a language model

Maximum Entropy classifiers: defining the model, defining the objective, learning (optimizing the objective), math (gradient derivation)

Neural (language) models

Revisiting the SNAP Function

softmax: softmax(z)i = exp(zi) / Σj exp(zj)
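A numerically stable version in Python (my own sketch, not from the slides):

import math

def softmax(z):
    m = max(z)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))   # ≈ [0.66, 0.24, 0.10]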

N-gram Language Models

predict the next word wi given some context wi-3, wi-2, wi-1 …

compute beliefs about what is likely…

p(wi | wi-3, wi-2, wi-1) ∝ count(wi-3, wi-2, wi-1, wi)

Maxent Language Models

predict the next word wi given some context wi-3, wi-2, wi-1 …

compute beliefs about what is likely…

p(wi | wi-3, wi-2, wi-1) ∝ softmax(θ ⋅ f(wi-3, wi-2, wi-1, wi))

Neural Language Models

predict the next word wi given some context wi-3, wi-2, wi-1 …

compute beliefs about what is likely…

p(wi | wi-3, wi-2, wi-1) ∝ softmax(θ ⋅ f(wi-3, wi-2, wi-1, wi))
can we learn the feature function(s)?

p(wi | wi-3, wi-2, wi-1) ∝ softmax(θwi ⋅ f(wi-3, wi-2, wi-1))
can we learn the feature function(s) for just the context?
can we learn word-specific weights (by type)?

create/use "distributed representations": embeddings ei-3, ei-2, ei-1 for the context words, and ew (the output weights θwi) per word type

combine these representations via a matrix-vector product with C to get f
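A rough Python/numpy sketch of a forward pass for this kind of model (my own simplification of the Bengio et al. style architecture; the dimensions, the tanh nonlinearity, and all variable names are illustrative assumptions):

import numpy as np

V, d, h, n = 10_000, 30, 100, 3          # vocab size, embedding dim, hidden dim, context length
rng = np.random.default_rng(0)

E = rng.normal(size=(V, d)) * 0.01       # input embeddings ("distributed representations")
C = rng.normal(size=(h, n * d)) * 0.01   # combines the context embeddings (matrix-vector product)
theta = rng.normal(size=(V, h)) * 0.01   # per-word output weights theta_{w_i}

def next_word_probs(context_ids):
    e = np.concatenate([E[i] for i in context_ids])   # [e_{i-3}; e_{i-2}; e_{i-1}]
    f = np.tanh(C @ e)                                 # learned feature function of the context
    scores = theta @ f                                 # theta_w . f for every word w in the vocab
    scores -= scores.max()                             # numerical stability
    p = np.exp(scores)
    return p / p.sum()                                 # softmax over the vocabulary

p = next_word_probs([17, 4211, 905])                   # hypothetical word ids for w_{i-3..i-1}
print(p.shape, p.sum())                                # (10000,) 1.0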

“A Neural Probabilistic Language Model,” Bengio et al. (2003)

Baselines:

LM Name               N-gram   Params.       Test Ppl.
Interpolation         3        ---           336
Kneser-Ney backoff    3        ---           323
Kneser-Ney backoff    5        ---           321
Class-based backoff   3        500 classes   312
Class-based backoff   5        500 classes   312

NPLM:

N-gram   Word Vector Dim.   Hidden Dim.   Mix with non-neural LM   Test Ppl.
5        60                 50            No                       268
5        60                 50            Yes                      257
5        30                 100           No                       276
5        30                 100           Yes                      252

“we were not able to see signs of over-fitting (on the validation set), possibly because we ran only 5 epochs (over 3 weeks using 40 CPUs)” (Sect. 4.2)